IDC User Guide
Introduction


Last updated 12 days ago


Data sources

Most of the data in IDC comes from data collection initiatives and projects supported by the US National Cancer Institute. Whenever source images or image-derived data are not in DICOM format, they are harmonized into DICOM as part of ingestion.

As of data release v21, IDC data sources include:

  • The Cancer Imaging Archive (TCIA) (ongoing)
    • all DICOM files from the public collections are mirrored in IDC
    • a subset of digital pathology collections and analysis results harmonized from vendor-specific representations (as available from TCIA) into DICOM Slide Microscopy (SM) format

  • Childhood Cancer Data Initiative (CCDI) (ongoing)
    • digital pathology slides harmonized into DICOM SM

  • Genomic Data Commons (GDC)
    • The Cancer Genome Atlas (TCGA) slides harmonized into DICOM SM

  • Human Tumor Atlas Network (HTAN)
    • release 1 of the HTAN data harmonized into DICOM SM

  • National Library of Medicine Visible Human Project
    • v1 of the Visible Human images harmonized into DICOM MR/CT/XC

  • Genotype-Tissue Expression Project (GTEx)
    • digital pathology slides harmonized into DICOM SM

The list of all IDC collections is available in the IDC Portal: https://portal.imaging.datacommons.cancer.gov/collections/

See the Data release notes for information about the collections added in individual IDC data releases.

Data provenance

Whenever IDC replicates data from a publicly available source, we include a reference to the origin:

  • from the IDC Portal Explore page, click the "i" icon next to the collection in the collections list

  • the source_doi metadata column contains the Digital Object Identifier (DOI) at the granularity of individual files and is available via both the python idc-index package (see this tutorial on how to access it) and the BigQuery interfaces

Whenever source data is harmonized into DICOM, the DOI corresponds to a Zenodo entry for the result of harmonization, which in turn references the location where the data can be accessed in its native format (if available). For example, the IDC NLM-Visible-Human-Project collection refers to the DOI https://doi.org/10.5281/zenodo.12690049, which describes the dataset produced by harmonizing the original dataset into DICOM. That entry in turn references the NLM Visible Human project page, which contains information on accessing the original files collected by the project.

Data ingestion process

A simplified workflow for IDC data ingestion is summarized in the following diagram.

[Figure: IDC data ingestion workflow]
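
Because the source_doi column carries provenance at the level of individual files, a common first step is to collapse it into one list of DOI links per collection. The following is a minimal sketch of that step using only the Python standard library. The rows are hypothetical stand-ins for file-level metadata (in practice they might come from the idc-index package or BigQuery, which is not shown here), the helper names doi_to_url and dois_by_collection are illustrative, the collection identifiers and the placeholder DOI are made up, and the column name follows the spelling used on this page, which may differ in capitalization in the actual index.

```python
from collections import defaultdict


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI (or an existing doi.org URL) into a resolver URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return "https://doi.org/" + doi


def dois_by_collection(rows):
    """Map each collection_id to the sorted, de-duplicated DOI URLs of its files."""
    grouped = defaultdict(set)
    for row in rows:
        doi = row.get("source_doi")
        if doi:  # skip rows where the DOI is missing
            grouped[row["collection_id"]].add(doi_to_url(doi))
    return {cid: sorted(urls) for cid, urls in grouped.items()}


# Hypothetical file-level rows. Only the Zenodo DOI is taken from this page;
# the collection identifiers and the second DOI are placeholders.
rows = [
    {"collection_id": "nlm_visible_human_project", "source_doi": "10.5281/zenodo.12690049"},
    {"collection_id": "nlm_visible_human_project", "source_doi": "10.5281/zenodo.12690049"},
    {"collection_id": "example_collection", "source_doi": "doi:10.0000/placeholder.1"},
    {"collection_id": "example_collection", "source_doi": None},
]

provenance = dois_by_collection(rows)
for collection_id, urls in sorted(provenance.items()):
    print(collection_id, urls)
```

Using a set per collection keeps the result small even when many files share the same DOI, which is the expected case when one DOI describes a whole harmonized dataset.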