Introduction
Data sources
Most of the data in IDC comes from data collection initiatives and projects supported by the US National Cancer Institute. Whenever source images or image-derived data are not in DICOM format, they are harmonized into DICOM as part of ingestion.
As of data release v21, IDC sources of data include:
The Cancer Imaging Archive (TCIA) (ongoing)
all DICOM files from the public collections are mirrored in IDC
a subset of digital pathology collections and analysis results harmonized from vendor-specific representation (as available from TCIA) into DICOM Slide Microscopy (SM) format
Childhood Cancer Data Initiative (CCDI) (ongoing)
digital pathology slides harmonized into DICOM SM
The Cancer Genome Atlas (TCGA) slides harmonized into DICOM SM
Human Tumor Atlas Network (HTAN)
release 1 of the HTAN data harmonized into DICOM SM
National Library of Medicine Visible Human Project
v1 of the Visible Human images harmonized into DICOM MR/CT/XC
Genotype-Tissue Expression Project (GTEx)
digital pathology slides harmonized into DICOM SM
The full list of IDC collections is available in the IDC Portal: https://portal.imaging.datacommons.cancer.gov/collections/.
Data provenance
Whenever IDC replicates data from a publicly available source, we include the reference to the origin:
from the IDC Portal Explore page, click on the "i" icon next to the collection in the collections list

the source_doi metadata column contains the Digital Object Identifier (DOI) at the granularity of individual files and is available via both the Python idc-index package (see this tutorial on how to access it) and BigQuery interfaces
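As a rough illustration of how the DOI column can be used to trace data back to its origin, the sketch below groups series identifiers by DOI and builds a resolvable https://doi.org/ link for each. The sample rows and their DOI values are hypothetical placeholders, not real IDC records, and the sketch works at the series level for brevity. The helper load_rows_from_idc_index is an untested assumption about the idc-index package (that IDCClient().index is a pandas DataFrame with a SeriesInstanceUID column and a DOI column); check the package documentation before relying on it.

```python
from collections import defaultdict

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Turn a bare DOI string into a URL that resolves to its landing page."""
    return DOI_RESOLVER + doi.strip()


def series_by_doi(rows):
    """Group series identifiers by the DOI of the source they came from.

    `rows` is an iterable of dicts. The DOI column is matched
    case-insensitively, since its exact capitalization (source_doi vs.
    source_DOI) may differ between interfaces.
    """
    grouped = defaultdict(list)
    for row in rows:
        doi = next((v for k, v in row.items() if k.lower() == "source_doi"), None)
        if doi:  # skip rows with a missing or empty DOI
            grouped[doi].append(row["SeriesInstanceUID"])
    return dict(grouped)


def load_rows_from_idc_index():
    """ASSUMPTION: idc-index exposes its metadata table as a pandas DataFrame
    at IDCClient().index. Requires `pip install idc-index`; the table is
    large, so converting all of it to dicts may be slow."""
    from idc_index import IDCClient  # imported lazily; not needed for the demo

    return IDCClient().index.to_dict("records")


# Hypothetical placeholder rows; the UIDs and DOIs are made up.
SAMPLE_ROWS = [
    {"SeriesInstanceUID": "1.2.3.1", "source_DOI": "10.0000/example-a"},
    {"SeriesInstanceUID": "1.2.3.2", "source_DOI": "10.0000/example-a"},
    {"SeriesInstanceUID": "1.2.3.3", "source_DOI": "10.0000/example-b"},
]

if __name__ == "__main__":
    for doi, series in series_by_doi(SAMPLE_ROWS).items():
        print(doi_to_url(doi), len(series), "series")
```

To try this against real metadata, pass load_rows_from_idc_index() to series_by_doi in place of SAMPLE_ROWS, provided the assumptions above hold for the installed package version.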
See the Data release notes for information about the collections added in each IDC data release.
Data ingestion process
A simplified workflow for IDC data ingestion is summarized in the following diagram.