Introduction

Data sources

Most of the data in IDC comes from data collection initiatives and projects supported by the US National Cancer Institute. Whenever source images or image-derived data are not in the DICOM format, they are harmonized into DICOM as part of ingestion.

As of data release v21, IDC sources of data include:

The Cancer Imaging Archive (TCIA) (ongoing)
- all DICOM files from the public collections are mirrored in IDC
- a subset of digital pathology collections and analysis results harmonized from vendor-specific representation (as available from TCIA) into DICOM Slide Microscopy (SM) format
Childhood Cancer Data Initiative (CCDI) (ongoing)
- digital pathology slides harmonized into DICOM SM
Genomic Data Commons (GDC)
- The Cancer Genome Atlas (TCGA) slides harmonized into DICOM SM
Human Tumor Atlas Network (HTAN)
- release 1 of the HTAN data harmonized into DICOM SM
National Library of Medicine Visible Human Project
- v1 of the Visible Human images harmonized into DICOM MR/CT/XC
Genotype-Tissue Expression Project (GTEx)
- digital pathology slides harmonized into DICOM SM

The full list of IDC collections is available in the IDC Portal: https://portal.imaging.datacommons.cancer.gov/collections/

Data provenance

Whenever IDC replicates data from a publicly available source, we include a reference to its origin:

- from the IDC Portal Explore page, click on the "i" icon next to the collection in the collections list
- the source_doi metadata column contains the Digital Object Identifier (DOI) at the granularity of individual files, and is available via both the Python idc-index package (see this tutorial on how to access it) and the BigQuery interfaces
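
Because the DOI is recorded per file, a collection can be associated with more than one DOI. The sketch below shows one way to summarize the distinct DOIs per collection from file-level rows. The helper names are hypothetical, and the idc-index usage in `rows_from_idc_index` rests on assumptions that should be checked against the package documentation: that `IDCClient().index` is a pandas DataFrame, and that the columns are named `collection_id` and `source_DOI` (the capitalization may differ in your version).

```python
from collections import defaultdict


def dois_by_collection(rows, collection_key="collection_id", doi_key="source_DOI"):
    """Map each collection to the sorted list of distinct DOIs found in its rows.

    `rows` is any iterable of dict-like records, one per file or series.
    The default key names are assumptions about the index column names.
    """
    grouped = defaultdict(set)
    for row in rows:
        doi = row.get(doi_key)
        # Skip missing values (None, NaN, empty string).
        if isinstance(doi, str) and doi:
            grouped[row[collection_key]].add(doi)
    return {collection: sorted(dois) for collection, dois in grouped.items()}


def rows_from_idc_index():
    """Hypothetical loader; requires `pip install idc-index` and network access."""
    # Imported lazily so that dois_by_collection has no third-party dependency.
    from idc_index import IDCClient

    client = IDCClient()
    # Assumption: client.index is a pandas DataFrame containing these columns.
    return client.index[["collection_id", "source_DOI"]].to_dict("records")
```

With idc-index installed, `dois_by_collection(rows_from_idc_index())` would return a dictionary keyed by collection identifier. The grouping helper itself works on any list of dictionaries, so it can also be applied to rows exported from BigQuery.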

Whenever source data is harmonized into DICOM, the DOI corresponds to a Zenodo entry for the result of harmonization, which in turn references the location where the data can be accessed in its native format (if available). For example, the IDC NLM-Visible-Human-Project collection refers to https://doi.org/10.5281/zenodo.12690049, a DOI that describes the dataset resulting from harmonizing the original data into DICOM. That Zenodo entry in turn references the NLM Visible Human Project page, which contains information on accessing the original files collected by the project.
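
As a small illustration of working with these identifiers, the sketch below turns a DOI into a resolvable URL and checks whether it was issued by Zenodo, whose DOIs start with the prefix `10.5281/`. The function names are hypothetical. A Zenodo prefix only suggests that the DOI may point to a harmonization result; it is not a guarantee, so the Zenodo entry itself remains the authoritative description.

```python
DOI_RESOLVER = "https://doi.org/"
ZENODO_DOI_PREFIX = "10.5281/"  # DOI prefix used by Zenodo


def normalize_doi(doi):
    """Strip whitespace and any resolver or 'doi:' prefix, leaving the bare DOI."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            return doi[len(prefix):]
    return doi


def doi_to_url(doi):
    """Return the URL that resolves the given DOI."""
    return DOI_RESOLVER + normalize_doi(doi)


def is_zenodo_doi(doi):
    """Return True if the DOI carries the Zenodo prefix."""
    return normalize_doi(doi).startswith(ZENODO_DOI_PREFIX)
```

For the example above, `doi_to_url("10.5281/zenodo.12690049")` yields the same https://doi.org/10.5281/zenodo.12690049 link quoted in the text, and `is_zenodo_doi` returns `True` for it.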

See the Data release notes for information about the collections added in individual IDC data releases.

Data ingestion process

A simplified workflow for IDC data ingestion is summarized in the following diagram.
