Data sources

Most of the data in IDC comes from data collection initiatives and projects supported by the US National Cancer Institute. Whenever source images or image-derived data are not in the DICOM format, they are harmonized into DICOM as part of ingestion.

As of data release v21, IDC sources of data include:

The list of all IDC collections is available in the IDC Portal: https://portal.imaging.datacommons.cancer.gov/collections/.

Data provenance

Whenever IDC replicates data from a publicly available source, we include a reference to the origin. You can find it in two ways:

  • from the IDC Portal Explore page, click on the "i" icon next to the collection in the collections list

  • the source_doi metadata column contains the Digital Object Identifier (DOI) at the granularity of individual files; it is available via both the Python idc-index package (see this tutorial on how to access it) and the BigQuery interfaces

Whenever source data is harmonized into DICOM, the DOI corresponds to a Zenodo entry for the harmonized result, which in turn references the location where the data can be accessed in its native format (if available). For example, the IDC NLM-Visible-Human-Project collection refers to the DOI https://doi.org/10.5281/zenodo.12690049, which describes the dataset produced by harmonizing the original data into DICOM. That Zenodo entry in turn references the NLM Visible Human project page, which explains how to access the original files collected by the project.
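Because the DOI is recorded per file, it can be useful to collapse it into one list of origins per collection. The following is a minimal sketch of that step, not the IDC implementation. It works on plain dictionaries rather than on a live query, and the key names (`collection_id`, `source_doi`), the helper name `doi_urls_by_collection`, and the sample rows are assumptions made for illustration. Check the exact column names and casing against the current idc-index or BigQuery schema.

```python
from collections import defaultdict

# Public DOI resolver prefix; a DOI appended to it yields a resolvable URL.
DOI_RESOLVER = "https://doi.org/"


def doi_urls_by_collection(rows, collection_key="collection_id", doi_key="source_doi"):
    """Map each collection to the sorted resolver URLs of its distinct DOIs.

    `rows` is any iterable of dict-like records. The key names are
    assumptions and can be overridden to match the actual schema.
    """
    dois = defaultdict(set)
    for row in rows:
        doi = row.get(doi_key)
        if doi:  # skip records that carry no DOI
            dois[row[collection_key]].add(doi.strip())
    return {
        collection: sorted(DOI_RESOLVER + doi for doi in values)
        for collection, values in dois.items()
    }


# Hypothetical file-level records; only the DOI value is taken from this page.
sample_rows = [
    {"collection_id": "nlm_visible_human_project", "source_doi": "10.5281/zenodo.12690049"},
    {"collection_id": "nlm_visible_human_project", "source_doi": "10.5281/zenodo.12690049"},
    {"collection_id": "example_collection", "source_doi": None},
]

print(doi_urls_by_collection(sample_rows))
```

The helper de-duplicates with a set because many files in a collection typically share one DOI. In practice, the records could presumably be obtained from the metadata table exposed by idc-index or from a BigQuery result. That step is left out here because it depends on the installed package version and network access.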

See the Data release notes for information about the collections added in individual IDC data releases.

Data ingestion process

A simplified workflow for IDC data ingestion is summarized in the following diagram.

IDC data ingestion workflow
