Introduction
Most of the data in IDC comes from data collection initiatives and projects supported by the US National Cancer Institute. Whenever source images or image-derived data are not in the DICOM format, they are harmonized into DICOM as part of ingestion.
As of data release v21, IDC sources of data include:
all DICOM files from the public collections are mirrored in IDC
a subset of digital pathology collections and analysis results harmonized from vendor-specific representation (as available from TCIA) into DICOM Slide Microscopy (SM) format
digital pathology slides harmonized into DICOM SM
The Cancer Genome Atlas (TCGA) slides harmonized into DICOM SM
release 1 of the HTAN data harmonized into DICOM SM
v1 of the Visible Human images harmonized into DICOM MR/CT/XC
digital pathology slides harmonized into DICOM SM
Whenever IDC replicates data from a publicly available source, we include the reference to the origin:
from the IDC Portal Explore page, click on the "i" icon next to the collection in the collections list
A simplified workflow for IDC data ingestion is summarized in the following diagram.
The list of all IDC collections is available in the IDC Portal.
source_doi metadata column contains the Digital Object Identifier (DOI) at the granularity of individual files and is available both via (see on how to access it) and BigQuery interfaces
Whenever source data is harmonized into DICOM, the DOI corresponds to a Zenodo entry for the result of harmonization, which in turn references the location where the data can be accessed in its native format (if available). As an example, the IDC NLM-Visible-Human-Project collection refers to this DOI that describes the dataset resulting from the original dataset harmonized into DICOM, which in turn references the containing information on accessing the original files collected by the project.
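Because source_doi is recorded per file, finding the origin of a collection amounts to collecting the distinct DOI values for its files. The sketch below illustrates that step. It makes several assumptions that are not stated on this page: the table name bigquery-public-data.idc_current.dicom_all, the collection_id column name, and the helper function names are all illustrative and should be checked against the current IDC release before use.

```python
from collections import defaultdict

# ASSUMPTION: public BigQuery table holding file-level IDC metadata.
# Verify the table and column names against the current IDC release.
IDC_TABLE = "bigquery-public-data.idc_current.dicom_all"


def build_source_doi_query(table: str = IDC_TABLE) -> str:
    """Return SQL that lists the distinct DOIs attached to one collection.

    The collection is passed as the named query parameter @collection_id,
    so the value is never interpolated into the SQL text.
    """
    return (
        "SELECT DISTINCT collection_id, source_doi\n"
        f"FROM `{table}`\n"
        "WHERE collection_id = @collection_id\n"
        "ORDER BY source_doi"
    )


def dois_by_collection(rows):
    """Group file-level rows into {collection_id: sorted unique DOIs}."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["collection_id"]].add(row["source_doi"])
    return {cid: sorted(dois) for cid, dois in grouped.items()}


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a resolvable https://doi.org/ link."""
    return f"https://doi.org/{doi.strip()}"


if __name__ == "__main__":
    # Placeholder rows standing in for query results; the collection names
    # and DOIs are hypothetical and do not refer to real IDC records.
    sample_rows = [
        {"collection_id": "example_collection", "source_doi": "10.0000/example-b"},
        {"collection_id": "example_collection", "source_doi": "10.0000/example-a"},
        {"collection_id": "example_collection", "source_doi": "10.0000/example-a"},
    ]
    print(build_source_doi_query())
    for cid, dois in dois_by_collection(sample_rows).items():
        for doi in dois:
            print(cid, doi_to_url(doi))
```

A collection can carry more than one DOI, for example when original images and harmonized analysis results were published separately, which is why the grouping keeps a set of values per collection instead of a single one.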
Check out for information about the collections added in the individual IDC data releases.