>85 TB of data: IDC contains radiology, brightfield (H&E) and fluorescence slide microscopy images, along with image-derived data (annotations, segmentations, quantitative measurements) and accompanying clinical data
free: all of the data in IDC is publicly available: no registration, no access requests
commercial-friendly: >95% of the data in IDC is covered by the permissive CC-BY license, which allows commercial reuse (a small subset of the data is covered by the CC-NC license); each file in IDC is tagged with its license to make it easier for you to understand and follow the rules
cloud-based: all of the data in IDC is available from both Google and AWS public buckets: fast and free to download, no out-of-cloud egress fees
harmonized: all of the images and image-derived data in IDC is harmonized into standard DICOM representation
IDC is as much about data as it is about what you can do with the data! We maintain and actively develop a variety of tools that are designed to help you efficiently navigate, access and analyze IDC data:
visualization: examine images and image-derived annotations and analysis results from the convenience of your browser using integrated OHIF, VolView and Slim open source viewers
cohort building: use rich and extensive metadata to build subsets of data programmatically using idc-index or BigQuery SQL
download: use your favorite S3 API client or idc-index to efficiently fetch any of the IDC files from our public buckets
If you need support with IDC or have any questions, please open a new topic on the IDC Discourse forum (preferred) or send email to support@canceridc.dev.
Would you rather discuss your questions in a meeting with an expert from the IDC team? Book a 1-on-1 support session here:
Imaging Data Commons (IDC) is a cloud-based environment containing publicly available cancer imaging data co-located with analysis and exploration tools. IDC is a node within the broader NCI Cancer Research Data Commons (CRDC) infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data.
exploration: start with the IDC Portal to get an idea of the data available
programmatic access: use idc-index to perform search, download and other operations programmatically
analysis: conveniently access IDC files and metadata from tools that are cloud-native, such as Google Colab or Looker; fetch IDC data directly into 3D Slicer using SlicerIDCBrowser
We ingest and distribute datasets from a variety of sources and contributors, primarily focusing on large data collection initiatives sponsored by the US National Cancer Institute.
On ingestion, whenever data is represented in a non-DICOM format, we harmonize images and image-derived data into DICOM for interoperability.
Upon conversion, the data undergoes Extract-Transform-Load (ETL), which extracts DICOM metadata to make the data searchable, ingests the DICOM files into public S3 storage buckets and a DICOMweb store. Once the data is released, we provide various interfaces to access data and metadata.
We are actively developing a variety of capabilities to make it easier for users to work with the data in IDC; examples of those tools are described below.
We welcome you to apply to contribute analysis results and annotations of the images available in IDC! These can be expert manual annotations, analysis results generated using AI tools, segmentations, contours, metadata attributes describing the data (e.g., annotation of the scan type), or expert evaluation of the quality of existing AI-generated annotations in IDC.
If your contribution is accepted by the IDC stakeholders:
we will work with you to choose the appropriate DICOM object type for your data and convert it into DICOM representation
once published in IDC, your data will become searchable and viewable in the IDC Portal, making it easier for the users of your data to discover and work with it
your files can be downloaded efficiently using the S3 interface and idc-index
At this time, we do not have resources to prioritize receipt of imaging data from individual PIs (but we encourage submissions of annotations/analysis results for existing IDC data!). Nevertheless, if you feel you might have a compelling dataset, please email us at support@canceridc.dev.
IDC Portal provides an interactive browser-based interface for exploration of IDC data
we are the maintainers of Slim - an open-source viewer of DICOM digital pathology images; Slim is integrated with the IDC Portal for visualizing pathology images and image-derived data available in IDC
we are actively contributing to the OHIF Viewer, and rely on it for visualizing radiology images and image-derived data
idc-index is a python package that provides convenience functions for accessing IDC data, including efficient download from IDC public S3 buckets
3D Slicer extensions, such as SlicerIDCBrowser, can be used for interactive download of IDC data
we are contributing to a variety of tools that aim to simplify the use of DICOM in cancer imaging research; these include libraries for conversion between the DICOM Whole Slide Imaging (WSI) format and other slide microscopy formats, and the highdicom library for converting image analysis results to and from DICOM representation
If you would like your annotations/analysis results to be considered, you must establish the value of your contribution (e.g., describe the qualifications of the experts performing manual annotations, or demonstrate the robustness of the AI tool you are applying to images with a peer-reviewed publication or other type of evidence), and be willing to share your contribution under a permissive Creative Commons Attribution (CC BY) license.
See more details on our curation policy, and reach out by sending email to support@canceridc.dev with any questions or inquiries. Every application will be reviewed by IDC stakeholders.
upon conversion, we will create a Zenodo entry under the IDC Zenodo community for your contribution, so that you get a Digital Object Identifier (DOI), citation and recognition of your contribution
IDC is a component of the broader NCI Cancer Research Data Commons (CRDC), giving you access to the following:
CRDC search tools can be used to find data related to the images in IDC in other CRDC data repositories
Broad Terra and the Seven Bridges Cancer Genomics Cloud (SB-CGC) can be used to apply analysis tools to the data in IDC (you can read more about how this can be done in publications from the IDC team)
the MHub.ai platform curates a growing number of cancer imaging AI models that can be applied directly to the DICOM data available in IDC
If you are an NIH-funded investigator, you can join the NIH STRIDES Initiative, which offers significant discounts on the use of cloud resources, and free training courses and materials on the use of the cloud.
Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023).
Weiss, J., Bernatz, S., Johnson, J., Thiriveedhi, V., Mak, R. H., Fedorov, A., Lu, M. T. & Aerts, H. J. W. Opportunistic assessment of steatotic liver disease in lung cancer screening eligible individuals. J. Intern. Med. (2025).
Thiriveedhi, V. K., Krishnaswamy, D., Clunie, D., Pieper, S., Kikinis, R. & Fedorov, A. Cloud-based large-scale curation of medical imaging data using AI segmentation. Research Square (2024).
Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S., Aerts, H. J. W. L., Homeyer, A., Lewis, R., Akbarzadeh, A., Bontempi, D., Clifford, W., Herrmann, M. D., Höfener, H., Octaviano, I., Osborne, C., Paquette, S., Petts, J., Punzo, D., Reyes, M., Schacherer, D. P., Tian, M., White, G., Ziegler, E., Shmulevich, I., Pihl, T., Wagner, U., Farahani, K. & Kikinis, R. NCI Imaging Data Commons. Cancer Res. 81, 4188–4193 (2021).
Gorman, C., Punzo, D., Octaviano, I., Pieper, S., Longabaugh, W. J. R., Clunie, D. A., Kikinis, R., Fedorov, A. Y. & Herrmann, M. D. Interoperable slide microscopy viewer and annotation tool for imaging data science and computational pathology. Nat. Commun. 14, 1–15 (2023).
Bridge, C. P., Gorman, C., Pieper, S., Doyle, S. W., Lennerz, J. K., Kalpathy-Cramer, J., Clunie, D. A., Fedorov, A. Y. & Herrmann, M. D. Highdicom: a Python Library for Standardized Encoding of Image Annotations and Machine Learning Model Outputs in Pathology and Radiology. J. Digit. Imaging 35, 1719–1737 (2022).
Schacherer, D. P., Herrmann, M. D., Clunie, D. A., Höfener, H., Clifford, W., Longabaugh, W. J. R., Pieper, S., Kikinis, R., Fedorov, A. & Homeyer, A. The NCI Imaging Data Commons as a platform for reproducible research in computational pathology. Comput. Methods Programs Biomed. 107839 (2023).
Krishnaswamy, D., Bontempi, D., Thiriveedhi, V., Punzo, D., Clunie, D., Bridge, C. P., Aerts, H. J., Kikinis, R. & Fedorov, A. Enrichment of the NLST and NSCLC-Radiomics computed tomography collections with AI-derived annotations. arXiv [cs.CV] (2023).
Bontempi, D., Nuernberg, L., Pai, S., Krishnaswamy, D., Thiriveedhi, V., Hosny, A., Mak, R. H., Farahani, K., Kikinis, R., Fedorov, A. & Aerts, H. J. W. L. End-to-end reproducible AI pipelines in radiology using the cloud. Nat. Commun. 15, 6931 (2024).
Krishnaswamy, D., Bontempi, D., Thiriveedhi, V. K., Punzo, D., Clunie, D., Bridge, C. P., Aerts, H. J. W. L., Kikinis, R. & Fedorov, A. Enrichment of lung cancer computed tomography collections with AI-derived annotations. Sci. Data 11, 1–15 (2024).
Murugesan, G. K., McCrumb, D., Aboian, M., Verma, T., Soni, R., Memon, F., Farahani, K., Pei, L., Wagner, U., Fedorov, A. Y., Clunie, D., Moore, S. & Van Oss, J. The AIMI Initiative: AI-Generated Annotations for Imaging Data Commons Collections. arXiv [eess.IV] (2023).
See the full list, as curated by Google Scholar.
Pai, S., Bontempi, D., Hadzic, I., Prudente, V., Sokač, M., Chaunzwa, T. L., Bernatz, S., Hosny, A., Mak, R. H., Birkbak, N. J. & Aerts, H. J. W. L. Foundation model for cancer imaging biomarkers. Nature Machine Intelligence 6, 354–367 (2024).
Murugesan, G. K., McCrumb, D., Aboian, M., Verma, T., Soni, R., Memon, F. & Van Oss, J. The AIMI initiative: AI-generated annotations for imaging data commons collections. arXiv [eess.IV] (2023).
Kulkarni, P., Kanhere, A., Yi, P. H. & Parekh, V. S. Text2Cohort: Democratizing the NCI Imaging Data Commons with natural language cohort discovery. arXiv [cs.LG] (2023).
Jiang, P., Sinha, S., Aldape, K., Hannenhalli, S., Sahinalp, C. & Ruppin, E. Big data in basic and translational cancer research. Nat. Rev. Cancer 22, 625–639 (2022).
Schapiro, D., Yapp, C., Sokolov, A., Reynolds, S. M., Chen, Y.-A., Sudar, D., Xie, Y., Muhlich, J., Arias-Camison, R., Arena, S., Taylor, A. J., Nikolov, M., Tyler, M., Lin, J.-R., Burlingame, E. A., Human Tumor Atlas Network, Chang, Y. H., Farhi, S. L., Thorsson, V., Venkatamohan, N., Drewes, J. L., Pe’er, D., Gutman, D. A., Herrmann, M. D., Gehlenborg, N., Bankhead, P., Roland, J. T., Herndon, J. M., Snyder, M. P., Angelo, M., Nolan, G., Swedlow, J. R., Schultz, N., Merrick, D. T., Mazzili, S. A., Cerami, E., Rodig, S. J., Santagata, S. & Sorger, P. K. MITI minimum information guidelines for highly multiplexed tissue images. Nat. Methods 19, 262–267 (2022).
Wahid, K. A., Glerean, E., Sahlsten, J., Jaskari, J., Kaski, K., Naser, M. A., He, R., Mohamed, A. S. R. & Fuller, C. D. Artificial intelligence for radiation oncology applications using public datasets. Semin. Radiat. Oncol. 32, 400–414 (2022).
Hartley, M., Kleywegt, G. J., Patwardhan, A., Sarkans, U., Swedlow, J. R. & Brazma, A. The BioImage Archive - Building a Home for Life-Sciences Microscopy Data. J. Mol. Biol. 167505 (2022). doi:10.1016/j.jmb.2022.167505
Diaz-Pinto, A., Alle, S., Nath, V., Tang, Y., Ihsani, A., Asad, M., Pérez-García, F., Mehta, P., Li, W., Flores, M., Roth, H. R., Vercauteren, T., Xu, D., Dogra, P., Ourselin, S., Feng, A. & Cardoso, M. J. MONAI Label: A framework for AI-assisted interactive labeling of 3D medical images. arXiv [cs.HC] (2022).
We want Imaging Data Commons to be your companion in your cancer imaging research activities - from discovering relevant data to sharing your analysis results and showcasing the tools you developed!
IDC Portal is integrated with powerful visualization tools: using just your web browser, you can view IDC images and annotations in the OHIF Viewer, Slim viewer and VolView!
We have many tools to help you search data in IDC, so that you download only what you need!
once you have the idc-index python package installed, download from the command line is as easy as running idc download <manifest_file> or idc download <collection_id>.
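If you prefer Python over the command line, the same operations are available through the idc-index API. Below is a minimal sketch (method and argument names follow the idc-index documentation at the time of writing and may evolve between releases; the collection ID is just an example):

```python
from idc_index import index

# Instantiate the client; this loads the IDC metadata index that
# ships with the idc-index package.
client = index.IDCClient()

# Report which IDC data release the installed index corresponds to.
print("IDC version:", client.get_idc_version())

# Download an entire collection (the Python equivalent of
# `idc download <collection_id>`) into a local directory.
client.download_from_selection(
    collection_id="rider_pilot",  # example collection ID
    downloadDir="./idc_data",
)
```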
We want to make it easier to understand the performance of the latest advances in AI on real-world cancer imaging data!
With the cloud, you can do things that are simply impossible to do with your local resources.
If you have an algorithm that you evaluated/published that can enrich data in IDC with analysis results and you want to contribute those, or if you are a domain expert and would like to publish the results of manual annotations you prepared - we want to hear from you!
through a dedicated Zenodo record you will have a citation and DOI to get credit for your work; your data is ingested from Zenodo into IDC, and a citation will be generated for the users of your data in IDC
Check out the documentation on how to access and use the IDC Portal - the web application that will help you search, subset and visualize data available in IDC.
you can do basic filtering/subsetting of the data using the IDC Portal, but if you are a developer, you will want to learn how to use idc-index for programmatic access. A dedicated tutorial will introduce you to the basics of idc-index for interaction with IDC content.
search clinical data: many of the IDC collections are accompanied by clinical data, which we parsed for you into a searchable tabular representation - no need to download or parse CSV/Excel/PDF files! Dive into searching clinical data using the dedicated tutorial.
if advanced content does not scare you, check out the BigQuery tutorial to learn how to search all of the metadata accompanying IDC using SQL and Google BigQuery.
We provide various tools for downloading data from IDC, as discussed in the documentation. Access to all data in IDC is free! No registration. No access request forms. No logins.
looking for an interactive "point-and-click" application? 3D Slicer with the SlicerIDCBrowser extension is for you (note that you will only be able to visualize radiology - not microscopy - images in 3D Slicer)
if you have a Google account, you have free access to Google Colab, which allows you to run python notebooks on cloud VMs equipped with GPU - for free! Combined with idc-index for data access, this makes it rather easy to experiment with the latest AI tools! As an example, take a look at the notebook that allows you to apply the MedSAM model to IDC data. You will find a growing number of notebooks to help you use IDC in our tutorials.
use IDC to develop HuggingFace spaces that demonstrate the power of your models on real data: see the space we developed for SegVol
a growing number of AI medical imaging models is being curated on the MHub.ai platform; see its documentation to learn how to apply those models to data from IDC
How about accompanying your next publication with a working demonstration notebook on relevant samples from IDC? You can see an example of how we did this.
read our paper to learn how we applied TotalSegmentator+pyradiomics to >126,000 CT scans of the NLST collection using the Terra platform, completing the analysis in ~8 hours at a total cost of ~$1000
the accompanying repository contains the code we used in the above (this is really advanced content!)
IDC maintains a Zenodo community where we curate contributions of analysis results and other datasets produced by IDC (see the community page for examples of such contributions)
once your data is in IDC, it should be easier to discover it, combine it with other datasets, visualize it and use it from analysis workflows (as an example, see the materials accompanying the RMS annotations)
email us at support@canceridc.dev to inquire about contributing your annotations/analysis results to IDC!
Imaging Data Commons is being developed by a team of engineers and imaging scientists with decades of experience in cancer imaging informatics, cloud computing, imaging standards, security, open source tool development and data sharing.
Our team includes the following sites and project leads:
Brigham and Women's Hospital, Boston, MA, USA (BWH)
Andrey Fedorov, PhD, and Ron Kikinis, MD - Co-PIs of the project
Hugo Aerts, PhD
Cosmin Ciausu, MS
Deepa Krishnaswamy, PhD
Katie Mastrogiacomo
Maria Loy
Institute for Systems Biology, Seattle, WA, USA (ISB)
David Gibbs, PhD - site PI
William Longabaugh, MS
William Clifford, MS
Suzanne Paquette, MS
George White
Ilya Shmulevich, PhD
General Dynamics Information Technology, Bethesda, MD, USA (GDIT)
David Pot, PhD - site PI
Poojitha Gundluru
Fabian Seidl
Prema Venkatesun
Anthony Le
Fraunhofer MEVIS, Bremen, Germany (Fraunhofer MEVIS)
André Homeyer, PhD - site PI
Daniela Schacherer, MS
Henning Höfener, PhD
Massachusetts General Hospital, Boston, MA, USA (MGH)
Chris Bridge, DPhil - site PI
Chris Gorman, PhD
Radical Imaging LLC, Boston, MA, USA (Radical Imaging)
Rob Lewis, PhD - site PI
Igor Octaviano
Pedro Kohler
PixelMed Publishing, Bangor, PA, USA (PixelMed)
David Clunie, MB, BS - site PI
Isomics Inc, Cambridge, MA, USA (Isomics)
Steve Pieper, PhD - site PI
Oversight:
Leidos Biomedical Research
Ulrike Wagner - project manager
Todd Pihl - project manager
National Cancer Institute
Erika Kim - federal lead
Granger Sutton - federal lead
We are grateful to the following individuals who contributed to IDC in the past, but are no longer directly involved in the development of IDC.
Keyvan Farahani (NCI)
Markus Herrmann (MGH)
Davide Punzo (Radical Imaging)
James Petts (Radical Imaging)
Erik Ziegler (Radical Imaging)
Gitanjali Chhetri (Radical Imaging)
Rodrigo Basilio (Radical Imaging)
Jose Ulloa (Radical Imaging)
Madelyn Reyes (GDIT)
Derrick Moore (GDIT)
Mark Backus (GDIT)
Rachana Manandhar (BWH)
Rasmus Kiehl (Fraunhofer MEVIS)
Chad Osborne (GDIT)
Afshin Akbarzadeh (BWH)
Dennis Bontempi (BWH)
Vamsi Thiriveedhi (BWH)
Jessica Cienda (GDIT)
Bernard Larbi (GDIT)
Mi Tian (ISB)
The Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
Check out the Downloading data documentation page!
Note that currently IDC prioritizes submissions from NCI-funded driving projects and data from specially selected projects.
If you would like to submit images, it will be your responsibility to de-identify them first, documenting the de-identification process and submitting that documentation for review by IDC stakeholders.
The IDC pilot release took place in Fall 2020, followed by the production release in September 2021.
Please cite the latest paper from the IDC team. Please also make sure you acknowledge the specific data collections you used in your analysis.
IDC and TCIA are partners in providing FAIR data for cancer imaging researchers. While some of the functions between the two resources are similar, there are also key differences. The table below provides a summary of similarities and differences.
| Function | IDC | TCIA |
| --- | --- | --- |
| De-identification | no, IDC can only host data already de-identified | yes |
| Cloud-based data co-located with compute resources | yes | no |
| Conversion of pathology images and image-derived data into DICOM format | yes | no |
| Private data collections | no | yes |
| Public data collections | yes | yes |
| Version control of the data | partial | |
IDC Portal gives you access to just a small subset of the metadata accompanying IDC images. If you want to learn more about what is available, you have several options:
Let's start with the overall principles of how we organize data in IDC.
IDC brings you (as of v18) over 60 TB of publicly available DICOM images and image-derived content. We share those with you as DICOM files, and those DICOM files are available in cloud-based storage buckets - both in Google and AWS.
Sharing just the files, however, is not particularly helpful. With that much data, it is no longer practical to just download all of those files to later sort through them to select those you need.
Think of IDC as a library, where each file is a book. With that many books, it is not feasible to read them all, or even open each one to understand what is inside. Libraries are of little use without a catalog!
To provide you with a catalog of our data, along with the files, we maintain metadata that makes it possible to understand what is contained within files, and select the files that are of interest for your project, so that you can download just the files you need. We make that metadata available in BigQuery tables searchable using standard SQL.
IDC utilizes BigQuery tables to organize metadata accompanying the files we host. If you have never worked with BigQuery before, you need to understand the basics of data organization in BQ.
BQ tables are organized in BQ datasets. BQ datasets are not unlike folders on your computer, but contain tables related to each other instead of files. BQ datasets, in turn, are organized under Google Cloud projects. GCP projects can be thought of as containers that are managed by a particular organization. To continue with the file system analogy, think about projects as hard drives that contain folders.
Let's map the aforementioned project-dataset-table hierarchy to the concrete locations that contain IDC data.
All of the IDC tables are stored under the bigquery-public-data project. That project is managed by the Google Public Datasets Program, and contains many public BQ datasets beyond those maintained by IDC.
All of the IDC tables are organized into datasets by data release version. If you complete the tutorial mentioned above, open the BQ console, and scroll down the list of datasets, you will find those that are named starting with the idc_v prefix - those are IDC datasets.
Following the prefix, you will find the number that corresponds to the IDC data release version. IDC data release version numbers start from 1 and are incremented by one for each subsequent release. As of writing this, the most recent version of IDC is 16, and you can find the dataset idc_v16 corresponding to this version.
Finally, you will also see two special datasets: idc_current and idc_current_clinical. Those two datasets are essentially aliases, or links, to the versioned datasets corresponding to the latest release of IDC data.
If you want to explore the latest content of IDC - use the current datasets.
If you want to make sure your queries and data selection are reproducible - always use the version numbered datasets!
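For example, here is a minimal sketch of querying a version-pinned dataset with the BigQuery Python client (it assumes the google-cloud-bigquery package is installed and that you have a Google Cloud project to bill queries to; my-gcp-project is a placeholder):

```python
from google.cloud import bigquery

# Queries are billed to your own project; the tables themselves live
# in the public bigquery-public-data project.
client = bigquery.Client(project="my-gcp-project")  # replace with your project ID

# Pinning the query to idc_v16 (rather than idc_current) keeps the
# result reproducible as new IDC versions are released.
query = """
SELECT collection_id, COUNT(DISTINCT SeriesInstanceUID) AS series_count
FROM `bigquery-public-data.idc_v16.dicom_all`
GROUP BY collection_id
ORDER BY series_count DESC
LIMIT 10
"""
for row in client.query(query).result():
    print(row.collection_id, row.series_count)
```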
BQ views can be very handy when you want to simplify your queries by factoring out the part of the query that is often reused. But a key disadvantage of BQ views over tables is reduced performance and increased cost, because the underlying query is re-run each time you query the view.
As we will discuss further, most of the tables maintained by IDC are created by joining and/or post-processing other tables. Because of this we rely heavily on BQ views to improve transparency of the provenance of those "derived" tables. BQ views can be easily distinguished from the tables in a given dataset by a different icon. IDC datasets also follow a convention that all views in the versioned datasets include the suffix _view in the name, and are accompanied by the result of running the query used by the view in a table that has the same name sans the _view suffix. See the figure below for an illustration of this convention.
Now that we have reviewed the main concepts behind IDC table organization, it is time to explain the sources of metadata contained in those tables. Leaving the _clinical datasets aside, IDC tables are populated from one of two sources:
DICOM metadata extracted from the DICOM files hosted by IDC, and various derivative tables that simplify access to specific DICOM metadata items;
collection-level and auxiliary metadata, which is not stored in DICOM tags, but is either received by IDC from other sources, or is populated by IDC as part of data curation (these include Digital Object Identifiers, description of the collections, hashsums, etc).
The set of BQ tables and views has grown over time. The enumeration below documents the BQ tables and views as of IDC v14. Some of these tables will not be found in earlier IDC BigQuery datasets.
dicom_metadata
Each row in the dicom_metadata table holds the DICOM metadata of an instance in the corresponding IDC version. There is a single row for each DICOM instance in the corresponding IDC version, and the columns correspond to the DICOM attributes encountered in the data across all of the ingested instances.
The dicom_metadata table contains the DICOM metadata extracted from the files included in the given IDC data release. The amount and variety of the DICOM files grows with the new releases, and the schema of this table reflects the organization of the metadata in each IDC release. Non-sequence attributes, such as Modality or SeriesInstanceUID, once encountered in any one file will result in the corresponding column being introduced to the table schema (i.e., if we have column X in IDC release 11, in all likelihood it will also be present in all of the subsequent releases).
dicom_metadata can be used to conduct detailed explorations of the metadata content, and to build cohorts using fine-grained controls not accessible from the IDC Portal. Note that the dicom_all table, described below, is probably a better choice for such explorations.
Due to existing limitations of the Google Healthcare API, not all DICOM attributes are extracted and available in the BigQuery tables. Specifically:
sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and RetrieveMetadata output. 1 MiB is not an exact limit, but it can be used as a rough estimate of whether or not the API will drop the tag (this limitation was not documented as of writing this) - we know that some of the instances in IDC are affected by this limitation. The fix for this limitation is targeted for sometime in 2021, according to communication with Google Healthcare support.
auxiliary_metadata
This table defines the contents of the corresponding IDC version. There is a row for each instance in the version. We group the attributes for convenience:
Collection attributes:
tcia_api_collection_id: The ID, as accepted by the TCIA API, of the original data collection containing this instance (will be Null for collections not sourced from TCIA)
idc_webapp_collection_id: The ID, as accepted by the IDC web app, of the original data collection containing this instance
collection_id: The ID, as accepted by the IDC web app. Duplicate of idc_webapp_collection_id
collection_timestamp: Datetime when the IDC data in the collection was last revised
collection_hash: md5 hash of this version of the collection containing this instance
collection_init_idc_version: The IDC version in which the collection containing this instance first appeared
collection_revised_idc_version: The IDC version in which this version of the collection containing this instance first appeared
Patient attributes:
submitter_case_id: The Patient ID assigned by the submitter of this data. This is the same as the DICOM PatientID
idc_case_id: IDC-generated UUID that uniquely identifies the patient containing this instance. This is needed because DICOM PatientIDs are not required to be globally unique
patient_hash: md5 hash of this version of the patient/case containing this instance
patient_init_idc_version: The IDC version in which the patient containing this instance first appeared
patient_revised_idc_version: The IDC version in which this version of the patient/case containing this instance first appeared
Study attributes:
StudyInstanceUID: DICOM UID of the study containing this instance
study_uuid: IDC-assigned UUID that identifies a version of the study containing this instance
study_instances: The number of instances in the study containing this instance
study_hash: md5 hash of the data in this version of the study containing this instance
study_init_idc_version: The IDC version in which the study containing this instance first appeared
study_revised_idc_version: The IDC version in which this version of the study containing this instance first appeared
Series attributes:
SeriesInstanceUID: DICOM UID of the series containing this instance
series_uuid: IDC-assigned UUID that identifies the version of the series containing this instance
source_doi: The DOI of an information page corresponding to the original data collection or analysis result that is the source of this instance
source_url: The URL of an information page that describes the original collection or analysis result that is the source of this instance
series_instances: The number of instances in the series containing this instance
series_hash: md5 hash of the data in this version of the series containing this instance
access: Collection access status: 'Public' or 'Limited' (currently all data is 'Public')
series_init_idc_version: The IDC version in which the series containing this instance first appeared
series_revised_idc_version: The IDC version in which this version of the series containing this instance first appeared
Instance attributes:
SOPInstanceUID: DICOM UID of this instance
instance_uuid: IDC-assigned UUID that identifies the version of this instance
gcs_url: The GCS URL of the file containing the version of this instance that is identified by this series_uuid/instance_uuid
aws_url: The AWS URL of the file containing the version of this instance that is identified by this series_uuid/instance_uuid
instance_hash: md5 hash of this version of this instance
instance_size: The size, in bytes, of this version of this instance
instance_init_idc_version: The IDC version in which this instance first appeared
instance_revised_idc_version: The IDC version in which this version of this instance first appeared
license_url: The URL of a web page that describes the license governing this version of this instance
license_long_name: A long form name of the license governing this version of this instance
license_short_name: A short form name of the license governing this version of this instance
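As a concrete illustration of working with these attributes, the following sketch (reusing the BigQuery client from the earlier example) computes the total size and license mix of each collection from auxiliary_metadata:

```python
# Reusing the BigQuery client set up in the earlier example:
# aggregate per-instance sizes and licenses per collection.
query = """
SELECT
  collection_id,
  ROUND(SUM(instance_size) / POW(1024, 3), 2) AS size_GiB,
  STRING_AGG(DISTINCT license_short_name) AS licenses
FROM `bigquery-public-data.idc_v16.auxiliary_metadata`
GROUP BY collection_id
ORDER BY size_GiB DESC
LIMIT 10
"""
for row in client.query(query).result():
    print(row.collection_id, row.size_GiB, row.licenses)
```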
mutable_metadata
Some non-DICOM metadata may change over time. This includes the GCS and AWS URLs of instance data, the accessibility of each instance and the URL of an instance's associated description page. BigQuery metadata tables such as the auxiliary_metadata and dicom_all tables are never revised even when such metadata changes. However, tables in the datasets of previous IDC versions can be joined with the mutable_metadata table to obtain the current values of these mutable attributes.
The table has one row for each version of each instance:
crdc_instance_uuid: The UUID of an instance version
crdc_series_uuid: The UUID of the series version that contains this instance version
crdc_study_uuid: The UUID of the study version that contains the series version
gcs_url: URL of the Google Cloud Storage (GCS) object containing this instance version
aws_url: URL of the Amazon Web Services (AWS) object containing this instance version
access: Current access status of this instance ('Public' or 'Limited')
source_url: The URL of a page that describes the original collection or analysis result that includes this instance
source_doi: The DOI of a page that describes the original collection or analysis result that includes this instance
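For example, a query along the following lines (a sketch based on the columns described above; it assumes mutable_metadata is queried from the idc_current dataset) refreshes the storage URLs of instances selected from an older release:

```python
# Join a frozen per-version table with mutable_metadata to obtain
# up-to-date bucket URLs for previously selected instances.
query = """
SELECT
  aux.SOPInstanceUID,
  mm.gcs_url,
  mm.aws_url
FROM `bigquery-public-data.idc_v13.auxiliary_metadata` AS aux
JOIN `bigquery-public-data.idc_current.mutable_metadata` AS mm
  ON aux.instance_uuid = mm.crdc_instance_uuid
LIMIT 10
"""
for row in client.query(query).result():
    print(row.SOPInstanceUID, row.aws_url)
```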
original_collections_metadata
This table is comprised of collection-level metadata for the original data collections hosted by IDC, for the most part corresponding to the content available on the TCIA collection pages. One row per collection:
tcia_api_collection_id: The collection ID as accepted by the TCIA API
tcia_wiki_collection_id: The collection ID as on the TCIA wiki page
idc_webapp_collection_id: The collection ID as accepted by the IDC web app
Program: The program to which this collection belongs
Updated: Most recent update date reported by the collection source
Status: Collection status: "Ongoing" or "Complete"
Access: Collection access conditions: "Limited" or "Public"
ImageType: Enumeration of image types/modalities in the collection
Subjects: Number of subjects in the collection
DOI: DOI that can be resolved at doi.org to the TCIA wiki page for this collection
URL: URL of an information page for this collection
CancerType: Cancer type of this collection, as assigned by the collection source(s)
SupportingData: Type(s) of additional data available
Species: Species of collection subjects
Location: Body location that was studied
Description: Description of the collection (HTML format)
license_url: The URL of a web page that describes the license governing this collection
license_long_name: A long form name of the license governing this collection
license_short_name: A short form name of the license governing this collection
analysis_results_metadata
Metadata for the TCIA analysis results hosted by IDC, for the most part corresponding to the content available on the TCIA analysis results pages. One row per analysis result:
ID: Results ID
Title: Descriptive title
DOI: DOI that can be resolved at doi.org to the TCIA wiki page for this analysis result
CancerType: TCIA-assigned cancer type of this analysis result
Location: Body location that was studied
Subjects: Number of subjects in the analysis result
Collections: Original collections studied
AnalysisArtifactsonTCIA: Type(s) of analysis artifacts generated
Updated: Date when results were last updated
license_url: The URL of a web page that describes the license governing this analysis result
license_long_name: A long form name of the license governing this analysis result
license_short_name: A short form name of the license governing this analysis result
description: Description of the analysis result
version_metadata
Metadata for each IDC version, one row per version:
idc_version: IDC version number
version_hash: MD5 hash of the hashes of the collections in this version
version_timestamp: Version creation timestamp
The following tables and views consist of metadata derived from one or more other IDC tables for the convenience of the user. For each such table, <table_name>, there is also a corresponding view, <table_name>_view, that, when queried, generates an equivalent table. These views are intended as a reference; each view's SQL is available to be used for further investigation. Several of these tables/views are discussed in more detail elsewhere in this documentation.
dicom_all, dicom_all_view
All columns from dicom_metadata together with selected data from the auxiliary_metadata, original_collections_metadata, and analysis_results_metadata tables.
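Because dicom_all combines DICOM attributes with collection-level metadata, a single query can filter on both at once. A sketch (the collection_id value is just an illustration):

```python
# Find DICOM Segmentation series within one collection; Modality and
# SeriesDescription come from the DICOM metadata, collection_id from
# the collection-level metadata joined into dicom_all.
query = """
SELECT SeriesInstanceUID, collection_id, Modality, SeriesDescription
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality = 'SEG' AND collection_id = 'nsclc_radiomics'
LIMIT 10
"""
for row in client.query(query).result():
    print(row.SeriesInstanceUID, row.SeriesDescription)
```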
segmentations, segmentations_view
This table is derived from dicom_all to simplify access to the attributes of DICOM Segmentation objects available in IDC. Each row in this table corresponds to one segment of a DICOM Segmentation instance.
measurement_groups, measurement_groups_view
This table is derived from dicom_all to simplify access to the measurement groups encoded in DICOM Structured Report TID 1500 objects available in IDC. Specifically, it contains measurement groups corresponding to the "Measurement group" content item in the DICOM SR objects; each row corresponds to one TID 1500 measurement group.
qualitative_measurements, qualitative_measurements_view
This table is derived from dicom_all to simplify access to the qualitative measurements in DICOM SR TID 1500 objects. It contains coded evaluation results extracted from those objects; each row corresponds to a single qualitative measurement.
quantitative_measurements, quantitative_measurements_view
This table is derived from dicom_all to simplify access to the quantitative measurements in DICOM SR TID 1500 objects. It contains quantitative evaluation results extracted from those objects; each row corresponds to a single quantitative measurement.
dicom_metadata_curated, dicom_metadata_curated_view
Curated values of DICOM metadata extracted from dicom_metadata.
dicom_metadata_curated_series_level, dicom_metadata_curated_series_level_view
Curated columns from dicom_metadata that have been aggregated/cleaned up to describe content at the series level. Each row in this table corresponds to a DICOM series in IDC. The columns are curated by defining queries that apply transformations to the original values of DICOM attributes.
idc_pivot_v<idc version>
A view that is the basis for the queries performed by the IDC web app.
The following tables contain TCGA-specific metadata:
tcga_biospecimen_rel9: biospecimen metadata
tcga_clinical_rel9: clinical metadata
nlst_canc: "Lung Cancer"
nlst_ctab: "SCT Abnormalities"
nlst_ctabc: "SCT Comparison Abnormalities"
nlst_prsn: "Participant"
nlst_screen: "SCT Screening"
The object namespace is hierarchical: for each version of a DICOM instance having instance UUID <instance_uuid> in a version of a series having UUID <series_uuid>, the file name is:
<series_uuid>/<instance_uuid>.dcm
Corresponding files have the same object name in GCS and S3, though the name of the containing buckets will be different.
Consider an instance in the CPTAC-CM collection that has this SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.171941254.777277241.1640849481094.35.0
It is in a series having this SeriesInstanceUID: 1.3.6.1.4.1.5962.99.1.171941254.777277241.1640849481094.2.0
The instance and series were added to IDC in version 7. At that point, the instance was assigned this UUID:
5dce0cf0-4694-4dff-8f9e-2785bf179267
and the series was assigned this UUID:
e127d258-37c2-47bb-a7d1-1faa7f47f47a
In IDC version 10, a revision of this instance was added (keeping its original SOPInstanceUID
), and assigned this UUID:
21e5e9ce-01f5-4b9b-9899-a2cbb979b542
Because this instance was revised, the series containing it was implicitly revised. The revised series was thus issued a new UUID:
ee34c840-b0ca-4400-a6c8-c605cef17630
Thus, the initial version of this instance has this file name:
e127d258-37c2-47bb-a7d1-1faa7f47f47a/5dce0cf0-4694-4dff-8f9e-2785bf179267.dcm
and the revised version of the instance has this file name:
ee34c840-b0ca-4400-a6c8-c605cef17630/21e5e9ce-01f5-4b9b-9899-a2cbb979b542.dcm
Both versions of the instance are in both AWS and GCS buckets.
Note that GCS and AWS bucket names are different. In fact, DICOM instance data is distributed across multiple buckets in both GCS and AWS. We will discuss obtaining GCS and AWS URLs in more detail a little later.
Utilities like gsutil, s3 and s5cmd "understand" the implied hierarchy in these file names. Thus the series UUID acts like the name of a directory that contains all the instance versions in the series version, both in GCS and in AWS buckets, making it easy to transfer all instances in a series from the cloud.
Because file names are more or less opaque, you will not typically select files by listing the contents of a bucket. Instead, use either the IDC Portal or the IDC BigQuery tables to identify items of interest and then generate a manifest of objects that can be passed to a utility like s5cmd.
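A sketch of that workflow: select the series of interest in BigQuery, derive the series folder URL from the per-instance aws_url values described above, and write s5cmd copy commands into a manifest file (the SeriesInstanceUID below is the one from the worked example above):

```python
# Build an s5cmd manifest for one series by stripping the instance
# file name from aws_url, leaving the bucket/series-folder prefix.
query = r"""
SELECT DISTINCT
  CONCAT('cp s3://',
         REGEXP_EXTRACT(aws_url, r's3://(.*)/[^/]+\.dcm'),
         '/* ./downloads/') AS s5cmd_command
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE SeriesInstanceUID = '1.3.6.1.4.1.5962.99.1.171941254.777277241.1640849481094.2.0'
"""
with open("manifest.s5cmd", "w") as f:
    for row in client.query(query).result():
        f.write(row.s5cmd_command + "\n")

# The manifest can then be executed with:
#   s5cmd --no-sign-request run manifest.s5cmd
```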
Most of the data in IDC is received from data collection initiatives/projects supported by the US National Cancer Institute. Whenever source images or image-derived data are not in DICOM format, they are harmonized into DICOM as part of ingestion.
As of data release v21, IDC sources of data include:
all DICOM files from the TCIA public collections are mirrored in IDC
a subset of digital pathology collections and analysis results harmonized from vendor-specific representation (as available from TCIA) into DICOM Slide Microscopy (SM) format
digital pathology slides harmonized into DICOM SM
The Cancer Genome Atlas (TCGA) slides harmonized into DICOM SM
release 1 of the HTAN data harmonized into DICOM SM
v1 of the Visible Human images harmonized into DICOM MR/CT/XC
digital pathology slides harmonized into DICOM SM
Whenever IDC replicates data from a publicly available source, we include the reference to the origin:
from the IDC Portal Explore page, click on the "i" icon next to the collection in the collections list
A simplified workflow for IDC data ingestion is summarized in the following diagram.
This section describes the current organization of IDC data. The organization of data was static from IDC Version 2 through IDC Version 13, except that clinical data was added in Version 11. Development of the clinical data resource is an ongoing project. From IDC v14, our data is available from the Amazon AWS Open Data Registry, and the files in storage buckets were organized into series-level folders.
Portal:
Discourse (community forum):
Documentation:
GitHub organization:
Tutorials:
: while most of the public DICOM collections from TCIA are available in IDC, we do not replicate limited access TCIA collections
: list curated by Stephen Aylward
: list curated by University College London
: list curated by New York University Health Sciences Library
We gratefully acknowledge the Google Public Datasets Program and the AWS Open Data Sponsorship Program that support public hosting of IDC-curated content and cover out-of-cloud egress fees!
Several members of the IDC team utilize compute resources supported via the Allocations program, which is in turn funded by the US National Science Foundation. Instructions on how to get your own allocation are available on the program's website.
We welcome submissions of image-derived data (expert annotations, AI-generated segmentations) for the images already in IDC; see the IDC Zenodo community to learn about the requirements for such submissions!
IDC works closely with and mirrors TCIA public collections. If you submit your DICOM data to TCIA and your data is released as a public collection, it will automatically become available in IDC in a subsequent release.
If you are interested in making your data available within IDC, please contact us by sending email to support@canceridc.dev.
IDC data is stored in cloud buckets, and you can search and download it for free and without login.
If you would like to use the cloud for analysis of the data, we recommend you start with the free tier of Google Colab to get free access to a cloud-hosted VM with GPU to experiment with analysis workflows for IDC data. If you are an NIH-funded researcher, you may be eligible for a free allocation via NIH STRIDES. US-based researchers can also access free cloud-based computing resources via the NSF-funded Allocations program.
We host most of the public collections from TCIA. We also host HTAN and other pathology images not hosted by TCIA. You can review the complete, up-to-date list of collections in the IDC Portal.
Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics 43 (2023).
The main website for the Cancer Research Data Commons (CRDC) is
Clinical data shared by the submitters is available for a number of imaging collections in IDC. Please see the clinical data documentation on how to search that data and how to link clinical data with imaging metadata!
Many of the imaging collections are also accompanied by genomics or proteomics data. CRDC provides an API to locate such related datasets.
A dedicated part of our Getting Started tutorial series explains how to use idc-index - a python package that aims to simplify access to IDC data
Another tutorial will help you get started with searching IDC metadata in BigQuery, which gives you access to all of the DICOM metadata extracted from IDC-hosted files
if you are not comfortable writing queries or coding in python, you can still search using some of the attributes that are not available through the portal, and you can reach out to us to include additional attributes.
In the following we describe the organization of both the metadata catalog and the buckets containing the files. As you go over this documentation, please consider completing our tutorial - it will give you the opportunity to apply the knowledge you gain by reading this article while interacting with the data, and should help you better understand this content.
Google BigQuery is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using SQL queries.
This may be a good time for you to complete the BigQuery onboarding tutorial, so that you are able to open the tables and datasets we will be discussing in the following paragraphs!
In addition to idc_v16, you will find a dataset named idc_v16_clinical. That dataset contains clinical data accompanying IDC collections. We started clinical data ingestion in IDC v11. If you want to learn more about the organization and searching of clinical data, take a look at the clinical data documentation.
Before we dive into discussing the individual tables maintained by IDC, there is just one more BigQuery-specific concept you need to learn: the view. A BigQuery view is a virtual table defined by an SQL query that is run every time you query the view (you can read more about BQ views in the BigQuery documentation).
If you are ever curious (and you should be, at least once in a while!) about the queries behind individual views, you can click on the view in the BQ console and see the query in the "Details" tab. Try this out yourself to check the query behind one of the views.
IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions for how DICOM attributes of various types are converted into BQ form are covered in the Google Healthcare API documentation.
Sequence DICOM attributes, however, may have content that is highly variable across different DICOM instances (especially in Structured Reports). Those attributes map to nested RECORD columns, and it is not unusual to see drastic differences in the corresponding columns of the table between different releases.
sequences that have more than 15 levels of nesting are not extracted (see the Google Healthcare API documentation) - we believe this limitation does not affect the data stored in IDC
Most clinical data is found in the idc_v<idc_version>_clinical datasets. However, a few tables of clinical data are found in the idc_v<idc_version> datasets.
IDC hosts a subset of the NLST clinical data that was cleared for public sharing. If you need the full clinical data, please visit the source of the NLST data.
The following tables contain NLST-specific metadata. The detailed schema of those tables is available from the BQ console.
Storage buckets are basic containers in Google Cloud Storage and AWS S3 that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage and S3 documentation).
All IDC DICOM file data for all IDC data versions, across all of the collections, are maintained in Google Cloud Storage (GCS) and AWS S3 buckets. Currently all DICOM files are maintained in buckets that allow for free egress within or out of the cloud. This is enabled through the partnership of IDC with the Google Public Datasets Program and the AWS Open Data Sponsorship Program.
Note that only (versions of) DICOM instances have associated files (as discussed above). There are no per-series or per-study files.
IDC utilizes a single Google Healthcare DICOM store to host all of the instances in the current IDC version. That store, however, is primarily intended to support visualization of the data using the OHIF and Slim viewers. At this time, we do not support access to the hosted data via the DICOMweb interface by IDC users. Please comment about your use case if you have a need to access data via the DICOMweb interface.
The list of all of the IDC collections is available in the IDC Portal.
The source_doi metadata column contains a Digital Object Identifier (DOI) at the granularity of individual files and is available both via idc-index and BigQuery interfaces.
Whenever source data is harmonized into DICOM, the DOI will correspond to a Zenodo entry for the result of harmonization, which in turn will reference the location where data can be accessed in its native format (if available). As an example, the IDC NLM-Visible-Human-Project collection refers to a DOI describing the dataset that resulted from harmonizing the original data into DICOM, which in turn references the page containing information on accessing the original files collected by the project.
Check out the release notes for information about the collections added in the individual IDC data releases.
IDC V14 introduced important enhancements to IDC data organization. The discussion of the organization of data in earlier versions is preserved here.
By clinical data we refer to the broad spectrum of image-related data that may accompany images. Such data may include demographics of the patients, observations related to their clinical history (therapies, diagnoses, findings), lab tests, and surgeries.
Not only are the terms used in the clinical data accompanying individual collections not harmonized, the format of the spreadsheets is also collection-specific. In order to search and navigate clinical data, one has to parse those collection-specific tables, and there is no interface to support searching across collections.
collection_id (STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id field in the dicom_all view in the idc_current dataset.
table_name (STRING, NULLABLE) - name of the table
table_description (STRING, NULLABLE) - description of the type of data found in the table. Usually this is set to 'clinical data', unless a description is provided in the source files
idc_version_table_added (STRING, NULLABLE) - the IDC data version for which this table was first added
idc_table_added_datetime (STRING, NULLABLE) - the date/time this particular table was first generated
post_process_src (STRING, NULLABLE) - except for the CPTAC and TCGA collections, the tables are curated from ZIP, Excel, and CSV files downloaded from the TCIA wiki. These files do not have a consistent structure and were not meant to be machine readable or to translate directly into BigQuery. A semi-manual curation process results in either a CSV or JSON file that can be directly written into a BigQuery table. post_process_src is the name of the JSON or CSV file that results from this process and is used to create the BigQuery table. This field is not used for the CPTAC- and TCGA-related tables
post_process_src_add_md5 (STRING, NULLABLE) - the md5 hash of post_process_src when the table was first added
idc_version_table_prior (STRING, NULLABLE) - the IDC version the second most recent time the table was updated
post_process_src_prior_md5 (STRING, NULLABLE) - the md5 hash of post_process_src the second most recent time the table was updated
idc_version_table_updated (STRING, NULLABLE) - the IDC version when the table was last updated
table_update_datetime (STRING, NULLABLE) - date and time an update of the table was last recorded
post_process_src_updated_md5 (STRING, NULLABLE) - the md5 hash of post_process_src when the table was last updated
number_batches (INTEGER, NULLABLE) - records the number of batches. Within the source data, patients are sometimes grouped into different 'batches' (i.e. training vs test, responder vs non-responder, etc.) and the batches are placed in different locations (i.e. different files or different sheets in the same Excel file)
source_info (RECORD, REPEATED) - an array of records with information about the table sources. These sources are either files downloaded from the TCIA wiki or another BigQuery table (as is the case for CPTAC and TCGA collections). There is a source_info record for each source 'batch' described above
source_info.srcs (STRING, REPEATED) - a source file downloaded from the TCIA wiki may be a ZIP file, a CSV file, or an Excel file. Sometimes the ZIP files contain other ZIP files that must be opened to extract the clinical data. In the source_info.srcs array, the first string is the file that is downloaded from TCIA for this particular source batch. The final string is the CSV or Excel file that contains the clinical data. Any intermediate strings are the names of ZIP files 'in between' the downloaded file and the clinical file. For CPTAC and TCGA collections this field contains the source BigQuery table
source_info.md5 (STRING, NULLABLE) - md5 hash of the file downloaded from TCIA the most recent time the table was updated
source_info.table_last_modified (STRING, NULLABLE) - CPTAC and TCGA collections only. The date and time the source BigQuery table was most recently modified, as recorded when last copied
source_info.table_size (STRING, NULLABLE) - CPTAC and TCGA collections only. The size of the source BigQuery table as recorded when last copied
collection_id (STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id field in the dicom_all view in the idc_current dataset.
case_col (BOOLEAN, NULLABLE) - true if the BigQuery column contains the patient or case id, i.e. if this column is used to determine the value of the dicom_patient_id column
table_name (STRING, NULLABLE) - table name
column (STRING, NULLABLE) - the actual column name in the table. For ACRIN collections the column name is the variable_name from the provided data dictionary. For other collections it is a name constructed by 'normalizing' the column_label (see next) into a format that can be used as a BigQuery field name
column_label (STRING, NULLABLE) - a 'free form' label for the column that does not need to conform to the BigQuery column format requirements. For ACRIN collections this is the variable_label given by a data dictionary that accompanies the collection. For other collections it is the name or label of the clinical attribute as inferred from the source document during the curation process
data_type (STRING, NULLABLE) - the type of data in this column. Again, for ACRIN collections this is provided in the data dictionary. For other collections it is inferred by analyzing the data during curation
original_column_headers (STRING, REPEATED) - the name(s) or label(s) in the source document that were used to construct the column_label field. In most cases there is one column label in the source document that prescribes the column_label. In some cases, multiple columns are concatenated and reformatted to form the column_label
values (RECORD, REPEATED) - a structure that is borrowed from the ACRIN data model. This is an array that contains observed attribute values for the given column. For ACRIN collections these values are reported in the data dictionary. For most other collections these values are determined by analyzing the source data. For simplicity this field is left blank when the number of unique values is greater than 20
values.option_code (STRING, NULLABLE) - a unique attribute value found in this column
values.option_description (STRING, NULLABLE) - a description of the option_code as provided by a data dictionary. For collections that do not have a data dictionary this is null
values_source (STRING, NULLABLE) - indicates the source of the values records. The text 'provided dictionary' indicates that the records were obtained from a provided data dictionary. The text 'derived from inspection of values' indicates that the records were determined by automated analysis of the source materials during the ETL process that generated the BigQuery tables
files (STRING, REPEATED) - names of the files that contain the source data for each batch. These are the Excel or CSV files directly downloaded from TCIA, or the files extracted from downloaded ZIP files
sheet_names (STRING, REPEATED) - for Excel-sourced files, the sheet names containing this column's values for each batch
batch (INTEGER, REPEATED) - source batches that contain this particular column. Some columns or attributes may be missing from some batches
column_numbers (STRING, REPEATED) - for each source batch, the column in the original source corresponding to this column in the BigQuery table
IDC content is organized in Collections: groups of DICOM files that were collected through a specific research activity.
Individual DICOM files included in the collection contain attributes that organize content according to the DICOM data model.
Each collection contains data for one or more cases (patients). Data for an individual patient is organized into DICOM studies, which group images corresponding to a single imaging exam/encounter collected in a given session. Studies are composed of DICOM series, which in turn consist of DICOM instances. Each DICOM instance corresponds to a single file on disk. As an example, in radiology imaging individual instances correspond to image slices in multi-slice acquisitions, while in digital pathology there is a separate file/instance for each resolution layer of the image pyramid. When using the IDC Portal you will never encounter individual instances - you will only see them if you download data to your computer.
Analysis results collections are an important concept in IDC. These collections contain analysis results that were not contributed as part of any specific collection. Such analysis results might be contributed by investigators unrelated to those who submitted the analyzed images, and may span images across multiple collections.
Whenever you work with IDC data, you should be aware of the data release version. If you build cohorts using filters or queries, the results of those queries will change as IDC content evolves. Building queries that refer to a specific data release version ensures that the result remains the same.
Here is how you can learn what version of IDC data you are interacting with, depending on what interface to the data you are using:
IDC Portal: data version and release date are displayed in the summary strip
idc-index: use the get_idc_version()
function (see the example after this list)
3D Slicer / SlicerIDCBrowser: version information is provided in the SlicerIDCBrowser module top panel, and in the pop-up window title.
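For example, with idc-index installed, you can print the data release version from the command line (a minimal sketch; the import path reflects recent idc-index releases, so treat it as an assumption):
python -c "from idc_index import index; print(index.IDCClient().get_idc_version())"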
The IDC obtains curated DICOM radiology, pathology and microscopy image and analysis data from The Cancer Imaging Archive (TCIA) and additional sources. Data from all these sources evolves over time as new data is added (common), existing files are corrected (rare), or data is removed (extremely rare).
Users interact with IDC using one of the following interfaces to define cohorts, and then perform analyses on these cohorts:
The goal of IDC versioning is to create a series of "snapshots" over time of the entirety of the evolving IDC imaging dataset, such that searching an IDC version according to some criteria (creating a cohort) will always identify exactly the same set of objects. Here "identify" particularly means providing URLs or other access methods to the corresponding physical data objects.
In order to reproduce the result of such analysis, it must be possible to precisely recreate a cohort. For this purpose an IDC cohort as defined in the Portal is specified and saved as a filter applied against a specified IDC data version. Alternatively, the cohort can be defined as an SQL query, or as a list of unique identifiers selecting specific files within a defined data release version.
Because an IDC version exactly defines the set of data against which the filter/query is applied, and because all versions of all data, except data removed due to PHI/PII concerns, should continue to be available, a cohort is therefore persistent over the course of the evolution of IDC data.
There are various reasons that can cause modification of existing collections in IDC:
images for new patients can be added to an existing collection;
additional DICOM series are sometimes added to a DICOM study over time (e.g., series that contain new annotations or analysis results);
a series may be added to or removed from an existing study;
metadata of an existing instance might be corrected (which may or may not lead to an update of the DICOM SOPInstanceUID
corresponding to the instance).
These and other possible changes mean that DICOM instances, series and studies can change from one IDC data version to the next, while their DICOM UIDs remain unchanged. This motivates the need for maintaining versioning of the DICOM entities.
The data in each IDC version, then, can be thought of as some set of versioned DICOM instances, series and studies. This set is defined in terms of the corresponding set of instance UUIDs, series UUIDs and study UUIDs. This means that if, e.g., some version of an instance having UUID UUIDx that was in IDC version Vm is changed, a new UUID, UUIDy, will be assigned to the new instance version. Subsequent IDC versions, Vm+1, Vm+2, ... will include that new instance version identified by UUIDy unless and until that instance is again changed. Similarly if the composition of some series changes, either because an instance in the series is changed, or an instance is added or removed from that series, a new UUID is assigned to the new version of that series and identifies that version of the series in subsequent IDC versions. Similarly, a study is assigned a new UUID when its composition changes.
A corollary is that only a single version of an instance, series or study is in an IDC version.
Note that instances, series and studies do not have an explicit version number in their metadata. Versioning of an object is implicit in the associated UUIDs.
This is a typical IDC UUID of a (version of a) DICOM instance:
641121f1-5ca0-42cc-9156-fb5538c14355
and this is the corresponding DRS ID:
dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355
A DRS ID can be resolved by appending it to the following URL, which is the resolution service within CRDC: https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/. For example, the following curl
command:
>> curl https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355
returns this DrsObject:
As can be seen, the access_methods
component in the returned DrsObject includes a URL for each of the corresponding files in Google GCS and AWS S3.
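For example, the download URLs alone can be extracted from the returned DrsObject with a one-liner like the following (a sketch that assumes jq is installed; the field names follow the GA4GH DRS v1 schema):
curl -s https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355 | jq -r '.access_methods[].access_url.url'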
v1 of IDC followed a different data layout than subsequent versions. Since the corresponding items are still available, we document it here for reference.
Storage buckets are named using the format idc-tcia-<TCIA_COLLECTION_NAME>
, where TCIA_COLLECTION_NAME
corresponds to the collection name in the collections table here.
Within the bucket, DICOM files are organized using the following directory naming conventions:
dicom/<StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm
where the *InstanceUID
components correspond to the respective values of the DICOM attributes in the stored DICOM files.
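For example, an instance from the TCGA-LUAD collection would have been addressed as follows (the UIDs are shown as placeholders):
gs://idc-tcia-tcga-luad/dicom/<StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm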
Egress of IDC data out of the cloud is free, since IDC data participates in the Google Public Datasets Program!
Due to existing limitations of the Google Healthcare API, not all of the DICOM attributes are extracted and made available in the BigQuery tables. Specifically:
sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and from the RetrieveMetadata output. 1 MiB is not an exact limit, but can be used as a rough estimate of whether the API will drop the tag (this limitation was not documented as of this writing). We know that some of the instances in IDC are affected by this limitation. According to communication with Google Healthcare support, a fix is targeted for sometime in 2021.
IDC users can access this table to conduct detailed exploration of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal.
In addition to the DICOM metadata tables, we maintain several additional tables that curate non-DICOM metadata (e.g., attribution of a given item to a specific collection and DOI, collection-level metadata, etc.).
In addition to the DICOM data, some of the image-related data hosted by IDC is stored in additional tables. These include the following:
Depending on whether you would like to download data interactively or programmatically, we provide two recommended tools to help you.
With the idc-index
package you get command line scripts that aim to make download simple.
Have a .s5cmd manifest file you downloaded from the IDC Portal or from the records in the IDC Zenodo community? Get the corresponding files as follows (you will also get a download progress bar, and the downloaded files will be organized in the collection/patient/study/series folder hierarchy!):
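A sketch of such a command (the idc command line tool ships with idc-index; the exact subcommand form is an assumption, so consult idc --help):
idc download manifest.s5cmd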
You can use the same command to download files corresponding to any collection, patient, study or series - simply pass the identifiers that you can copy from the portal!
Once installed, you can use SlicerIDCBrowser in one of two modes:
As an interface to explore IDC data: you can select individual collections, cases and DICOM studies, and download items of interest directly into 3D Slicer for subsequent visualization and analysis.
As a download tool: download IDC content based on the manifest you created using the IDC Portal, or on identifiers of individual cases, DICOM studies or series.
In a future release of IDC we will by default exclude limited access items from what you select in the portal, so the portal selection should be more intuitive. But if you access the data via BigQuery queries, you will need to know that "Limited" access items are not accessible, and account for this in your query.
Storage Buckets
The flat address space of IDC DICOM objects in GCS storage is accompanied by BigQuery tables that allow the researcher to reconstruct the DICOM hierarchy as it exists for any given version. There are also several BQ tables and views in which we keep copies of the metadata exposed via the TCIA interface at the time a version was captured and other pertinent information.
There is an instance of each of the following tables and views per IDC version. The set of tables and views corresponding to an IDC version are collected in a single BQ dataset per IDC version, bigquery-public-data.idc_<idc_version_number>
where bigquery-public-data
is the project in which the dataset is hosted. As an example, the BQ tables for IDC version 4 are in the bigquery-public-data.idc_v4
dataset.
In addition to the per-version datasets, the bigquery-public-data.idc_current
dataset consists of a set of BQ views. There is a view for each table or view in the BQ dataset corresponding to the current IDC release. Each such view in bigquery-public-data.idc_current
is named identically to the corresponding table or view in the bigquery-public-data.idc_<idc_version_number> dataset of the current IDC release and can be used to access that table or view.
Several Google BigQuery (BQ) tables support searches against metadata extracted from the data files. Additional BQ tables define the composition of each IDC data version.
We maintain several additional tables that curate non-DICOM metadata (e.g., attribution of a given item to a specific collection and DOI, collection-level metadata, etc.).
tcia_api_collection_id:
The ID, as accepted by the TCIA API, of the original data collection containing this instance
idc_webapp_collection_id:
The ID, as accepted by the IDC web app, of the original data collection containing this instance
collection_timestamp:
Datetime when the IDC data in the collection was last revised
source_doi:
A DOI of the TCIA wiki page corresponding to the original data collection or analysis results that is the source of this instance
collection_hash
: The md5 hash of the sorted patient_hashes
of all patients in the collection containing this instance
collection_init_idc_version:
The IDC version in which the collection containing this instance first appeared
collection_revised_idc_version:
The IDC version in which the collection containing this instance was most recently revised
Patient attributes:
submitter_case_id:
The ID, assigned by the submitter of the data to TCIA, of the patient containing this instance. This is the DICOM PatientID
idc_case_id:
IDC generated UUID that uniquely identifies the patient containing this instance
This is needed because DICOM PatientIDs are not required to be globally unique
patient_hash
: the md5 hash of the sorted study_hashes
of all studies in the patient containing this instance
patient_init_idc_version:
The IDC version in which the patient containing this instance first appeared
patient_revised_idc_version:
The IDC version in which the patient containing this instance was most recently revised
Study attributes:
StudyInstanceUID:
DICOM UID of the study containing this instance
study_uuid:
IDC assigned UUID that identifies a version of the study containing this instance.
study_instances:
The number of instances in the study containing this instance
study_hash
: the md5 hash of the sorted series_hashes
of all series in the study containing this instance
study_init_idc_version:
The IDC version in which the study containing this instance first appeared
study_revised_idc_version:
The IDC version in which the study containing this instance was most recently revised
Series attributes:
SeriesInstanceUID:
DICOM UID of the series containing this instance
series_uuid:
IDC assigned UUID that identifies a version of the series containing this instance
source_doi:
A DOI of the TCIA wiki page corresponding to the original data collection or analysis results that is the source of this instance
series_instances:
The number of instances in the series containing this instance
series_hash
: the md5 hash of the sorted instance_hashes
of all instances in the series containing this instance
series_init_idc_version:
The IDC version in which the series containing this instance first appeared
series_revised_idc_version:
The IDC version in which the series containing this instance was most recently revised
Instance attributes:
SOPInstanceUID:
DICOM UID of this instance.
instance_uuid:
IDC assigned UUID that identifies a version of this instance.
gcs_url:
The GCS URL of a file containing the version of this instance that is identified by the instance_uuid
instance_hash
: the md5 hash of the version of this instance that is identified by the instance_uuid
instance_size:
the size, in bytes, of this version of the instance that is identified by the instance_uuid
instance_init_idc_version:
The IDC version in which this instance first appeared
instance_revised_idc_version:
The IDC version in which this instance was most recently revised
license_url:
The URL of a web page that describes the license governing this instance
license_long_name:
A long form name of the license governing this instance
license_short_name:
A short form name of the license governing this instance
Due to existing limitations of the Google Healthcare API, not all of the DICOM attributes are extracted and made available in the BigQuery tables. Specifically:
sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and from the RetrieveMetadata output. 1 MiB is not an exact limit, but can be used as a rough estimate of whether the API will drop the tag (this limitation was not documented as of this writing). We know that some of the instances in IDC are affected by this limitation. According to communication with Google Healthcare support, a fix is targeted for sometime in 2021.
tcia_api_collection_id:
The collection ID as accepted by the TCIA API
tcia_wiki_collection_id:
The collection ID as on the TCIA wiki page
idc_webapp_collection_id:
The collection ID as accepted by the IDC web app
Program:
The program to which this collection belongs
Updated:
Most recent update date reported by TCIA
Status:
Collection status" Ongoing or complete
Access:
Collection access conditions: Limited or Public
ImageType:
Enumeration of image types/modalities in the collection
Subjects:
Number of subjects in the collection
DOI:
DOI that can be resolved at doi.org to the TCIA wiki page for this collection
CancerType:
TCIA assigned cancer type of this collection
SupportingData:
Type(s) of additional data available
Species:
Species of collection subjects
Location:
Body location that was studied
Description:
TCIA description of the collection (HTML format)
license_url:
The URL of a web page that describes the license governing this collection
license_long_name:
A long form name of the license governing this collection
license_short_name:
A short form name of the license governing this collection
ID:
Results ID
Title:
Descriptive title
DOI:
DOI that can be resolved at doi.org to the TCIA wiki page for this analysis result
CancerType:
TCIA assigned cancer type of this analysis result
Location:
Body location that was studied
Subjects:
Number of subjects in the analysis result
Collections:
Original collections studied
AnalysisArtifactsonTCIA:
Type(s) of analysis artifacts generated
Updated:
Date when the results were last updated
license_url:
The URL of a web page that describes the license governing this collection
license_long_name:
A long form name of the license governing this collection
license_short_name:
A short form name of the license governing this collection
cancer-idc.idc_v<version_number>.version_metadata
(also available via the canceridc-data.idc-current.version_metadata view for the current version of IDC data). Metadata for each IDC version, one row per version:
idc_version: IDC version number
version_hash: MD5 hash of hashes of collections in this version
version_timestamp: Version creation timestamp
bigquery-public-data.idc_v<idc_version_number>.measurement_groups
(also available via view for the current version of IDC data) Measurement group sequences extracted from the DICOM SR TID1500 objects
The following tables contain TCGA-specific metadata:
tcga_biospecimen_rel9:
biospecimen metadata
tcga_clinical_rel9:
clinical metadata
Some of the collections are accompanied by BigQuery tables that have not been harmonized to a single data model. Those tables are available within the BigQuery dataset corresponding to a given release, and will have the name prefix corresponding to the short name of the collection. The list below discusses those collection-specific tables.
In addition to the DICOM data, some of the image-related data hosted by IDC is stored in additional tables. These include the following:
By clinical data we refer to the broad spectrum of image-related data that may accompany images. Such data may include demographics of the patients, observations related to their clinical history (therapies, diagnoses, findings), lab tests, and surgeries.
Not only are the terms used in the clinical data accompanying individual collections not harmonized, but the format of the spreadsheets is also collection-specific. In order to search and navigate clinical data, one has to parse those collection-specific tables, and there is no interface to support searching across collections.
collection_id
(STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id
field in the dicom_all
view in the idc_current
dataset.
table_name
(STRING, NULLABLE) - name of the table
table_description
(STRING, NULLABLE) - description of the type of data found in the table. Usually this is set to 'clinical data', unless a description is provided in the source files
idc_version_table_added
(STRING, NULLABLE) - the IDC data version for which this table was first added
idc_table_added_datetime
(STRING, NULLABLE) - the date/time this particular table was first generated
post_process_src
(STRING, NULLABLE) - except for the CPTAC and TCGA collections, the tables are curated from ZIP, Excel, and CSV files downloaded from the TCIA wiki. These files do not have a consistent structure and were not meant to be machine readable or to translate directly into BigQuery. A semi-manual curation process results in either a CSV or JSON file that can be directly written into a BigQuery table. post_process_src is the name of the JSON or CSV file that results from this process and is used to create the BigQuery table. This field is not used for the CPTAC- and TCGA-related tables
post_process_src_add_md5
(STRING, NULLABLE) - the md5 hash of post_process_src when the table was first added
idc_version_table_prior
(STRING, NULLABLE) - the IDC version at the second most recent time the table was updated
post_process_src_prior_md5
(STRING, NULLABLE) - the md5 hash of post_process_src the second most recent time the table was updated
idc_version_table_updated
(STRING, NULLABLE) - the IDC version when the table was last updated
table_update_datetime
(STRING, NULLABLE) - date and time an update of the table was last recorded
post_process_src_updated_md5
(STRING, NULLABLE) - the md5 hash of post_process_src when the table was last updated
number_batches
(INTEGER, NULLABLE) - records the number of batches. Within the source data, patients are sometimes grouped into different 'batches' (e.g., training vs. test, or responder vs. non-responder), and the batches are placed in different locations (e.g., different files, or different sheets in the same Excel file)
source_info
(RECORD, REPEATED) - an array of records with information about the table sources. These sources are either files downloaded from the TCIA wiki or another BigQuery table (as is the case for CPTAC and TCGA collections). There is a source_info record for each source 'batch' described above
source_info.srcs
(STRING, REPEATED) - a source file downloaded from the TCIA wiki may be a ZIP file, a CSV file, or an Excel file. Sometimes the ZIP files contain other ZIP files that must be opened to extract the clinical data. In the source_info.srcs
array, the first string is the file that is downloaded from TCIA for this particular source batch. The final string is the CSV or Excel file that contains the clinical data. Any intermediate strings are the names of ZIP files 'in between' the downloaded file and the clinical file. For CPTAC and TCGA collections this field contains the source BigQuery table
source_info.md5
(STRING, NULLABLE) - md5 hash of the file downloaded from TCIA at the most recent time the table was updated
source_info.table_last_modified
(STRING, NULLABLE) - CPTAC and TCGA collections only. The date and time the source BigQuery table was most recently modified, as recorded when last copied
source_info.table_size
(STRING, NULLABLE) - CPTAC and TCGA collections only. The size of the source BigQuery table as recorded when last copied
collection_id
(STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id
field in the dicom_all
view in the idc_current
dataset.
case_col
(BOOLEAN, NULLABLE) - true if the BigQuery column contains the patient or case id, i.e. if this column is used to determine the value of the dicom_patient_id
column
table_name
(STRING, NULLABLE) - table name
column
(STRING, NULLABLE) - the actual column name in the table. For ACRIN collections the column name
is the variable_name
from the provided data dictionary. For other collections it is a name constructed by 'normalizing' the column_label
(see next) into a format that can be used as a BigQuery field name
column_label
(STRING, NULLABLE) - a 'free form' label for the column that does not need to conform to the BigQuery column format requirements. For ACRIN collections this is the variable_label
given by a data dictionary that accompanies the collection. For other collections it is the name or label of the clinical attribute as inferred from the source document during the curation process
data_type
(STRING, NULLABLE) - the type of data in this column. Again for ACRIN collections this is provided in the data dictionary. For other collections it is inferred by analyzing the data during curation
original_column_headers
(STRING, REPEATED) - the name(s) or label(s) in the source document that were used to construct the column_label
field. In most cases there is one column label in the source document that prescribes the column_label
. In some cases, multiple columns are concatenated and reformatted to form the column_label
values
(RECORD, REPEATED) - a structure that is borrowed from the ACRIN data model. This is an array that contains observed attribute values for this column. For ACRIN collections these values are reported in the data dictionary. For most other collections these values are determined by analyzing the source data. For simplicity this field is left blank when the number of unique values is greater than 20
values.option_code
(STRING, NULLABLE) - a unique attribute value found in this column
values.option_description
(STRING, NULLABLE) - a description of the option_code
as provided by a data dictionary. For collections that do not have a data dictionary this is null.
values_source
(STRING, NULLABLE) - indicates the source of the values
records. The text 'provided dictionary' indicates that the records were obtained from a provided data dictionary. The text 'derived from inspection of values' indicates that the records were determined by automated analysis of the source materials during the ETL process that generated the BigQuery tables.
files
(STRING, REPEATED) - names of the files that contain the source data for each batch. These are the Excel or CSV files directly downloaded from TCIA, or the files extracted from downloaded ZIP files
sheet_names
(STRING, REPEATED) - for Excel-sourced files, the sheet names containing this column's values for each batch
batch
(INTEGER, REPEATED) - source batches that contain this particular column. Some columns or attributes may be missing from some batches
column_numbers
(STRING, REPEATED) - for each source batch, the column in the original source corresponding to this column in the BigQuery table
Check out our tutorial for a brief hands-on introduction to IDC clinical data!
Clinical data is often critical in understanding imaging data, and is essential for the development and validation of imaging biomarkers. However, such data is most often stored in spreadsheets that follow conventions specific to the site that collected the data, may not be accompanied by a dictionary defining the terms used in describing clinical data, and is rarely harmonized. This can be observed in various collections ingested into IDC from The Cancer Imaging Archive (TCIA), such as the .
With release v11 of IDC, we attempt to lower the barriers to accessing clinical data accompanying IDC imaging collections. We parse the collection-specific tables and organize the underlying data into BigQuery tables that can be accessed using standard SQL queries. You can also see the summary of clinical data available for IDC collections in .
As of Version 11, IDC provides a public dataset with clinical data associated with several of its imaging collections. The clinical data tables associated with a particular version are in the dataset bigquery-public-data.idc_<idc_version_number>_clinical
. In addition the dataset bigquery-public-data.idc_current_clinical
has an identically named view for each table in the BQ clinical dataset corresponding to the current IDC release.
There are currently 130 tables with clinical data representing 70 different collections. Most of this data was curated from Excel and CSV files downloaded from . For most collections, data is placed in a single table named <collection_id>_clinical
, where <collection_id>
is the name of the collection in a standardized format (i.e. the idc_webapp_collection_id
column in the dicom_all
view in the ).
Collections from the ACRIN project have different types of clinical data spread across CSV files, and so this data is represented by several BigQuery tables. The clinical data for collections in the program is not curated from TCIA but instead is copied from a in the ISB-CGC project, which in turn was sourced from the . Similarly clinical data for collections in the is copied from the table tcga_clinical_rel9
in the idc_current
dataset, which was also created using the . Every clinical data table contains two fields we have introduced, dicom_patient_id
and source_batch
. dicom_patient_id
is identical to the PatientID
field in the DICOM files that correspond to the given patient. The dicom_patient_id
value is determined by inspecting the patient column in the clinical data file. In some of the collections' clinical data, the patients are separated into different 'batches' i.e. different source files, or different sheets in the same Excel file. The source_batch
field is an integer indicating the 'batch' for the given patient. For most collections, in which all patients' data is found in the same location, the source_batch
value is zero.
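For example, a hypothetical query against one of the clinical tables (the collection id is a placeholder; dicom_patient_id and source_batch are the fields introduced above):
bq query --use_legacy_sql=false 'SELECT dicom_patient_id, source_batch FROM `bigquery-public-data.idc_current_clinical.<collection_id>_clinical` LIMIT 5'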
Most of the clinical tables can be interpreted on their own. Tables from the ACRIN collection are an exception, as the column names and some of the column values are coded. To provide clarity and ease of use of all clinical data, we have created two metadata tables, and , that provide information about the structure and provenance of all data in this dataset. table_metadata
has table-level metadata about each clinical collection, while column_metadata
has column-level metadata.
Structure of the table:
Structure of table:
IDC relies on the DICOM data model for organizing images and image-derived data. At the same time, IDC includes certain attributes and data types that are outside of the DICOM data model. The Entity-Relationship (E-R) diagram and examples below summarize a simplified view of the IDC data model (you will find the explanation of how to interpret the notation used in this E-R diagram in the Mermaid documentation).
Collections are organized into Programs, which group related collections, or those collections that were contributed under the same funding initiative or a consortium. Example: TCGA program contains TCGA-GBM, TCGA-BRCA and other collections. You will see Collections nested under Programs in the upper left section of the . You will also see the list of collections that meet the filter criteria in the top table on the right-hand side of the portal interface.
IDC updates its data offering at intervals of 2-4 months, with the timing of data releases driven by the availability of new data, updates of existing data, introduction of new capabilities, and various priority considerations. You can see the historical summary of IDC releases in .
BigQuery: within the bigquery-public-data
project, the idc_current
dataset contains table "views" that effectively provide an alias for the latest IDC data release. To find the actual IDC data release number, expand the list of datasets under the bigquery-public-data
project, and search for the ones that follow the pattern `idc_v<number>`. The one with the largest number corresponds to the latest released version, and will match the content in idc_current
(related Google bug ).
directly or using : while this approach is most convenient, it allows searching using a small subset of attributes, defines cohorts only in terms of cases that meet the defined criteria, and has very limited options for combining multiple search criteria
tables via : this approach is most powerful, as it allows the use of to define the cohort, while leveraging the expressiveness of SQL in defining the selection logic, and allows defining a cohort at any level of the data model hierarchy (i.e., instances, series, studies or cases)
Because DICOM SOPInstanceUIDs
, SeriesInstanceUIDs
or StudyInstanceUIDs
can remain invariant even when the composition of an instance, series or study changes, IDC assigns each version of each instance, series or study a UUID to uniquely identify it and differentiate it from other versions of the same DICOM object.
As we will see in , the UUID of a (version of an) instance, and the UUID of the (version of a) series to which it belongs, are used in forming the object (file) name of the corresponding GCS and AWS objects. In addition, each instance version has a corresponding GA4GH DRS object, identified by a GUID based on the instance version's UUID. Refer to the section for details.
As described in the section, a UUID identifies a particular version of an IDC data object. Thus, there is a UUID for every version of every DICOM instance in IDC hosted data. An IDC BigQuery manifest optionally includes the UUID (called a crdc_instance_uuid) of each instance (version) in the cohort.
From the specification:
Each such UUID can be used to form a GUID that has been indexed by the Data Commons Framework (DCF), and can be used to access the data that defines that object. In particular this data includes the GCS and AWS URLs of the DICOM instance file. Though the GCS or AWS URL of an instance might change over time, the UUID of an instance can always be resolved to obtain its current URLs. Thus, for long term curation of data, it is recommended to record instance UUIDs.
The data object returned by the server is a GA4GH DRS DrsObject:
The IDC approach to storage and management of DICOM data relies on the Google Cloud Platform . We maintain three representations of the data, which are fully synchronized and correspond to the same dataset, but are intended to serve different use cases.
In order to access the resources listed below, it is assumed you have completed the to access Google Cloud console!
All of the resources listed below are accessible under the .
Storage Buckets are basic containers in Google Cloud that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage documentation ).
You can read about accessing GCP storage buckets from a Compute VM .
Assuming you have a list of GCS URLs in gcs_paths.txt
, you can download the corresponding items using the command below, substituting $PROJECT_ID
with the valid GCP Project ID (see the complete example in ):
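A sketch of such a command using gsutil (assumes one gs:// URL per line in gcs_paths.txt; the -u flag attributes requester-pays charges, if any, to $PROJECT_ID):
cat gcs_paths.txt | gsutil -u $PROJECT_ID -m cp -I .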
Google BigQuery is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using SQL queries.
IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions of how DICOM attributes of various types are converted into BQ form are covered in the Healthcare API documentation article.
sequences that have more than 15 levels of nesting are not extracted (see ) - we believe this limitation does not affect the data stored in IDC
: DICOM metadata for all of the data hosted by IDC
: collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in
: collection-level metadata for the TCIA analysis collections hosted by IDC, for the most part corresponding to the content available in
In addition to the tables above, we provide the following views (virtual tables defined by queries) that extract specific subsets of metadata, or combine attributes across different tables, for the convenience of the users
: DICOM metadata together with the collection-level metadata
: attributes of the segments stored in DICOM Segmentation objects
: measurement group sequences extracted from the DICOM SR TID1500 objects
: coded evaluation results extracted from the DICOM SR TID1500 objects
: quantitative evaluation results extracted from the DICOM SR TID1500 objects
The IDC MVP utilizes a single Google Healthcare DICOM store to host all of the collections. That store, however, is primarily intended to support visualization of the data using the OHIF Viewer. At this time, we do not support user access to the hosted data via the DICOMweb interface. See more details in the , and please comment about your use case if you have a need to access data via the DICOMweb interface.
BigQuery TCGA clinical data: . Note that this table is hosted under the ISB-CGC Google project, as documented , and its location may change in the future!
If you have questions or feedback about the download tools provided by IDC, please reach out via our - we are very interested in hearing your feedback and suggestions!
is a Python package designed to simplify access to IDC data. Assuming you have Python installed on your computer (if for some reason you do not have Python, you can check out the legacy download instructions ), you can get this package with pip
like this:
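pip install --upgrade idc-index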
Once installed, you can use it to explore, search, select and download the corresponding files, as shown in the examples below. You can also take a look at a short tutorial on using idc-index.
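A minimal sketch of both the Python API and the bundled command line tool (the method and subcommand names reflect recent idc-index releases; treat them as assumptions and consult the package documentation):
# Python API: download all files of a series into the current directory
python -c "from idc_index import index; index.IDCClient().download_dicom_series(seriesInstanceUID='<SeriesInstanceUID>', downloadDir='.')"
# command line equivalent
idc download <SeriesInstanceUID>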
idc-index
includes a variety of other helper functions, such as downloading from a manifest created using the IDC Portal, automatic generation of viewer URLs, information about the disk space needed for a given collection, and more. We are very interested in your feedback to define additional functionality to add to this package! Please reach out via if you have any suggestions.
is a free open source, cross-platform, extensible desktop application developed to support a variety of medical imaging research use cases.
IDC maintains , an extension of 3D Slicer, developed to support direct access to IDC data from your desktop. You will need to install a recent 3D Slicer 5.7.0 preview application (installers are available for Windows, Mac and Linux), and then use the 3D Slicer ExtensionManager to install the SlicerIDCBrowser extension. Take a look at the quick demo video in if you have never used the 3D Slicer ExtensionManager before.
As discussed in this community forum post, some data was moved from public access collections to limited access. At the moment, we still keep those files that used to be public in IDC before the decision made by TCIA, and the metadata for those files is still accessible in our BigQuery tables, but you cannot download those "Limited" access files referenced by gcs_url
from IDC.
As discussed in the , the issue will manifest itself as an error when accessing a gcs_url
that corresponds to a non-public file:
has a column named access
, which takes values Public
or Limited
that define if the file corresponding to the instance can be accessed. For all practical purposes, if you interact with the IDC BigQuery tables, you should make sure you exclude “Limited” access items using the following clause in your query:
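A minimal form of such a clause, given the access column described above, is:
WHERE access = "Public"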
Storage Buckets are basic containers in Google Cloud that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage documentation ).
All IDC DICOM file data for all IDC data versions and all of the are maintained in Google Cloud Storage (GCS). Currently all DICOM files are maintained in GCS buckets that allow for free egress within or out of the cloud, enabled through the partnership of IDC with .
The object namespace is flat: every object name is composed of a standard format CRDC UUID with the ".dcm" file extension. For example, the instance 905c82fd-b1b7-4610-8808-b0c8466b4dee.dcm can be accessed as gs://idc-open/905c82fd-b1b7-4610-8808-b0c8466b4dee.dcm
You can read about accessing GCP storage buckets from a Compute VM .
Egress of IDC data out of the cloud is free, since IDC data is participating in !
Typically, the user would not interact with the storage buckets to select and copy files (unless the intent is to copy the entire content hosted by IDC). Instead, one should use either the IDC Portal or IDC BigQuery tables containing file metadata, to identify items of interest and define a cohort. The cohort manifest generated by the IDC Portal can include both the Google Storage URLs for the corresponding files in the bucket, and the , which can be resolved to the Google Storage URLs to access the files.
Assuming you have a list of GCS URLs in a file gcs_paths.txt
, you can download the corresponding items using the command below, substituting $PROJECT_ID
with the valid GCP Project ID (see the complete example in ):
Google BigQuery is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using SQL queries.
bigquery-public-data.idc_v<idc_version_number>.auxiliary_metadata
(also available via the view.) This table defines the contents of the corresponding IDC version. There is a row for each instance in the version.
Collection attributes:
bigquery-public-data.idc_v<idc_version_number>.dicom_metadata
(also available via view for the current version of IDC data) DICOM metadata for each instance in the corresponding IDC version. IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions of how DICOM attributes of various types are converted into BQ form are covered in the Google Healthcare API documentation article. IDC users can access this table to conduct detailed exploration of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal. The schema is too large to document here. Refer to the BQ table and the above referenced documentation.
sequences that have more than 15 levels of nesting are not extracted (see ) - we believe this limitation does not affect the data stored in IDC
bigquery-public-data.idc_v<idc_version_number>.original_collections_metadata
(also available via the view) This table contains collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in . One row per collection:
bigquery-public-data.idc_v<idc_version_number>.analysis_results_metadata
(also available via the view for the current version of IDC data) Metadata for the TCIA analysis results hosted by IDC, for the most part corresponding to the content available in . One row per analysis result:
The following views (virtual tables defined by queries) extract specific subsets of metadata, or combine attributes across different tables, for the convenience of the users
bigquery-public-data.idc_v<idc_version_number>.dicom_all
(also available via view for the current version of IDC data) DICOM metadata together with selected auxiliary and collection metadata
bigquery-public-data.idc_v<idc_version_number>.segmentations
(also available via view for the current version of IDC data) Attributes of the segments stored in DICOM Segmentation objects
bigquery-public-data.idc_v<idc_version_number>.measurement_groups
(also available via view for the current version of IDC data) Measurement group sequences extracted from the DICOM SR TID1500 objects
bigquery-public-data.idc_v<idc_version_number>.qualitative_measurements
(also available via view for the current version of IDC data) Coded evaluation results extracted from the DICOM SR TID1500 objects
bigquery-public-data.idc_v<idc_version_number>.quantitative_measurements
(also available via view for the current version of IDC data) Quantitative evaluation results extracted from the DICOM SR TID1500 objects
IDC hosts a subset of the NLST clinical data, which was cleared for public sharing. If you need the full clinical data, please visit the .
The following tables contain NLST specific metadata. The detailed schema of those tables is available from the .
``: "Lung Cancer"
``: "SCT Abnormalities"
``: "SCT Comparison Abnormalities"
``: "Participant"
``: "SCT Screening"
IDC utilizes a single Google Healthcare DICOM store to host all of the instances in the current IDC version. That store, however, is primarily intended to support visualization of the data using the OHIF Viewer. At this time, we do not support user access to the hosted data via the DICOMweb interface. See more details in the , and please comment about your use case if you have a need to access data via the DICOMweb interface.
BigQuery TCGA clinical data: . Note that this table is hosted under the ISB-CGC Google project, as documented , and its location may change in the future!
Check out our tutorial for a brief hands-on introduction to IDC clinical data! You can also see the high-level summary of the clinical data attributes accompanying IDC data in .
Clinical data is often critical in understanding imaging data, and is essential for the development and validation of imaging biomarkers. However, such data is most often stored in spreadsheets that follow conventions specific to the site that collected the data, may not be accompanied by a dictionary defining the terms used in describing clinical data, and is rarely harmonized. This can be observed in various collections ingested into IDC from The Cancer Imaging Archive (TCIA), such as the .
With release v11 of IDC, we attempt to lower the barriers to accessing clinical data accompanying IDC imaging collections. We parse the collection-specific tables and organize the underlying data into BigQuery tables that can be accessed using standard SQL queries. You can also see the summary of clinical data available for IDC collections in .
As of Version 11, IDC provides a public dataset with clinical data associated with several of its imaging collections. The clinical data tables associated with a particular version are in the dataset bigquery-public-data.idc_<idc_version_number>_clinical
. In addition the dataset bigquery-public-data.idc_current_clinical
has an identically named view for each table in the BQ clinical dataset corresponding to the current IDC release.
There are currently 130 tables with clinical data representing 70 different collections. Most of this data was curated from Excel and CSV files downloaded from . For most collections, data is placed in a single table named <collection_id>_clinical
, where <collection_id>
is the name of the collection in a standardized format (i.e. the idc_webapp_collection_id
column in the dicom_all
view in the ).
Collections from the ACRIN project have different types of clinical data spread across CSV files, and so this data is represented by several BigQuery tables. The clinical data for collections in the program is not curated from TCIA but instead is copied from a in the ISB-CGC project, which in turn was sourced from the . Similarly clinical data for collections in the is copied from the table tcga_clinical_rel9
in the idc_current
dataset, which was also created using the . Every clinical data table contains two fields we have introduced, dicom_patient_id
and source_batch
. dicom_patient_id
is identical to the PatientID
field in the DICOM files that correspond to the given patient. The dicom_patient_id
value is determined by inspecting the patient column in the clinical data file. In some of the collections' clinical data, the patients are separated into different 'batches' i.e. different source files, or different sheets in the same Excel file. The source_batch
field is an integer indicating the 'batch' for the given patient. For most collections, in which all patients' data is found in the same location, the source_batch
value is zero.
Most of the clinical tables can be interpreted on their own. Tables from the ACRIN collection are an exception, as the column names and some of the column values are coded. To provide clarity and ease of use of all clinical data, we have created two metadata tables, and , that provide information about the structure and provenance of all data in this dataset. table_metadata
has table-level metadata about each clinical collection, while column_metadata
has column-level metadata.
Structure of the table:
Structure of table:
With this approach, you will follow a 2-step process covered on this page:
Step 2: given the manifest, download files to your computer or to a cloud VM using the s5cmd
command line tool.
Start with the query templates provided below, modify them based on your needs, and save the result in a file query.txt
. The specific values for PatientID
, SeriesInstanceUID
, StudyInstanceUID
are chosen to serve as examples.
Queries below demonstrate how to get the storage URLs used to download the cohort files.
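A sketch of one such template (a hypothetical query shape: it assumes the dicom_all view exposes the series_aws_url column named below, and the UID value is a placeholder). Emitting ready-made cp commands makes the resulting manifest directly consumable by s5cmd run in Step 2:
cat > query.txt <<'EOF'
SELECT DISTINCT CONCAT('cp ', series_aws_url, ' .') AS s5cmd_command
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE SeriesInstanceUID = '<SeriesInstanceUID>'
-- the WHERE clause can instead filter by PatientID or StudyInstanceUID
EOF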
If you want to download the files corresponding to the cohort from GCP instead of AWS, substitute series_gcp_url
for series_aws_url
in the SELECT
statement of the query, such as in the following SELECT clause:
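For example (mirroring the hypothetical template above):
SELECT DISTINCT CONCAT('cp ', series_gcp_url, ' .') AS s5cmd_command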
Next, use a Google Cloud SDK bq query
command (from the command line) to run the query and save the result into a manifest file - the list of storage URLs (or, as in the sketch above, ready-made copy commands) that can be used to download the data.
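A sketch of such an invocation (standard bq flags; tail strips the CSV header row):
bq query --use_legacy_sql=false --format=csv --max_rows=1000000 < query.txt | tail -n +2 > manifest.s5cmd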
Make sure you adjust the --max_rows
parameter in the queries above to equal or exceed the number of rows in the result of the query, otherwise your list will be truncated!
You can also get the total disk space that will be needed for the files that you will be downloading:
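One hypothetical way to do this, assuming the dicom_all view exposes the instance_size column (size in bytes) described in the BigQuery metadata section:
bq query --use_legacy_sql=false 'SELECT ROUND(SUM(instance_size)/POW(2,30), 2) AS disk_space_GiB FROM `bigquery-public-data.idc_current.dicom_all` WHERE SeriesInstanceUID = "<SeriesInstanceUID>"'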
You can verify that your setup was successful by running the following command; it should download one file from IDC.
Once s5cmd
is installed, you can use the s5cmd run
command to download the files corresponding to the manifest.
If you defined a manifest that references AWS buckets:
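# a sketch; manifest.s5cmd is the manifest produced in Step 1
s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com run manifest.s5cmd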
If you defined a manifest that references GCP buckets, you will need to specify the GCS endpoint:
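# a sketch; note the GCS endpoint
s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run manifest.s5cmd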
The slides below give a quick guided overview of how you can use IDC Portal.
No login is required to use the portal, to visualize images, or to download data from IDC!
Due to existing limitations of the Google Healthcare API, not all of the DICOM attributes are extracted and made available in the BigQuery tables. Specifically:
sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and from the RetrieveMetadata output. 1 MiB is not an exact limit, but can be used as a rough estimate of whether the API will drop the tag (this limitation was not documented as of this writing). We know that some of the instances in IDC are affected by this limitation. According to communication with Google Healthcare support, a fix is targeted for sometime in 2021.
In the following subsections you will find notebooks that don't require Python programming, or that have dependencies that make them unsuitable for the Python notebook format.
An IDC manifest may include study and/or series GUIDs that can be resolved to the underlying DICOM instance files in GCS. Such use of GUIDs in a manifest enables a much shorter manifest compared to a list of per-instance GCS URLs. Also, as explained below, a GUID is expected to be resolvable even when the data which it represents has been moved.
In IDC, we use the term GUID to mean a persistent identifier that can be resolved to a GA4GH DrsObject. GUID persistence ensures that the data which the GUID represents can continue to be located and accessed even if it has been moved to a different hosting site.
This is a typical UUID of a (version of a) DICOM instance:
641121f1-5ca0-42cc-9156-fb5538c14355
and this is the corresponding CRDC GUID:
dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355
>> curl https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355
returns:
which is a DrsObject. Because we resolved the GUID of an instance, the access_methods in the returned DrsObject includes a URL at which the corresponding DICOM entity can be accessed.
When the GUID of a series is resolved, the DrsObject that is returned does not include access methods because there are no series file objects. Instead, the contents
component of the returned DrsObject contains the URLs that can be accessed to obtain the DrsObjects of the instances in the series.
Thus, we see that when we resolve dg.4DFC/cc9c8541-949d-48d9-beaf-7028aa4906dc
, the GUID of the series containing the instance above:
curl -o foo https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/cc9c8541-949d-48d9-beaf-7028aa4906dc
we see that the contents
component includes the GUID of that instance as well as the GUID of another instance:
Similarly, the GUID of a DICOM study resolves to a DrsObject whose contents
component consists of the GUIDs of the series in that study.
At this time, most GUIDs have not been registered with the CRDC. If such a GUID is presented to the CRDC for resolution, an HTTP 404 error is returned.
As discussed in the Organization of data section of this document, the DICOM instance file naming convention changed with IDC version 2. At this time, when an instance GUID is resolved, the returned DrsObject may include a URI pointing to the V1 GCS bucket location. Those GUIDs will be re-indexed such that in the future they point to the new GCS bucket location.
Next, open QuPath and select "File > Open".
Choose just one of the .dcm
files that belong to the desired dataset, then click Open. The remaining files will be automatically detected and should not be selected.
Zooming and panning in real time:
The Image
tab on the left side of the screen shows dimension information, and lists any associated images. In this case, a thumbnail image is present under Associated Images
at the bottom of the Image
tab. Double-clicking on Series 1 (THUMBNAIL)
will open the thumbnail image in a separate window:
For this part, we will use a slide from the HTAN-OHSU collection, identified by SeriesInstanceUID 1.3.6.1.4.1.5962.99.1.1999932010.1115442694.1655562373738.4.0. As before, you can download it as follows:
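A sketch using the idc CLI installed with idc-index (the subcommand form is an assumption):
idc download 1.3.6.1.4.1.5962.99.1.1999932010.1115442694.1655562373738.4.0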
As in the brightfield case, open QuPath and select File > Open.
Choose just one of the .dcm
files in the dataset, as the other files will be automatically detected. It does not matter which file is selected. When prompted, set the image type to Fluorescence
, or as appropriate for the dataset:
The Image
tab indicates the number of channels (12 in this case). By default, all channels will be displayed at once. This can be changed by selecting View > Brightness/Contrast
or the "half-circles" icon in the toolbar:
Unchecking the Show
box will hide the channel's data, and update the image.
Make sure you first review the section to learn about the simpler interfaces that provide access to IDC data.
SlicerIDCBrowser and idc-index
discussed in the previous section aim to provide simple interfaces for data access. In some situations, however, you may want to build cohorts using metadata attributes that are not exposed in those tools. In such cases you will need to use the BigQuery interface to form your cohort and build a file manifest that you can then use with to download the files.
Step 1: create a manifest - a list of the storage bucket URLs of the files to be downloaded. If you want to download the content of the cohort defined in the IDC Portal, , and proceed to Step 2. Alternatively, you can use BigQuery SQL as discussed below to generate the manifest;
To learn more about using Google BigQuery SQL with IDC, check out part 3 of our , which demonstrates how to query and download IDC data!
A download manifest can be created using either the IDC Portal, or by executing a BQ query. If you have generated a manifest using the IDC Portal, as discussed , proceed to Step 2! In the remainder of this section we describe creating a manifest from a BigQuery query.
The BigQuery table discussed in can be used to subset the files you need based on the DICOM metadata attributes as needed, utilizing the SQL query interface. The gcs_url
and aws_url
columns contain Google Cloud Storage and AWS S3 URLs, respectively, that can be used to retrieve the files.
You can use IDC Portal to identify items of interest, or you can use SQL queries to subset your data using any of the DICOM attributes. You are encouraged to use the to test your queries and explore the data first!
For any of the queries, you can get the count of rows to confirm that the --max_rows
parameter is sufficiently large (use the to run these queries):
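For example, to count the rows a series-level manifest query would return (a hypothetical query; the UID is a placeholder):
bq query --use_legacy_sql=false 'SELECT COUNT(DISTINCT aws_url) AS row_count FROM `bigquery-public-data.idc_current.dicom_all` WHERE SeriesInstanceUID = "<SeriesInstanceUID>"'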
is a very fast S3 and local filesystem execution tool that can be used for accessing IDC buckets and downloading files both from GCS and AWS.
Install s5cmd
following the instructions in , or if you have Python pip on your system you can just do pip install s5cmd --upgrade.
The IDC Portal provides a web-based interactive interface to browse the data hosted by IDC, visualize images, build manifests describing selected cohorts, and download images defined by the manifests.
Indexing of the collection by the Data Commons Framework is pending.
: only items corresponding to the LIDC-IDRI original collection are included
: only items corresponding to the ISPY1 original collection are included
: Some of the segmentations in this collection are empty (as an example, SeriesNumber 42100 with SeriesDescription "VOI PE Segmentation thresh=70" is empty).
sequences that have more than 15 levels of nesting are not extracted (see ) - we believe this limitation does not affect the data stored in IDC
This section of the documentation complements the tutorials available in our notebooks repository:
: all of the pathology images in IDC are in DICOM Slide Microscopy format; this notebook will help you get started with using this representation and also searching IDC pathology images.
: introduction to the key metadata accompanying IDC slide microscopy images that can be used for subsetting data and building cohorts.
From the specification:
As described in the section, a UUID identifies a particular version of an IDC data object. There is a UUID for every version of every DICOM instance, series, and study in IDC hosted data. Each such UUID can be used to form a GUID that is registered by the NCI Cancer Research Data Commons (CRDC), and can be used to access the data that defines that object.
A GUID can be resolved by appending it to this URL, which is the GUID resolution service within CRDC: . For example, the following curl
command:
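A hedged Python sketch of the same resolution step (the resolver base URL and the GUID below are placeholders/assumptions used for illustration; substitute the CRDC endpoint and GUID from the text above):

```python
# Resolve a CRDC GUID to a DrsObject and list its access URLs (illustrative sketch).
import requests

RESOLVER = "https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/"  # assumed CRDC DRS endpoint
guid = "dg.4DFC/00000000-0000-0000-0000-000000000000"  # hypothetical GUID

drs_object = requests.get(RESOLVER + guid, timeout=30).json()
for method in drs_object.get("access_methods", []):
    print(method["type"], method["access_url"]["url"])
```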
is a popular open-source desktop application for visualizing and annotating slide microscopy images. It is integrated with both OpenSlide and BioFormats libraries, and as of the current QuPath 0.5.1 version supports direct loading of DICOM Slide Microscopy images. In this tutorial you will learn how to use DICOM SM images from IDC with QuPath.
First you will need to download a sample SM image from IDC to your desktop. To identify a sample image, you can navigate to the IDC Portal and copy the SeriesInstanceUID
value for a sample SM series you want to download. Given that UID, you can download the corresponding files using idc-index
python package (see details in the documentation section describing data download).
In this tutorial, we will use a series identified by SeriesInstanceUID 1.3.6.1.4.1.5962.99.1.3140643155.174517037.1639523215699.2.0
, which you can download as follows:
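A hedged sketch of this download with the idc-index Python package (method names per recent idc-index releases; the destination directory is an example):

```python
# Fetch the example slide microscopy series to a local folder (illustrative sketch).
from idc_index import index

client = index.IDCClient()
client.download_from_selection(
    seriesInstanceUID="1.3.6.1.4.1.5962.99.1.3140643155.174517037.1639523215699.2.0",
    downloadDir="./qupath_sample",  # hypothetical local directory
)
```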
When prompted for an image type, select Brightfield H&E
(or whatever is appropriate for the dataset being opened), then click Apply
. This is a QuPath feature intended to aid in analysis, and is further described in the .
The image should now display, and can be navigated by zooming/panning as described in the .
The image should then display, and can be navigated by zooming/panning as described in the .
DICOM SR uses data elements to encode a higher level abstraction that is a tree of content, where nodes of the tree and their relationships are formalized. SR-TID1500 is one of many standard templates that define constraints on the structure of the tree, and is intended for generic tasks involving image-based measurements. DICOM SR uses standard terminologies and codes to deliver structured content. These codes are used for defining both the concept names and values assigned to those concepts (name-value pairs). Measurements include coded concepts corresponding to the quantity being measured, and a numeric value accompanied by coded units. Coded categorical or qualitative values may also be present. In SR-TID1500, measurements are accompanied by additional context that helps interpret and reuse that measurement, such as finding type, location, method and derivation. Measurements computed from segmentations can reference the segmentation defining the region and the image segmented, using unique identifiers of the respective objects.
At this time, only the measurements that accompany regions of interest defined by segmentations are exposed in the IDC Portal, and in the measurements views maintained by IDC!
Tools referenced above can be used to 1) extract qualitative evaluations and quantitative measurements from the SR-TID1500 document; 2) generate standard-compliant SR-TID1500 objects.
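To make the tree structure concrete, here is a hedged pydicom sketch that walks the content tree of an SR document and prints its numeric measurements (attribute keywords follow the DICOM SR content item definitions; the file name is a placeholder):

```python
# Walk a DICOM SR content tree and print numeric measurements (illustrative sketch).
import pydicom

def walk(items, depth=0):
    for item in items:
        name = ""
        if "ConceptNameCodeSequence" in item:
            name = item.ConceptNameCodeSequence[0].CodeMeaning
        if item.ValueType == "NUM" and "MeasuredValueSequence" in item:
            mv = item.MeasuredValueSequence[0]
            units = mv.MeasurementUnitsCodeSequence[0].CodeMeaning
            print("  " * depth + f"{name}: {mv.NumericValue} {units}")
        if "ContentSequence" in item:
            walk(item.ContentSequence, depth + 1)

ds = pydicom.dcmread("measurements_sr.dcm")  # hypothetical SR-TID1500 file
walk(ds.ContentSequence)
```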
The segmentation instance in each of the following series was excluded because its DICOM PixelData size is greater than or equal to 2 GB:
1.2.826.0.1.3680043.10.511.3.10544506665348704312902213950958190
1.2.826.0.1.3680043.10.511.3.11183783347037364699862133130586654
1.2.826.0.1.3680043.10.511.3.11834745481756047014039855874680259
1.2.826.0.1.3680043.10.511.3.11901667084519361717338400810055642
1.2.826.0.1.3680043.10.511.3.12041600048156613329793822566495651
1.2.826.0.1.3680043.10.511.3.12718116375608495830041119776887887
1.2.826.0.1.3680043.10.511.3.13386724401829265460622415500801368
1.2.826.0.1.3680043.10.511.3.14042734131864468280344737986870899
1.2.826.0.1.3680043.10.511.3.17374765903080083648409690755539184
1.2.826.0.1.3680043.10.511.3.17429002643681869326389465422353495
1.2.826.0.1.3680043.10.511.3.20359930476040698387716730891020638
1.2.826.0.1.3680043.10.511.3.28397033639127902823368316410884210
1.2.826.0.1.3680043.10.511.3.28425539132321749931109935391487352
1.2.826.0.1.3680043.10.511.3.34574227972763695321794092913087775
1.2.826.0.1.3680043.10.511.3.36216094237641867532902805456135029
1.2.826.0.1.3680043.10.511.3.39533936694797964318706337783276378
1.2.826.0.1.3680043.10.511.3.39900930856460689132625586523683939
1.2.826.0.1.3680043.10.511.3.41633795217567037218184715094985555
1.2.826.0.1.3680043.10.511.3.42218106649761752724553401155203874
1.2.826.0.1.3680043.10.511.3.49098870621170235412220976183110770
1.2.826.0.1.3680043.10.511.3.50064322235999800062455171235601125
1.2.826.0.1.3680043.10.511.3.50905421517530127976832505410705816
1.2.826.0.1.3680043.10.511.3.62935684444056080516153739948364303
1.2.826.0.1.3680043.10.511.3.73572792121235596011940904319511291
1.2.826.0.1.3680043.10.511.3.74494366757564543824303304482444570
1.2.826.0.1.3680043.10.511.3.79988146996803179892075404247166692
1.2.826.0.1.3680043.10.511.3.80004293150506819482091023564947091
1.2.826.0.1.3680043.10.511.3.82774274518897141254234567300292686
1.2.826.0.1.3680043.10.511.3.84202416467561501610598853920808906
1.2.826.0.1.3680043.10.511.3.86214492184712627544696209982376598
1.2.826.0.1.3680043.10.511.3.90193069664920622990317347485104073
1.2.826.0.1.3680043.10.511.3.95666157880521064637011880609274546
1.2.826.0.1.3680043.10.511.3.96676982370873257329281821215166082
1.2.826.0.1.3680043.10.511.3.98258035017480972315346136181769675
New pathology collections
New analysis results
Revised radiology collections
Cancer Moonshot Biobank (CMB) radiology images were updated to fix incorrect values assigned to PatientID
(see details on the collection pages linked above). The updated images have different DICOM Study/Series/SOPInstanceUIDs.
Revised analysis results
New clinical metadata tables
New radiology collections
New analysis results
Revised radiology collections
(starred collections are revised due to new or revised analysis results)
Revised pathology collections
(starred collections are revised due to new or revised analysis results)
Also added missing instance SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.3459553143.523311062.1687086765943.9.0
Removed corrupted instances
SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.2164023716.1899467316.1685791236516.37.0
SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.2411736851.773458418.1686038949651.37.0
SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.2411736851.773458418.16860389
TCGA-DLBC (No description page)
New clinical metadata tables
Notes
The deprecated columns tcia_api_collection_id
and idc_webapp_collection_id
have been removed from the auxiliary_metadata
table in the idc_v18
BQ dataset. These columns were duplicates of columns collection_name
and collection_id
respectively.
New radiology collections
New analysis results
Collections analyzed:
Revised radiology collections
New clinical metadata tables
New radiology collections
New pathology collections
Revised radiology collections
New analysis results
New clinical metadata tables
New radiology collections
New pathology collections
Revised radiology collections
Revised pathology collections
New analysis results
Revised analysis results
New clinical metadata tables
New analysis results collection:
New clinical data collections:
New collections:
Updated collections:
Other:
Metadata corresponding to "limited" access collections is removed.
New clinical data collections:
Other clinical data updates:
Limited access collections are removed. Clinical metadata for the COVID-19-NY-SUB and ACRIN 6698/I-SPY2 Breast DWI collections now includes information ingested from the data dictionaries associated with these collections. In v11, the string value 'NA' was changed to null during the ETL process for some columns/collections; this is fixed in v12 and the value 'NA' is preserved.
This release introduces clinical data ingested for a subset of collections, and now available via a dedicated BigQuery dataset.
New collections:
New collections:
Updated collections:
CPTAC, TCGA and NLST collections have been reconverted due to a technical issue identified with a subset of images included in v9.
TCGA-DLBC
TCGA-KIRP: PatientID
TCGA-5P-A9KA, StudyInstanceUID
2.25.191236165605958868867890945341011875563
TCGA-BRCA: PatientID
TCGA-OL-A66H, StudyInstanceUID
2.25.82800314486527687800038836287574075736
The affected files will be included in IDC when the infrastructure limitation is addressed.
Collection access level change:
This data release introduces the concept of differential licenses to IDC: some of the collections maintained by IDC contain items that are covered by different licenses. As an example, the radiology component of the TCGA-GBM collection is covered by the TCIA limited access license, and is not available in IDC, while the digital pathology component is covered by CC-BY. With this release, we complete sharing in full of the digital pathology component of the datasets released by the CPTAC and TCGA programs.
New collections:
Updated collections:
The main highlight of this release is the addition of the NLST and TCGA Slide Microscopy imaging data. New TCGA content includes TCGA collections that are new to IDC and contain only a slide microscopy component, as well as the addition of a slide microscopy component to those IDC collections that were available earlier and included only a radiology component.
New collections
TCGA-DLBC (TCGA-DLBC collection does not have a description page)
Updated collections
The main highlight of this release is the addition of the Slide Microscopy imaging component to the remaining CPTAC collections.
New collections
Updated collections
Original collections:
Analysis results collections:
New collections:
New analysis results collections:
Updated collections:
1) CT images available as any other imaging collection (via IDC Portal, BigQuery metadata tables, and storage buckets);
3) One instance is missing from patient/study/series:
126153/1.2.840.113654.2.55.319335498043274792486636919135185299851/1.2.840.113654.2.55.262421043240525317038356381369289737801
4) Three instances are missing from patient/study/series:
215303/1.3.6.1.4.1.14519.5.2.1.7009.9004.337968382369511017896638591276/1.3.6.1.4.1.14519.5.2.1.7009.9004.180224303090109944523368212991
The DICOM Slide Microscopy (SM) images included in the collections above in IDC are not available in TCIA. TCIA only includes images in the vendor-specific SVS format!
New original collections:
New analysis results collections:
Original collections included:
Analysis collections included:
DICOM Radiotherapy Structure Sets (RTSS, or RTSTRUCT) define regions of interest by a set of planar contours.
RTSS objects can be identified by the RTSTRUCT
value assigned to the Modality
attribute, or by SOPClassUID
= 1.2.840.10008.5.1.4.1.1.481.3
.
As always, you get the most power when exploring IDC metadata using the SQL interface. As an example, the query below will select a random study that contains an RTSTRUCT series, and return a URL to open that study in the viewer:
RTSTRUCT relies on unstructured text to describe the semantics of the individual regions segmented. This information is stored in the StructureSetROISequence.ROIName
attribute. The following query will return the list of all distinct values of ROIName
and their frequency.
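A hedged sketch of such a query (the table name and the representation of StructureSetROISequence as a queryable repeated field are assumptions based on the dicom_all table described earlier in this documentation):

```python
# Tabulate distinct ROIName values across RTSTRUCT objects (illustrative sketch).
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # replace with your GCP project ID
sql = """
SELECT roi.ROIName AS ROIName, COUNT(*) AS occurrences
FROM `bigquery-public-data.idc_current.dicom_all`
CROSS JOIN UNNEST(StructureSetROISequence) AS roi
WHERE Modality = 'RTSTRUCT'
GROUP BY ROIName
ORDER BY occurrences DESC
"""
for row in client.query(sql).result():
    print(row.ROIName, row.occurrences)
```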
DICOM Segmentation object (SEG) can be identified by SOPClassUID
= 1.2.840.10008.5.1.4.1.1.66.4
Unlike most "original" image objects that you will find in IDC, SEG belongs to the family of enhanced multiframe image objects, which means that it stores all of the frames (slices) in a single object. SEG can contain multiple segments, a segment being a separate label/entity being segmented, with each segment containing one or more frames (slices). All of the frames for all of the segments are stored in the PixelData
attribute of the object.
We recommend you use one of the following tools to interpret the content of the DICOM SEG and convert it into alternative representations:
Tools referenced above can be used to 1) extract volumetrically reconstructed mask images corresponding to the individual segments stored in DICOM SEG; 2) extract segment-specific metadata describing its content; 3) generate standard-compliant DICOM SEG objects from research formats.
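As a hedged illustration of point 2 above, segment-level metadata can be inspected with plain pydicom (the file name is a placeholder; volumetric reconstruction of the mask is better left to the tools listed above):

```python
# List segments and the coded structures they represent in a DICOM SEG (illustrative sketch).
import pydicom

seg = pydicom.dcmread("segmentation.dcm")  # hypothetical DICOM SEG file
print("Number of frames:", int(seg.NumberOfFrames))
for segment in seg.SegmentSequence:
    prop = segment.SegmentedPropertyTypeCodeSequence[0]
    print(
        f"Segment {segment.SegmentNumber}: {segment.SegmentLabel} "
        f"({prop.CodeValue}, {prop.CodingSchemeDesignator}, {prop.CodeMeaning})"
    )
```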
IDC relies on DICOM for data modeling, representation and communication. Most of the data stored in IDC is in DICOM format. If you want to use IDC, you (hopefully!) do not need to become a DICOM expert, but you do need to have a basic understanding of how DICOM data is structured, and how to transform DICOM objects into alternative representations that can be used by the tools familiar to you.
This section is not intended to be a comprehensive introduction to the standard, but rather a very brief overview of some of the concepts that you will need to understand to better use IDC data.
Value Multiplicity (VM) defines the number of items of the prescribed VR that can be contained in a given data element.
It is critical to recognize that while all of the DICOM files at the high level are structured exactly in the same way and follow the same syntax and encoding rules, interpretation of the content of an individual file is dependent on the specific type of object it encodes!
When you use the IDC portal to build your cohort, unique identifiers for the object classes are mapped to their names, which are available under the "Object class" group of facets in the search interface.
We differentiate between the original and derived DICOM objects in the IDC portal and discussions of the IDC-hosted data. By Original objects we mean DICOM objects that are produced by image acquisition equipment - MR, CT, or PET images fall into this category. By Derived objects we mean those objects that were generated by means of analysis or annotation of the original objects. Those objects can contain, for example, volumetric segmentations of the structures in the original images, or quantitative measurements of the objects in the image.
Most of the images stored in IDC are saved as objects that store individual slices of the image in separate instances of a series, with the image stored in the PixelData
attribute.
As of the production release, IDC contains both radiology and digital pathology images. The following publication can serve as a good introduction to the use of DICOM for digital pathology.
Open source libraries such as DCMTK, GDCM, ITK, and pydicom can be used to parse such files and load pixel data of the individual slices. Recovering geometry of the individual slices (spatial location and resolution) and reconstruction of the individual slices into a volume requires some extra consideration.
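For example, a hedged pydicom sketch that loads a single-frame image series from a local folder and orders the slices geometrically before stacking them (the directory name is a placeholder, and decoding compressed pixel data may require additional pydicom plugins):

```python
# Load the slices of a single-frame CT/MR series and sort them along the slice normal (illustrative sketch).
import glob

import numpy as np
import pydicom

slices = [pydicom.dcmread(f) for f in glob.glob("series_dir/*.dcm")]  # hypothetical folder

# Project ImagePositionPatient onto the normal of ImageOrientationPatient to sort geometrically.
orientation = np.array(slices[0].ImageOrientationPatient, dtype=float)
normal = np.cross(orientation[:3], orientation[3:])
slices.sort(key=lambda s: float(np.dot(normal, s.ImagePositionPatient)))

volume = np.stack([s.pixel_array for s in slices])
print(volume.shape)
```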
DICOM and TIFF are two different image file formats that share many similar characteristics, and are capable of encoding exactly the same pixel data, whether uncompressed or compressed with common lossy schemes (including JPEG and JPEG 2000). This allows the pixel data to be losslessly transformed from one format to the other and back.
The DICOM file format was also deliberately designed to allow the two formats (TIFF and DICOM) to peacefully co-exist in the same file, sharing the same pixel data without expanding the file size significantly. This is achieved by leaving some unused space at the front of the DICOM file ("preamble"), which allows for the presence of a TIFF format recognition code ("magic number") and a pointer to its Image File Directory (IFD), which in turn contains pointers into the shared DICOM Pixel Data element.
The dual-personality mechanism supports both traditional strip-based TIFF organization, such as might be used to encode a single frame image, as well as the tile-based format, which is commonly used for Whole Slide Images (WSI), and which is encoded in DICOM with each tile as a frame of a "multi-frame" image.
Unlike TIFF files, which allow multiple different sized images to be encoded in the same file, DICOM does not, so there are limits to this approach. For example, though an entire WSI pyramid can be encoded in a TIFF file, the DICOM WSI definition requires each pyramid layer to be in a separate file, and all frames (tiles) within the same file to be the same size.
Most of the structural metadata that describes the organization and encoding of the pixel data is similar in DICOM and TIFF. It is copied into the tags (data elements) encoded in the respective format "headers". Biomedical-specific information, such as patient, specimen and anatomical identifiers and descriptions, as well as the acquisition technique, is generally only encoded in the DICOM data elements, there being no corresponding standard TIFF tags for it. Limited spatial information (such as physical pixel size) can be encoded in TIFF tags, but more complex multi-dimensional spatial location is standardized only in the DICOM data elements.
The dictionary of TIFF tags can be extended with application-specific entries. This has been done for various non-medical and medical applications (e.g., GeoTIFF, DNG, DEFF). Other tools have used alternative mechanisms, such as embedding text strings (Leica/Aperio SVS) or structured metadata in other formats (such as XML for OME) within a TIFF string tag (e.g., ImageDescription). This approach can be used with DICOM-TIFF dual-personality files as well, since DICOM does not restrict the content of the TIFF tags; it does, however, require updating or crafting the textual metadata to actually reflect the characteristics of the encoded pixel data.
It is hoped that the dual-personality approach may serve to mitigate the impact of limited support of one format or the other in different clinical and research tools for acquisition, analysis, storage, indexing, distribution, viewing and annotation.
One of the fundamental principles of DICOM is the use of controlled terminologies, or lexicons, or coding schemes (for the purposes of this guide, these can be used interchangeably). While using the DICOM data stored in IDC, you will encounter various situations where the data is captured using coded terms.
Controlled terminologies define a set of codes, and sometimes their relationships, that are carefully curated to describe entities for a certain application domain. Consistent use of such terminologies helps with uniform data collection and is critical for harmonization of activities conducted by independent groups.
When codes are used in DICOM, they are saved as triplets that consist of
CodeValue: unique identifier for a term
CodingSchemeDesignator: code for the authority that issued this code
CodeMeaning: human-readable code description
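For example, pydicom represents such triplets with its Code helper; the sketch below constructs the coded concept for "Breast" (the specific code value is shown for illustration):

```python
# A coded concept expressed as a (CodeValue, CodingSchemeDesignator, CodeMeaning) triplet.
from pydicom.sr.coding import Code

breast = Code(value="76752008", scheme_designator="SCT", meaning="Breast")
print(breast.value, breast.scheme_designator, breast.meaning)
```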
The following white papers are intended to provide explanation and clarification into applying DICOM to encoding specific types of data.
Components on the left side of the page give you controls for configuring your selection:
Panels on the right side will automatically update based on what you select on the left side!
Selection configuration reflects the active search scope/filters in the Cohort Filters section. You can download all of the studies that match your filters. Below you will see the Cart section. Cart is helpful when selecting data by individual filters is too imprecise, and you want to have more granular control over your selection by selecting specific collections/patients/studies/series.
Filtering results section consists of the tables containing matching content that you can navigate following IDC Data model: first table shows the matching collections, selecting a collection will list matching cases (patients), selection of a case will populate the next table listing matching studies for the patient, and finally selecting a study will expand the final table with the list of series included in the study.
In the following sections of the documentation you will learn more about each of the items we just discussed.
The DICOM data model is implicit, and is not defined in a machine-readable structured form by the standard!
In this section we discuss derived DICOM objects, including annotations, that are stored in IDC. It is important to recognize that, in practice, annotations are often shared in non-standard formats. When IDC ingests a dataset where annotations are available in such a non-standard representation, those need to be harmonized into a suitable DICOM object to be available in IDC. Due to the complexity of this task, we are unable to perform such harmonization for all of the datasets. If you want to check if there are annotations in non-DICOM format available for a given collection, you should locate the original source of the data, and examine the accompanying documentation for available non-DICOM annotations.
Non-standard annotations are not searchable, usually cannot be visualized in off-the-shelf tools, and require custom code to interpret and parse. The situation is different for the DICOM derived objects that we discuss in the following sections.
In IDC we define "derived" DICOM objects as those that are obtained by analyzing or post-processing the "original" image objects. Examples of derived objects include annotations of the images that define image regions or describe findings about those regions, and voxel-wise parametric maps calculated from the original images.
Although the DICOM standard provides a variety of mechanisms that can be used to store specific types of derived objects, most of the image-derived objects currently stored in IDC fall into the following categories:
The type of the object is defined by the object class unique identifier stored in the SOPClassUID
attribute of each DICOM object. In the IDC Portal we allow the user to define the search filter based on the human-readable name of the class instead of the value of that identifier.
You can find detailed descriptions of these objects applied to specific TCIA datasets in the following open access publications:
The open source DCMTK toolkit can be used to render the content of the DICOM SR tree in a human-readable form (you can see one example of such rendering ). Reconstructing this content using tools that operate with DICOM content at the level of individual attributes can be tedious. We recommend the tools referenced above, which also provide capabilities for reading and writing SR-TID1500 content:
: high-level DICOM abstractions for the Python programming language
: open source DCMTK-based C++ library and command line converters that aim to help with the conversion between imaging research formats and the standard DICOM representation for image analysis results
: C++ library that provides API abstractions for reading and writing SR-TID1500 documents
SR-TID1500-specific metadata attributes are available in the table views maintained by IDC. See details .
Data hosted by IDC is ingested from several sources, including , , and .
Please refer to the license and terms of use, which are defined in the license_url
and source_doi
columns of the IDC BigQuery . You can filter the data by license type in the .
Collections analyzed:
Collections analyzed:
WARNING: After the release of v20, it was discovered that a mistake had been made during data conversion that affected the newly-released segmentations accompanying the "RMS-Mutation-Prediction" collection. Segmentations released in v20 for this collection have the segment labels for alveolar rhabdomyosarcoma (ARMS) and embryonal rhabdomyosarcoma (ERMS) switched in the metadata relative to the correct labels. Thus segment 3 in the released files is labelled in the metadata (the SegmentSequence) as ARMS but should correctly be interpreted as ERMS, and conversely segment 4 in the released files is labelled as ERMS but should correctly be interpreted as ARMS. We apologize for the mistake and any confusion that it has caused, and will be releasing a corrected version of the files in the next release as soon as possible.
Collections analyzed:
Collections analyzed:
Collections analyzed:
Collections analyzed:
* Collections analyzed:
** Collections analyzed:
(revisions only to clinical data)
**
(fix PatientAges > 090Y)
(fix PatientAges > 090Y)
*
(All TCGA revisions are to correct multiple manufacturer values within same series)
Collections analyzed:
(TCIA description: “Repair of DICOM tag (0008,0005) to value "ISO_IR 100" in 79 series”)
(Revised because results from CPTAC-CRCC-Tumor-Annotations were added)
(Revised because results from CPTAC-UCEC-Tumor-Annotations were added)
(Revised because results from CPTAC-PDA-Tumor-Annotations were added)
(ICDC-Glioma radiology added in a previous version)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Added DICOM version of MED_ABD_LYMPH_MASKS.zip segmentations that were previously available”)
(Revised because QIBA-VolCT-1B analysis results were added)
(Revised because analysis results from nnU-Net-BPR-Annotations were revised)
(Revised because analysis results from nnU-Net-BPR-Annotations were revised)
(11 pathology-only patients removed at request of data owner)
(1 pathology-only patient removed at request of data owner)
(Analysis of NLST and NSCLC-Radiomics)
(Annotations of NLST and NSCLC-Radiomics radiology)
This release does not introduce any new data, but changes the bucket organization and introduces replication of IDC files in Amazon AWS storage buckets, as described in .
In this release we introduce the new HTAN program, currently including three collections released by the .
*
*
Note that the TCGA-KIRP and TCGA-BRCA collections (marked with an asterisk in the list above) are currently missing SM high resolution layer files/instances due to a limitation of Google Healthcare that makes it impossible to ingest datasets that exceed certain internal limits. Specifically, the following patient/studies are affected:
is now available as a public access collection
The following collections became limited access due to the , which is the original source of those collections.
Outcome Prediction in Patients with Glioblastoma by Using Imaging, Clinical, and Genomic Biomarkers: Focus on the Nonenhancing Component of the Tumor ()
DICOM-SEG Conversions for TCGA-LGG and TCGA-GBM Segmentation Datasets ()
is added. The data included consists of the following components:
2) a subset of clinical data available in the BigQuery tables starting with nlst_
under the idc_v4
dataset, as documented in the section.
The following radiology collections were updated to include DICOM Slide Microscopy (SM) images converted from the original vendor-specific representation into .
Listed below are all of the and collections currently hosted by IDC, with links to the Digital Object Identifiers (DOIs) of those collections.
Listed below are all of the and collections currently hosted by IDC, with links to the Digital Object Identifiers (DOIs) of those collections.
(only items corresponding to the LIDC-IDRI original collection are included)
(only items corresponding to the ISPY1 original collection are included)
If you use the IDC Portal, you can select cases that include RTSTRUCT objects by selecting "Radiotherapy Structure Set" in the "Modality" section under the "Original" tab (). Here is an example study that contains an RTSS series.
We recommend tool for converting planar contours of the individual structure sets into volumetric representation.
If you use the IDC Portal, you can select cases that include SEG objects by selecting "Segmentations" in the "Modality" section () under the "Original" tab. Here is an example study that contains a SEG series.
You can further explore segmentations available in IDC via the "Derived" tab of the Portal by filtering them by specific types and anatomic locations. As an example, such a filter will select cases that contain segmentations of a nodule.
Metadata describing the segments is contained in the SegmentSequence
of the DICOM object, and is also available in the BigQuery table view maintained by IDC. That table contains one row per segment, and for each segment includes metadata such as the algorithm type and the structure segmented.
: open source DCMTK-based C++ library and command line converters that aim to help with the conversion between imaging research formats and the standard DICOM representation for image analysis results
: high-level DICOM abstractions for the Python programming language
: C++ library that provides API abstractions for reading and writing SEG objects
SEG-specific metadata attributes are available in the table views maintained by IDC. See details .
Digital Imaging and Communications in Medicine (DICOM): A Practical Introduction and Survival Guide, 2nd Edition by Pianykh, Oleg S. published by Springer (2011).
As discussed in , the main mechanism for accessing the data stored in IDC is by using the storage buckets that contain individual files indexed through other interfaces. Each of the files in the IDC-maintained storage buckets encodes a DICOM object. Each DICOM object is a collection of data elements or attributes. Below is an example of a subset of attributes in a DICOM object, as generated by the IDC OHIF Viewer (which can be toggled by clicking the "Tag browser" icon in the IDC viewer toolbar):
The standard defines constraints on what kind of data each of the attributes can contain. Every single attribute defined by the standard is listed in the , which defines those constraints:
Value Representation (VR) defines the type of data that a data element can contain. There are 27 DICOM VRs, and they are defined in .
What attributes are included in a given object is determined by the type of object (or, to follow the DICOM nomenclature, Information Object). is dedicated to the definitions (IODs) of those objects.
How do you know what object is encoded in a given file (or instance of the object, using the DICOM lingo)? For this purpose there is an attribute SOPClassUID
that uniquely identifies the class of the encoded object. The content of this attribute is not easy to interpret, since it is a unique identifier. To map it to the specific object class name, you can consult the complete list of object classes available in .
A somewhat related attribute that hints at the type of object is Modality
, which is defined by the standard as "Type of equipment that originally acquired the data used to create the images in this Series", and is expected to take one of the values from . However, Modality
is not equivalent to SOPClassUID
, and should not be used as a substitute. As an example, data derived from images of the original modality may be saved as a different object class while keeping the Modality value unchanged.
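A hedged pydicom sketch of the distinction (the file name is a placeholder; the UID-to-name mapping relies on pydicom's built-in UID dictionary):

```python
# Compare the object class (SOPClassUID) with the Modality attribute (illustrative sketch).
import pydicom
from pydicom.uid import UID

ds = pydicom.dcmread("instance.dcm", stop_before_pixels=True)  # hypothetical file
print("SOP Class:", UID(ds.SOPClassUID).name)  # e.g. "Segmentation Storage"
print("Modality :", ds.Modality)               # e.g. "SEG"
```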
Herrmann, M. D., Clunie, D. A., Fedorov, A., Doyle, S. W., Pieper, S., Klepeis, V., Le, L. P., Mutter, G. L., Milstone, D. S., Schultz, T. J., Kikinis, R., Kotecha, G. K., Hwang, D. H., Andriole, K. P., John Lafrate, A., Brink, J. A., Boland, G. W., Dreyer, K. J., Michalski, M., Golden, J. A., Louis, D. N. & Lennerz, J. K. Implementing the DICOM standard for digital pathology. J. Pathol. Inform. 9, 37 (2018).
: command-line tool to convert neuroimaging data from the DICOM format to the NIfTI format
: open source software for image computation, which includes
: python library providing API and command-line tools for converting DICOM images into NIfTI format
: python interface to the , includes .
Clunie, D. A. Dual-Personality DICOM-TIFF for Whole Slide Images: A Migration Technique for Legacy Software. J. Pathol. Inform. 10, 12 (2019).
The IDC Portal provides a web-based interactive interface to browse the data hosted by IDC, visualize images, build manifests describing selected cohorts, and download images defined by the manifests.
DICOM relies on various sources of codes, all of which are listed in of the standard.
As an example, if you query the view with the following query in the BQ console:
You will see columns that contain coded attributes of the segment. In the example below, the value of AnatomicRegion
corresponding to the segment is assigned the value (T-04000, SRT, Breast), where "SRT" is the coding scheme designator corresponding to the SNOMED-RT coding scheme.
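As a hedged sketch (the segmentations view name follows the idc_current dataset naming, and the column list is an assumption apart from AnatomicRegion mentioned above):

```python
# Peek at coded segment-level attributes in the IDC segmentations view (illustrative sketch).
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # replace with your GCP project ID
sql = """
SELECT SOPInstanceUID, AnatomicRegion
FROM `bigquery-public-data.idc_current.segmentations`
LIMIT 10
"""
for row in client.query(sql).result():
    print(row.SOPInstanceUID, row.AnatomicRegion)
```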
As another example, quantitative and qualitative measurements extracted from the SR-TID1500 objects are stored in the and views, respectively. If we query those views to see the individual measurements, they also show up as coded items. Each of the quantitative measurements includes a code describing the quantity being measured, the actual numeric value, and a code describing the units of measurement:
Comments and questions regarding those white papers are welcomed from the community! Please ask any related questions on , or by adding comments directly in the documents referenced below:
, 2020
Search scope allows you to limit your search to just the specific programs, collections and analysis results (as discussed in the documentation of the ).
Search configuration gives you access to a small set of metadata attributes to select DICOM studies (where "DICOM studies" fit into IDC data model is also discussed in the page) that contain data that meets the search criteria.
DICOM defines its own model to map relevant entities from the real world. That model, as , is shown in the figure below.
DICOM data model entities do not always map to DICOM objects! In fact, every DICOM object you will ever encounter in IDC will contain attributes describing various properties of the entries at different levels of the real world data model. Such objects are called Composite Information Objects. The of the Composite Information Object Definitions is shown below, and covers all of the composite objects defined by the standard.
As can be observed from this diagram, "each Composite Instance IOD [Entity-Relationship] Model requires that all Composite Instances that are part of a specific Study shall share the same context. That is, all Composite Instances within a specific Patient Study share the same Patient and Study information; all Composite Instances within the same Series share the same Series information; etc." ().
Each of the boxes in the diagram above corresponds to an Information Entity (IE), which in turn is composed of Information Modules. Information Modules group attributes that are related. As an example, the Patient IE includes the Patient Module, which in turn includes such attributes as PatientID, PatientName, and PatientSex.
Make sure you complete the IDC notebooks to get introduced to IDC data organization, download, visualization and other first-order topics.
As an example, the collection is available in IDC. If you mouse over the name of that collection in the IDC Portal, the tooltip will provide the overview of the collection and the link to the source.
You will also find the link to the source in the .
Finally, if you select data using SQL, you can use the source_DOI
and/or the source_URL column to identify the source of each file in the subset you selected (learn more about source_DOI
, licenses and attribution in part 3 of our ).
For the collection in question, the source DOI is , and on examining that page you will see a pointer to the CSV file with the coordinates of the bounding boxes defining regions containing lesions.
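As a hedged illustration of the SQL route (column names follow the dicom_all table and the license_url column mentioned elsewhere in this documentation; the collection_id value is a placeholder):

```python
# List the source DOI/URL and license for every file of a collection (illustrative sketch).
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # replace with your GCP project ID
sql = """
SELECT DISTINCT source_DOI, source_URL, license_url
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = 'example_collection'  -- hypothetical collection_id
"""
for row in client.query(sql).result():
    print(row.source_DOI, row.source_URL, row.license_url)
```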
voxel segmentations stored as DICOM Segmentation objects
segmentations defined as a set of planar regions stored as DICOM Radiotherapy Structure Sets
quantitative measurements and qualitative evaluations for the regions defined by DICOM Segmentations; these are stored as a specific type of DICOM Structured Report object that follows the DICOM SR TID1500 template (SR-TID1500)
Fedorov, A., Clunie, D., Ulrich, E., Bauer, C., Wahle, A., Brown, B., Onken, M., Riesmeier, J., Pieper, S., Kikinis, R., Buatti, J. & Beichel, R. R. DICOM for quantitative imaging biomarker development: a standards based approach to sharing clinical data and structured PET/CT analysis results in head and neck cancer research. PeerJ 4, e2057 (2016).
Fedorov, A., Hancock, M., Clunie, D., Brochhausen, M., Bona, J., Kirby, J., Freymann, J., Pieper, S., J W L Aerts, H., Kikinis, R. & Prior, F. DICOM re-encoding of volumetrically annotated Lung Imaging Database Consortium (LIDC) nodules. Med. Phys. (2020).
The OHIF and Slim viewers do not support 32-bit browsers.
The main functions of the viewer are available via the toolbar controls shown below.
The functionality supported by those tools should be self-explanatory, or can be discovered via quick experimentation.
IDC Viewer supports visualization of DICOM Segmentation objects (SEG) and DICOM Radiotherapy Structure Sets (RTSTRUCT). When available in a given study, you will see those modalities labeled as such in the left-hand panel of the viewer, as shown below. To see a specific SEG or RTSTRUCT, double-click on the corresponding thumbnail. After that you can open the RTSTRUCT/SEG panel in the upper right corner to jump to the locations of the specific structure sets or segments, and to control their individual visibility.
Note that certain modalities, such as Segmentation (SEG) and Real World Value Mapping (RWVM) objects, cannot be selected for visualization from the IDC Portal. SEG can only be viewed in the context of the image series segmented, and RWVM series are not viewable and will not show up in the left panel of the viewer.
Below is an example of series objects that are not viewable at the series level.
The IDC pathology viewer allows for interactive visualization of digital slide microscopy (SM) images.
Here are some specific examples, taken from the IDC Portal dashboard:
You can share the viewer URLs if you want to refer to visualizations of the specific items from IDC. You can also use this functionality if you want to visualize specific items from your notebook or a custom dashboard (e.g., a Google DataStudio dashboard).
If you want to visualize your own images, or if you would like to combine IDC images with the analysis results or annotations you generated, you do have several options:
Numbers in the grayed ovals next to the search filters indicate the total number of cases (patients) that have the specific attribute
Click on the "i" button to toggle information panel about the individual items in the search panels
Cohort filters panel: get the shareable URL for the current selection by clicking "URL" button in the Cohort Filters panel
Get the manifest for downloading all of the matching studies by clicking "Manifest" button in the Cohort Filters panel
Search results are updated dynamically based on the search configuration. At any time you can expand the items on the right to explore the selected collections, cases, studies and series.
Studies and series tables include the button to open those in the browser-based image viewer.
Our DICOMWeb endpoint should only be used when data access needs cannot be satisfied using other mechanisms (e.g., when accessing individual frames of the microscopy images without having to download the entire binary file).
Egress of data via the DICOMweb interface is capped at a non-disclosed limit that is tracked per IP. It is not acceptable to “IP hop” in an attempt to circumvent individual daily quotas, since there is also a global daily cap to prevent full egress of the imaging collection. Note that if this global cap is hit, all other users of the site would be unable to use the viewers for the rest of the day (using the UTC clock). Thus, IP hopping against the proxy that causes the global quota to be hit will be considered a denial-of-service attack.
If you reach your daily quota, but feel you have a compelling cancer imaging research use case to request an exception to the policy and an increase in your daily quota, please reach out to us at support@canceridc.dev to discuss the situation.
We are continuously monitoring the usage of the proxy. Depending on the actual costs and usage, this policy may be revisited in the future to restrict access via the DICOMweb interface for any uses other than IDC viewers.
The IDC Portal offers a lot of flexibility in selecting items to download. In all cases, download of data from the IDC Portal is a two-step process:
Select items and export a manifest corresponding to your selection.
"IDC manifest" is a text file that contains URLs to the files in cloud buckets that correspond to your selection. It will contain one line for each DICOM series, as IDC files are organized in series-level folders in the cloud storage.
You will see "Cart" icon in the search results collections/cases/studies/series tables. Any of the items in these tables can be added to the cart for subsequent downloading of the corresponding files.
Get the manifest for the cart content using the "Manifest" button in the Cart panel.
Clicking the "Manifest" button in the "Cohort Filters" panel will give you the manifest for all of the studies that match your current selection criteria.
The studies table contains a button for downloading a manifest that references the files in the given study. To download a single series, no manifest is needed: you will see the command line to run to perform the download.
If you would like to download the entire study, or the specific image you see in the image viewer, you can use the download button in the viewer interface.
The version of the viewer is available from the "About" menu for the OHIF (radiology) viewer, and the "Get app info" menu for the Slim (pathology) viewer. Both of those menus are in the upper right corner of the window.
The final OHIF v2 published version is 4.12.45. Upstream changes based on v2 will be accessible through the v2-legacy branch (will not be published to NPM).
Main highlights from v2-legacy since 4.12.45:
Fix high and critical dependency issues reported by dependabot
Update SEG tolerance popup and update SEG thumbnail warning: Jump to first segment item image and show warning message only once on onChange events
Update to issues and PR templates
Address segmentation visibility toggle applied to all segmentations instead of the active one only
Update dcmjs version so it throws 'Failed to find the reference image in the source data. Cannot load this segmentation' error instead of logging a warning to console
Address eye icon for segment not shown when segment name is long
Change message for segmentation when it fails to load due to orientation tolerance
Main highlights of this release include:
Handle missing ReferencedInstanceSequence attribute: Update parsing logic to consider attribute as optional.
Main highlights of this release include:
Remove unused code from DICOM SR parsing: Remove referencedImages attribute from SR display sets. Within TID 1500, sub-template TID 1600 (Image Library) is not required when parsing SR for image references for annotations and planar measurements. The same information is obtained from sub-template TID 1501 > TID 300 > TID 320.
Main highlights of this release include:
Update message for segmentation error loading due to orientation tolerance
Main highlights of this release include:
Correct Parsing Logic for Qualitative Instance Level SR
Main highlights of this release include:
Fix 2d MPR rendering issue for the sagittal view
New Features
Support configuration of multiple origin servers for different types of DICOM objects (SOP Storage Classes)
Enhancements
Improved error handling
Check Pyramid UID (if available) when grouping images into digital slides
Bug Fixes
Use Acquisition UID (if available) to group images into digital slides
Main highlights of this release include:
New features
Add new tool to go to specific slide position;
Show mouse position in slide coordinate system.
Enhancements
Improve performance of translations between image and slide coordinates;
Automatically adjust size of overview image to size of browser window.
Bug fixes
Fix rendering of label image;
Show error message when creation of viewer fails;
Fix resolution of overview image;
Fix styling of point annotations;
Ensure bounding box annotations are axis aligned;
Add missing keyboard shortcut for navigation tool.
Main highlights of this release include:
Fix parsing of qualitative slice annotation;
Disable measurements panel interactions in MPR mode;
Fix parsing of segmentation when orientation values are close to zero;
Raise error if a frame's StudyInstanceUID, SeriesInstanceUID or SOPInstanceUID does not conform to the UID (DICOM UI VR) character repertoire;
Implement runtime tolerance for SEG loading retries;
Fix popup notification behavior;
Update cornerstoneWADOImageLoader.
Main highlights of this release include:
New features
Add panel for clinical trial information to case viewer;
Sort digital slides by Container Identifier attribute.
Enhancements
Reset style of optical paths to default when deactivating presentation state.
Bug fixes
Fix rendering of ROI annotations by upgrading to React version 1;
Correctly update UIDs of visible/active optical paths;
Fix type declarations of DICOMweb search resources.
Main highlights of this release include:
Add support for SR qualitative annotation per instance.
Main highlights of this release include:
New features
Support DICOM Advanced Blending Presentation State to parametrize the display of multiplexed IF microscopy images;
Add key bindings for annotations tools;
Enable configuration of tile preload;
Enable configuration of annotation geometry type per finding;
Expose equipment metadata in user interface.
Enhancements
Improve default presentation of multiplexed IF microscopy images in the absence of presentation state instances;
Correctly configure DCM4CHEE Archive to use reverse proxy URL prefix for BulkDataURI in served metadata;
Enlarge display settings interfaces and add input fields for opacity, VOI limits, and colors;
Update dicom-microscopy-viewer version to use web workers for frame decoding/transformation operations;
Add button for user logout;
Disable optical path selection when a presentation state has been selected.
Bug fixes
Fix parsing of URL path upon redirect after successful authentication/authorization;
Fix configuration of optical path display settings when switching between presentation states;
Fix caching of presentation states and for selection via drop-down menu.
Security
Update dependencies with critical security issues.
Main highlights of this release include:
Enhancements
Make overview panel collapsible and hide it entirely if lowest-resolution image is too large.
Bug fixes
Fix update of optical path settings when switching between slides.
Main highlights of this release include:
Fix regression in logic for finding a segmentation's referenced source image;
Fix segmentations loading issues;
Fix thumbnail series type for unsupported SOPClassUID;
Fix toolbar error when getDerivedDatasets finds no referenced series.
Main highlights of this release include:
New features
Display of analysis results stored as DICOM Segmentation, Parametric Map, or Microscopy Bulk Simple Annotations instances;
Dynamic selection of DICOMweb server by user (can be enabled by setting AppConfig.enableServerSelection to true);
Dark app mode for fluorescence microscopy (can be enabled by setting App.mode to "dark");
Support display of segments stored in DICOM Segmentation instances;
Support display of parameter mappings stored in DICOM Parametric Map instances;
Support display of annotation groups stored in DICOM Microscopy Bulk Simple Annotations instances;
Implement color transformations using ICC Profiles to correct color images client side in a browser-independent manner;
Implement grayscale transformations using Palette Color Lookup Tables to pseudo-color grayscale images.
Improvements
Unify handling of optical paths for color and grayscale images;
Add loading indicator;
Improve styling of overview map;
Render specimen metadata in a more compact form;
Improve fetching of WASM library code;
Improve styling of slide viewer sidebar;
Sort slides by Series Number;
Work around common standard compliance issues;
Update docker-compose configuration;
Upgrade dependencies;
Show examples in README;
Decode JPEG, JPEG 2000, and JPEG-LS compressed image frames client side in a browser-independent manner;
Improve performance of transformation and rendering operations using WebGL for both grayscale as well as color images;
Optimize display of overview images and keep overview image fixed when zooming or panning volume images;
Optimize HTTP Accept header field for retrieval of frames to work around issues with various server implementations.
Bug fixes
Ensure ROI annotations are re-rendered upon modification;
Clean up memory and recreate viewers upon page reload;
Fix selection of volume images;
Fix color space conversion during decoding of JPEG 2000 compressed image frames;
Fix unit of area measurements for ROI annotations;
Publish events when bulkdata loading starts and ends.
Main highlights of this release include:
Improve logic for finding a segmentation's referenced source image;
Improve debug dialog: fix text overflow and add the active viewport's referenced SEG and RTSTRUCT series.
Main highlights of this release include:
Fix fail to load SEG related to geometry assumptions;
Fix fail to load SEG related to tolerance;
Add initial support for SR planar annotations.
Main highlights of this release include:
Bug fixes
Fix selection of VOLUME or THUMBNAIL images with different Photometric Interpretation.
Main highlights of this release include:
Fix RTSTRUCT right panel updates;
Fix SEG loading regression.
Main highlights of this release include:
Fix handling of datasets with unsupported modalities;
Fix backward fetch of images for the current active series.
Fix tag browser slider.
Main highlights of this release include:
Bug fixes
Rotate box in overview map outlining the extent of the current view together with the image.
Main highlights of this release include:
Fix segmentation/rtstruct menu badge update when switching current displayed series;
Add a link icon to series thumbnails if they are connected to any annotation (segmentation, etc.);
Fix problems opening series when the study includes many series;
Fix segments visibility handler.
Main highlights of this release include:
Improvements
Include images with new flavor THUMBNAIL in image pyramid;
Properly fit overview map into HTML element and disable re-centering of overview map when user navigates main map;
Allow drawing of ROIs that extend beyond the slide coordinate system (i.e., allow negative ROI coordinates).
Bug fixes
Prevent display of annotation marker when ROI is deactivated
Main highlights of this release include:
Fix issues with segmentation orientations;
Fix display of inconsistencies warning for segmentation thumbnails;
Fix throttle thumbnail progress updates.
Main highlights of this release include:
Bug fixes
Set PUBLIC_URL in Dockerfile.
Main highlights of this release include:
Improvements
Add button to display information about application and environment;
Add ability to include logo;
Verify content of SR documents before attempting to load annotations;
Improve re-direction after authentication;
Add retry logic and error handlers for DICOMweb requests;
Improve documentation of application configuration in README;
Add unit tests.
Bug fixes
Disable zoom of overview map;
Fix pagination of worklist;
Prevent delay in tile rendering.
Main highlights of this release include:
Handle uncaught exception for non-TID 1500 SR;
Added display of badge numbers in the segmentation / rtstruct panel tabs;
Study prefetcher with loading bar.
Main highlights of this release include:
New features
Support for multiplexed immunofluorescence slide microscopy imaging;
Client-side additive blending of multiple channels using WebGL;
Client-side decoding of compressed frame items using WebAssembly based on Emscripten ports of libjpeg-turbo, openjpeg, and charls C/C++ libraries.
Improvements
Continuous integration testing pipeline using circle CI;
Deploy previews for manual regression testing.
Major changes
Introduce new configuration parameter renderer.
Main highlights of this release include:
Add exponential backoff and retry after 500 error;
Update to HTML SR viewport to display missing header tags.
Initial Release.
Main highlights of this release include:
Add disable server cache feature;
Additional improvements on series inconsistencies report UI.
Main highlights of this release include:
Add acquisition storage SR sopclass to SR html ext;
Fix missing items in the segmentation combobox items at loading;
Fix slices not being sorted in geometrical order;
Extend series inconsistencies checks to segmentation and improve UI.
Main highlights of this release include:
Add new log service to be used by debugger extension;
Add UI to communicate to the users inconsistencies within a single series;
Add time in the dates of the items of the segmentation combobox list;
Order segmentation combobox list in reverse time order;
Fix failure to load a valid SEG object because of incorrect expectations about ReferencedSegmentNumber;
Fix RTSTRUCT menu visibility when loading a series;
Fix image load slowness regression;
Fix choppy scrolling in 2D mode;
Fix failure to load segmentations when filtering study with '?seriesInstanceUID=' syntax.
Main highlights of this release include:
Replace instance dropdown to slider for dicom tag browser;
Add error page and not found pages if failed to retrieve study data.
Main highlights of this release include:
Add UI error report for MPR buffer limit related errors;
Add UI error report for hardware acceleration turned off errors;
Add IDC funding acknowledgment;
Fix RTSTRUCT menu panel undefined variables;
Fix RTSTRUCT menu visibility when loading a series;
Fix segments visibility control (SEG menu) bugs.
Main highlights of this release include:
Visualize overlapping segments;
Use runtime value configuration to get pkg version;
Fix navigation issues in the crosshair tool.
Main highlights of this release include:
Add MPR crosshair tool.
The IDC API is based on IDC Data Model concepts. Several of these concepts have been previously introduced in the context of the IDC Portal. We discuss these concepts here with respect to the IDC API.
As described previously, IDC data is versioned such that searching an IDC version according to some criteria (some filter set as described below) will always identify exactly the same set of DICOM objects.
The GET /versions API endpoint returns a list of the current and previous IDC data versions.
An original collection is a set of DICOM data provided by a single source. (We usually just use "collection" to mean an original collection.) Such collections are comprised primarily of DICOM image data that was obtained from some set of patients. However, some original collections also include annotations, segmentations or other analyses of the image data in the collection. Typically, the patients in a collection are related by a common cancer type, though this is not always the case.
The GET /collections endpoint returns a list of the original collections, in the current IDC version. Some metadata about each collection is provided.
Analysis results are comprised of DICOM data that was generated by analyzing data in one or more original collections. Typically such analysis is performed by a different entity than that which provided the original collection(s) on which the analysis is based. Examples of data in analysis collections include segmentations, annotations and further processing of original images.
Because a DICOM instance in an analysis result is "in" the same series and study as the DICOM instance data of which it is an analysis result, it is also "in" the same patient, and therefore is considered to be "in" the same collection.
Specifically, each instance in IDC data has an associated collection_id. An analysis result will have the same collection_id as the original collection of which it is an analysis result.
The GET /analysis_results endpoint returns a list of the analysis results, with some metadata, in the current IDC version.
A filter set selects some set of DICOM objects in IDC hosted data, and is a set of conditions, where each condition is defined by an attribute and an array of values. An attribute identifies a field (column) in some data source (BQ table). Each filter set also includes the IDC data version upon which it operates.
Filter sets are JSON encoded. Here is an example filter set:
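(The following is a reconstruction, consistent with the selection described below; the ebtwe suffix, denoting an inclusive range, is an illustrative choice.)
{
  "collection_id": ["TCGA_KIRC", "TCGA_LUAD"],
  "Modality": ["CT", "MR"],
  "age_at_diagnosis_ebtwe": [65, 75]
}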
A filter set selects a DICOM instance if, for every attribute in the filter set, the instance's corresponding value satisfies one or more of the values in the associated array of values. This is explained further below.
For example, the (attribute, [values]) pair ("Modality", ["MR", "CT"]) is satisfied if an instance "has" a Modality of MR or CT.
Note that if a filter set includes more than one (attribute, [values]) pair having the same attribute, then only the last such (attribute, [values]) pair is used. Thus if a filter group includes the (attribute, [values]) pairs ("Modality", ["MR"]) and ("Modality", ["CT"]), in that order, only ("Modality", ["CT"]) is used.
The filter set above will select any instance in the current IDC version that is in the TCGA_KIRC or TCGA_LUAD collections. To be selected by the filter, an instance must also have a Modality of CT or MR, and an age_at_diagnosis value between 65 and 75.
Because of the hierarchical nature of DICOM, if a filter set selects an instance, it implicitly selects the series, study, patient and collection which contain that instance. A manifest can be configured to return data about some or all of these entities.
Note that when defining a cohort through the API, the IDC version is always the current IDC version.
IDC maintains a set of GCP BigQuery (BQ) tables containing various types of metadata that together describe IDC data.
In the context of the API, a data source (or just source) is a BQ table that contains some portion of the metadata against which a filter set is applied. An API query to construct a manifest is performed against one or more such tables as needed.
Both the IDC Web App and API expose selected fields against which queries can be performed. The /filters endpoint returns the available filter attributes. The /filters/values/{filter} endpoint returns a list of the values which a specified Categorical String or Categorical Numeric filter attribute will match. Each attribute has a data type, one of:
String: An attribute with data type String may have an arbitrary string value. For example, the possible values of a StudyDescription attribute are arbitrary. An object is selected if its String attribute matches any of the values in the values array. Matching is insensitive to the case (upper case, lower case) of the characters in the strings. Thus ("StudyDescription", ["PETCT Skull-Thigh"]) will match a StudyDescription containing the substring "PETCT SKULL-THIGH", or "petct skull-thigh" etc. Pattern matching in String attributes is also supported. The ("StudyDescription", ["%SKULL%", "ABDOMEN%", "%Pelvis"]) filter will match any StudyDescription that contains "SKULL", "skull", "Skull", etc., starts with "ABDOMEN", "abdomen", etc., or ends with "Pelvis", "PELVIS", etc.
Categorical String An attribute with data type Categorical String will have one of a defined set of string values. For example, Modality is a Categorical String attribute that has possible values 'CT', 'MR', 'PT', etc. Categorical String attributes have the same matching semantics as Strings. The /filters/values/{filter} endpoint returns a list of the values accepted for a specified Categorical String attribute (filter).
Categorical Numeric An attribute with data type Categorical Numeric has one of a defined set of numeric values. The corresponding value array must have a single numeric value. The (attribute, value array) pair for a Categorical Numeric is satisfied if the attribute is equal to the value in the value array. The /filters/values/{filter} endpoint returns a list of the values accepted for a Categorical Numeric attribute (filter).
Ranged Integer An attribute with data type Ranged Integer will have an integer value. For example, age_at_diagnosis is an attribute of data type Ranged Integer. In order to enable relative numeric queries, the API exposes nine variations of each Ranged Integer attribute as filter attribute names. These variations are the base attribute name with one of the suffixes: eq, gt, gte, btw, btwe, ebtw, ebtwe, lte, or lt, e.g. age_at_diagnosis_eq. The value array of the btw, btwe, ebtw, and ebtwe variations must contain exactly two integer values, in numeric order (least value first). The value array of the eq, gt, gte, lte, and lt variations must contain exactly one integer value. The (attribute, value array) pair for a Ranged Integer attribute is satisfied according to the suffix as follows (see the example after this list):
eq: If an attribute is equal to the value in the value array
gt: If an attribute is greater than the value in the value array
gte: If an attribute is greater than or equal to the value in the value array
btw: if an attribute is greater than the first value and less than the second value in the value array
ebtw: if an attribute is greater than or equal to the first value and less than the second value in the value array
btwe: if an attribute is greater than the first value and less than or equal to the second value in the value array
ebtwe: if an attribute is greater than or equal to the first value and less than or equal to the second value in the value array
lte: If an attribute is less than or equal to the value in the value array
lt: If an attribute is less than the value in the value array
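For example, assuming the age_at_diagnosis attribute, the following illustrative (attribute, [values]) pairs select ages greater than 65, and ages in the inclusive range 65 to 75, respectively:
("age_at_diagnosis_gt", [65])
("age_at_diagnosis_ebtwe", [65, 75])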
Ranged Number An attribute with data type Ranged Number will have a numeric (integer or float) value. For example, diameter is an attribute of data type Ranged Number. In order to enable relative numeric queries, the API exposes nine variations of each Ranged Number attribute as filter attribute names. These variations are the base attribute name with one of the suffixes: eq, gt, gte, btw, btwe, ebtw, ebtwe, lte, or lt, e.g. diameter_eq. The value array of the btw, btwe, ebtw, and ebtwe variations must contain exactly two numeric values, in numeric order (least value first). The value array of the eq, gt, gte, lte, and lt variations must contain exactly one numeric value. The (attribute, value array) pair for a Ranged Number attribute is satisfied according to the suffix as follows:
eq: If an attribute is equal to the value in the value array
gt: If an attribute is greater than the value in the value array
gte: If an attribute is greater than or equal to the value in the value array
btw: if an attribute is greater than the first value and less than the second value in the value array
ebtw: if an attribute is greater than or equal to the first value and less than the second value in the value array
btwe: if an attribute is greater than the first value and less than or equal to the second value in the value array
ebtwe: if an attribute is greater than or equal to the first value and less than or equal to the second value in the value array
lte: If an attribute is less than or equal to the value in the value array
lt: If an attribute is less than the value in the value array
The API supports defining and saving cohorts, as well as accessing the user's previously saved cohorts, whether defined through the portal or the API. Through the API, the user can obtain information about their previously defined cohorts, including the definition of each cohort in terms of a filter set and IDC version. The user can also obtain a manifest of the objects in the cohort. The data in the manifest is highly configurable and can be used, with suitable tools, to obtain DICOM files from cloud storage. A manifest returned by the API can include values from a large set of fields.
The POST /cohorts API endpoint creates and saves a cohort as defined by a set of filters and other cohort metadata. Here is an example JSON encoded cohort definition:
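(A reconstruction with illustrative values; the filters component follows the filter set structure described above.)
{
  "name": "mycohort",
  "description": "Example cohort",
  "filters": {
    "collection_id": ["tcga_luad"],
    "Modality": ["CT", "MR"]
  }
}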
Note that the cohort definition does not include an idc_data_version, because the cohort's version is implicitly the current IDC version when defining a new cohort.
The new cohort is saved under the IDC account of the caller of the API endpoint. The GET /cohorts API endpoint returns a list of the currently saved cohorts of the caller.
The DELETE /cohorts/{cohort_id} endpoint deletes a cohort as specified by its cohort_id. The DELETE /cohorts API endpoint deletes zero or more cohorts as specified by a list of cohort_ids. A user may only delete their own cohorts.
This API is designed for use by developers of image analysis and data mining tools to directly query the public resources of the IDC and retrieve information into their applications. The API complements the IDC web application but eliminates the need for users to visit the IDC web pages to perform cohort creation, manifest export, and transfer of image data to some local file system.
The API is a RESTful interface, accessed through web URLs. There is no software that an application developer needs to download in order to use the API. The application developer can build their own access routines using just the API documentation provided. The interface employs a set of predefined query functions that access IDC data sources.
The IDC API is intended to enable exploration of IDC hosted data without the need to understand and use the Structured Query Language (SQL). To this end, data exploration capabilities through the IDC API are limited. However, IDC data is hosted using the standard capabilities of the Google Cloud Platform (GCP) Storage (GCS) and BigQuery (BQ) components. Therefore, all of the capabilities provided by GCP to access GCS storage buckets and BQ tables are available for more advanced interaction with that data.
The API will return collection metadata for the current IDC data version. The request can be run by clicking on the ‘Execute’ button.
The Swagger UI submits the request and shows the curl command that was submitted. The ‘Response body’ section will display the response to the request. The expected format of the response to this API request is shown below:
The actual JSON formatted response can be downloaded by selecting the ‘Download’ button.
Some of the API calls require authentication. This is denoted by a small lock symbol. Authentication can be performed by clicking on the ‘Authorize’ button at the top right of the page.
The syntax for all of the API data structures is detailed at the bottom of the UI page.
API Endpoints
The API can be accessed from the command line using curl or wget. Here we discuss using curl for this purpose.
You access an API endpoint by sending an HTTP request to the IDC API server. The server replies with a response that either contains the data you requested, or a status indicator. An API request URL has the following structure:
<BaseURL><API version><QueryEndpoint>?<QueryParameters>.
The <BaseURL> of the IDC API is https://api.imaging.datacommons.cancer.gov.
For example, this curl command requests metadata on all IDC collections from the V2 API:
curl -X GET "https://api.imaging.datacommons.cancer.gov/v2/collections" -H "accept: application/json"
Note, also, that the HTTP method defaults to GET. However, a POST or DELETE HTTP method must be specified with the -X parameter.
The IDC API UI displays the curl commands which it issues and thus can be a good reference when constructing your own curl commands.
Some of the API endpoints, such as /collections and /cohorts/preview, can be accessed without authorization. APIs that access user-specific data, such as saved cohorts, necessarily require account authorization.
To access those APIs that require IDC authorization, you will need to generate a credentials file. To obtain your credentials:
Execute the idc_auth.py script, e.g.:
$ python ./idc_auth.py
Refer to the idc_auth.py file for detailed instructions.
$ TOKEN=$(more ~/.idc_credentials | jq -r '.["token_response"]["id_token"]')
The extracted token can then be used to authenticate to the API, e.g. to get a list of your cohorts:
$ curl -X GET "https://api.imaging.datacommons.cancer.gov/v2/cohorts" -H "accept: application/json" -H "Authorization: Bearer $TOKEN"
If you pipe the result to jq:
$ curl -X GET "https://api.imaging.datacommons.cancer.gov/v2/cohorts" -H "accept: application/json" -H "Authorization: Bearer $TOKEN" | jq
Then you should see something like this:
In Python, we can issue the following request to obtain a list of the collections in the current IDC version:
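(A minimal sketch using the requests package; the exact shape of the response body should be checked against the API documentation.)
import requests

response = requests.get(
    "https://api.imaging.datacommons.cancer.gov/v2/collections",
    headers={"accept": "application/json"},
)
response.raise_for_status()
collections = response.json()  # a JSON object describing the original collections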
The /cohorts/manifest/preview and /cohorts/manifest/{cohort_id} endpoints are paged. That is, several calls of the API may be required to return all the data resulting from such a query. Each endpoint accepts a page_size parameter in the manifestBody or manifestPreviewBody that specifies the maximum number of rows that the client wants the server to return. The returned data from each of these APIs includes a next_page value. next_page is null if there is no more data to be returned; if it is non-null, more data is available.
In the case that the returned next_page value is not null, the /cohorts/manifest/nextPage or /cohorts/manifest/preview/nextPage endpoint can be accessed, passing the next_page token returned by the previous call.
The manifest endpoints may return an HTTP 202 status. This indicates that the request was accepted but that processing timed out before it was completed. In this case, the client should resubmit the request, including the next_page token that was returned with the 202 response.
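A paging loop might therefore look like the following sketch. The body values are illustrative, and whether the next_page token is passed as a query parameter or in the request body should be verified against the SwaggerUI.
import requests

BASE = "https://api.imaging.datacommons.cancer.gov/v2"
body = {
    "cohort_def": {
        "name": "mycohort",
        "description": "Example cohort",
        "filters": {"collection_id": ["tcga_luad"], "Modality": ["CT", "MR"]},
    },
    "fields": ["SeriesInstanceUID"],
    "page_size": 1000,
}

rows = []
page = requests.post(f"{BASE}/cohorts/manifest/preview", json=body).json()
rows.extend(page["manifest"]["manifest_data"])
# keep requesting pages until next_page comes back null
while page.get("next_page"):
    page = requests.get(
        f"{BASE}/cohorts/manifest/preview/nextPage",
        params={"next_page": page["next_page"], "page_size": 1000},
    ).json()
    rows.extend(page["manifest"]["manifest_data"])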
The version of the portal is shown at the bottom of the portal page. The semantics of the version is the following:
canceridc.<date of webapp deployment in YYYYMMDDHHMM>.<first 6 characters of the commit hash>.
on the Explore Images page the IDC internal id for each collection can now be copied from the Collections table by clicking the corresponding copy icon
on the Explore Images page the IDC case id can now be copied from the Selected Cases table by clicking the corresponding copy icon
Main highlights of this release include:
add a choice of several viewers (OHIF v2, OHIF v3, VolView, Slim) for viewing image files
Main highlights of this release include:
s5cmd file manifests can now be generated from the Explore images page for individual studies and series
Main highlights of this release include:
The file manifest for a filter can be downloaded without logging into the portal and creating a persistent cohort
Main highlights of this release include:
Three new Original Image attributes Max Total Pixel Matrix Columns, Max Total Pixel Matrix Rows, and Min Pixel Spacing are added.
Two new Quantitative Analysis attributes Sphericity (Quant) and Volume of Mesh are added.
Default attribute value order is changed from alphanumeric (by value name) to value count.
Main highlights of this release include:
As limited access collections have been removed from IDC, the portal is now simplified by removing the option of selecting different access levels. All collections in the portal are public.
A warning message appears on the cohort browser page when a user views a cohort that used the Access filter attribute. That attribute is no longer applied if the user migrates the cohort to the current version.
On the explorer page the reset button has been moved to improve viewability.
This was primarily a data release. There were no significant changes to the portal.
Main highlights of this release include:
User control over how selection of multiple filter modalities defines the cohort. Previously, when multiple modalities were selected, the cohort would include the cases that had ANY of the selected modalities. Now the user can choose whether the cohort includes the cases that contain ANY of the selected modalities or only those that have ALL of the selected modalities.
Main highlights of this release include:
Ability to select specific Analysis Results collections with segmentation and radiomic features
Text boxes added to the slider panels to allow the user to input upper and lower slider bounds
Pie chart tooltips updated to improve viewability
Main highlights of this release include:
Eleven new collections added
Number of cases, studies, and series in a cohort are reported in the filter definition
On the Exploration page the Access attribute is placed in the Search Scope
On the Exploration page users are warned when they create a cohort that includes Limited Access collections
Series Instance UID is reported in the Selected Series table
Main highlights of this release include:
The BigQuery query string corresponding to a cohort can now be displayed in user-readable format by pressing a button on either the cohort or cohort list pages
On the exploration page collections can now be sorted alphabetically or by the number of cases. Selected cases are ordered at the top of the collection list
Table rows can be selected by clicking anywhere within the row, not just on the checkbox
The BigQuery export cohort manifest includes the IDC data version as an optional column
Main highlights of this release include:
Collections which have limited access are now denoted as such in the Collection tab on the Exploration page
Links to image files belonging to limited collections have been removed from the Studies and Series tables on the Exploration page
The quota of image file data that can be served per user per day has been reduced from 137 to 40 GB
Main highlights of this release include:
New attributes including Manufacturer, Manufacturer Model Name, and Slice Thickness added
Checked attribute values are now shown at the top of the attribute value lists
Ability to search by CaseID added to the Selected Cases table
Ability to search by StudyID added to the Selected Studies table
Study Date added to the Studies Table
Changed the persistence of the StudyID tooltip in the tables so that the StudyID can be copied from the tooltip
Specific columns can now be selected in the BigQuery cohort export
The Imaging Data Commons Explore Image Data portal is a platform that allows users to explore, filter, create cohorts, and view image studies and series using cutting-edge technology viewers.
Main highlights of this release include:
Support for slide microscopy series from the CPTAC-LSCC and CPTAC-LUAD collections is now included.
Search boxes are included for every attribute to search for specific attribute values by name.
Main highlights of this release include:
112 data collections are now included
Cohort data version is reported
Cohort statistics, i.e. the number of cases, studies, and series per cohort, are now reported
Mechanism included to update a cohort to a new data version
Species Attribute is included
Checkbox and plus/minus icons are now used to select table rows
Main highlights of this release include:
The user details page will no longer return a 500 error when selected
Sorting of studies panel is now active for all fields
Re-sending of an unreceived verification email is now more clearly explained.
IDC identity login header and column selection is disabled for the export of a cohort manifest to BigQuery
Detailed information panel added to efficiently describe why some pie charts have multiple facets even when a filter is selected
Cohort manifest export popup can be scrolled down
Use of Shift or Control (Command for Mac) selection of studies will now behave as expected: Shift-select for a contiguous series of rows, Control/Command-select for individual rows.
All filter selections are now sorted alphabetically
Main highlights of this release include:
A consistent number of files is now returned between the portal and BigQuery
When the user clicks a non-gov link a popup will appear
Cohort manifest export information now has clickable URLs to take you to the BigQuery console
The collections list displays 100 entries by default
Any empty search criterion is now highlighted in grey and no data will be listed
The user will no longer need to scroll to see search criteria in the left search configuration panel
Portal footer is now in compliance with NCI requirements
Check/uncheck control in the collections panel added for the TCGA collection
Main highlights of this release include:
Case-level table is added to the portal
Cohorts can now be exported into BigQuery tables using the Export Cohort Manifest button
Cohorts with fewer than 650k rows can now be downloaded as a multipart file. Larger cohorts can only be exported to BigQuery (for users that are logged in with Google Accounts)
Quantitative filter ranges are updated dynamically with the updates to filter selection
Pie charts will display "No data available" message when zero cases are returned for the given filter selection
RTPLAN and Real World Mapping Attribute values are now disabled at the series level, since they cannot be visualized in the IDC Viewer
Various bug fixes in both the IDC Portal and IDC Viewer
Main features in this initial release include:
The ability to search for data in BigQuery and Solr
The ability to search by multiple attributes:
Collection
Original attributes e.g., Modality
Derived attributes e.g., Segmentations
Qualitative analysis e.g., Lobular pattern
Quantitative analysis e.g., Volume
Related attributes e.g., Country
Display of collections results in a tabular format with the following information:
Collection Name
Total Number of Cases
Number of Cases (this cohort)
Display of the Selected Studies results in tabular format with the following information:
Project Name
Case ID
Study ID
Study Description
Display of the Selected Series results in tabular format with the following information:
Study ID
Series Number
Modality
Body Part Examined
Series Description
The ability to hide attributes with zero cases present
The ability to save cohorts
The ability to download the manifest of any cohort created
The ability to promote, filter, and load multiple series instances in the OHIF viewer
IDC integrates two different viewers, which are used depending on the type of images being opened. Visualization of radiology images uses the open-source OHIF Viewer v3. The Slim viewer is used for visualization of pathology and slide microscopy images. We customized both of those viewers slightly to add features specific to IDC. You can find all of those modifications in the respective forks of the OHIF and Slim viewers under the IDC GitHub organization. IDC Viewer is opened every time you click the "eye" icon in the study or series table of the IDC Portal.
IDC Viewer is a "zero-footprint" client-side viewer: before you can see the image in the viewer, it has to be downloaded to your browser from the IDC DICOM stores. IDC Viewer receives this data through a proxy, via the DICOMweb interface implemented in GCP.
Currently, the IDC Viewer proxy limits the amount of data that can be downloaded in one day to 137 GB per IP address, and enforces a total daily quota over all of the IP addresses. If the quota is exhausted, you will not be able to see any images in IDC Viewer until the limit is reset. We may adjust the current proxy limits in the future, and you are welcome to provide your feedback on the appropriateness of the current quota.
If you want to report a problem related to visualization of a specific study in the IDC Viewer, please use the "Debug Info" tool to collect debugging information, and include the entire content of that debugging information in your report to help us investigate the issue.
You can use IDC Viewer to visualize any of the suitable data in IDC. To configure the IDC Viewer URL, simply append the StudyInstanceUID of a study available in IDC to the viewer URL prefix (one prefix for the radiology viewer, another for the digital pathology viewer). This will open the entire study in the viewer. You can also configure the URL to open specific series of the study, as defined by the list of SeriesInstanceUID items. When you open the IDC Viewer from the IDC Portal, the URLs of the pages will be populated following those conventions.
open the entire study with the StudyInstanceUID 1.3.6.1.4.1.14519.5.2.1.6279.6001.224985459390356936417021464571
open the specified subset of series from the study above
Digital pathology viewer uses a slightly different convention, as should be evident from this example URL:
You can use Google Firebase to deploy v2 radiology or microscopy viewers as web applications, without having to use virtual machines or docker, and for free!
You can also visualize images inside a Colab/Jupyter notebook; see the corresponding documentation section for details.
You can use the open-source VolView zero-footprint viewer to visualize and volume render any image series by simply pointing it to the cloud bucket with the files; see the corresponding documentation section for details.
You can copy identifiers of the individual collections, cases, studies or series to the clipboard; those can be used to download the corresponding files, as discussed in the download section, using the command-line download tool or the 3D Slicer IDC extension.
TL;DR: if you want to download images from IDC, you can do it without charge, limits or sign-ins from our cloud storage buckets. See the instructions in the download section below.
The primary mechanism for accessing data from IDC is by searching the metadata using the idc-index python package or BigQuery tables, and downloading the binary files from public cloud buckets, as discussed below. There is no limit, quota or fee associated with downloading IDC files from the buckets.
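For example, a minimal idc-index sketch (method and column names per the idc-index package documentation at the time of writing; verify against the current release):
from idc_index import index

client = index.IDCClient()
# query the package's built-in metadata index with SQL
df = client.sql_query("SELECT SeriesInstanceUID FROM index WHERE Modality = 'CT' LIMIT 1")
# download all files of the selected series to the current directory
client.download_from_selection(seriesInstanceUID=df["SeriesInstanceUID"].tolist(), downloadDir=".")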
Effective March 2024, as a pilot project, IDC also provides access to the DICOM data via a dedicated, read-only DICOMweb endpoint. It routes the requests to the Google Healthcare API DICOM store containing IDC data.
Use the command-line python tool or the 3D Slicer IDC browser extension to download the files for your selection, as discussed in the download section.
The OHIF Viewer is a zero-footprint medical image viewer provided by the Open Health Imaging Foundation (OHIF). It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support DICOMweb.
Slim is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the dicom-microscopy-viewer library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication System (PACS), or Vendor Neutral Archive (VNA).
Filter sets were introduced previously. In this section we describe how filter sets are specified to the API.
The SwaggerUI can be used to see details about the syntax for each call, and also provides an interface to test requests. Each endpoint is also documented in the corresponding section.
For a quick demonstration of the syntax of an API call, test the GET /collections request. You can experiment with this endpoint by clicking the ‘Try it out’ button, and then the 'Execute' button.
This section describes version 2 of the IDC REST API. The documentation for the version 1 API can be found in the corresponding section.
The IDC API conforms to the OpenAPI specification, which "defines a standard, language-agnostic interface to RESTful APIs which allows both humans and computers to discover and understand the capabilities of the service without access to source code, documentation, or through network traffic inspection."
If you have feedback about the desired features of the IDC API, please let us know via the IDC support forum.
SwaggerUI is a web-based interface that allows users to try out APIs and easily view their documentation. You can access the IDC API SwaggerUI online.
This serves as an interactive tutorial for accessing the IDC API using Python.
The SwaggerUI can be used to see details about the syntax for each call, and also provides an interface to test requests.
For a quick demonstration of the syntax of an API call, test the GET /collections request. You can experiment with this endpoint by clicking the ‘Try it out’ button.
Clone the idc_auth.py script to your local machine.
The jq utility is useful when dealing with JSON in the command line context. Assuming jq is installed, and that idc_auth.py has created the credentials file ~/.idc_credentials (the default location), then the following will extract the id token to a variable:
We expect that most API access will be programmed access, and, moreover, that most programmed access will be from within a Python script. This usage is covered in detail (along with details on each of the IDC API endpoints) in the Google Colab notebook. Here we provide just a brief overview.
where the revision hash corresponds to that of the deployed commit.
The Export Cohort Manifest popup now includes options to download manifests that can be used by s5cmd to download image files from IDC's s3 buckets in GCP or AWS. Instructions are provided for using s5cmd with these manifests
The Slim viewer is now configured to view slide microscopy series
A manifest is a table of access methods and other metadata of the objects in some cohort. There are two manifest endpoints. The POST /cohorts/manifest/{cohort_id} API endpoint returns a manifest of some previously defined cohort. Parameters are sent to the endpoint in the request body. The JSON schema of the manifestBody can be seen on the IDC API v2 UI page. Here is an example:
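(A reconstruction with illustrative values; the field list matches the example manifest discussed later in this section.)
{
  "fields": ["Modality", "SliceThickness", "age_at_diagnosis", "aws_bucket", "crdc_series_uuid"],
  "counts": false,
  "group_size": false,
  "sql": false,
  "page_size": 1000
}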
The fields parameter of the body indicates the fields whose values are to be included in the returned manifests. The /fields API endpoint returns a list of the fields that can be included in a manifest.
The counts, group_size, sql and page_size parameters will be described in subsequent sections.
Every row in the returned manifest will include one value for each of the above fields.
The POST /cohorts/manifest/preview API accepts both a fields list, and a cohort definition in the manifestPreviewBody. Here is an example manifestPreviewBody:
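(A reconstruction with illustrative values.)
{
  "cohort_def": {
    "name": "mycohort",
    "description": "Example cohort",
    "filters": {
      "collection_id": ["tcga_luad"],
      "Modality": ["CT", "MR"]
    }
  },
  "fields": ["Modality", "SliceThickness", "age_at_diagnosis", "aws_bucket", "crdc_series_uuid"],
  "counts": false,
  "group_size": false,
  "page_size": 1000
}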
This endpoint behaves like the following API sequence:
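POST /cohorts
POST /cohorts/manifest/{cohort_id}
DELETE /cohorts/{cohort_id}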
That is, it behaves as if a cohort is created, a manifest for that cohort is returned and the new cohort is deleted.
The /cohorts/manifest/{cohort_id} endpoint returns a manifestResponse JSON object and the /cohorts/manifest/preview endpoint returns a manifestPreviewResponse JSON object. Here is an example manifestResponse:
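(A reconstruction of the general shape; the component names follow the descriptions below, and all values are illustrative.)
{
  "cohort_def": {
    "cohort_id": 123,
    "name": "mycohort",
    "description": "Example cohort",
    "filters": {"collection_id": ["tcga_luad"], "Modality": ["CT", "MR"]},
    "idc_data_version": "..."
  },
  "user_email": "user@example.com",
  "manifest": {
    "manifest_data": [
      {"Modality": "CT", "SliceThickness": "1.0", "age_at_diagnosis": 65,
       "aws_bucket": "idc-open-data", "crdc_series_uuid": "..."}
    ],
    "rowsReturned": 626,
    "totalFound": 626
  },
  "next_page": null
}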
The cohort definition is included so that the manifest is self-documenting. The manifest_data component of the manifest component contains a row for each distinct combination of the requested fields in the cohort. The idc_data_version in the cohort_def is the IDC version when the cohort was created. To generate the manifest, the cohort's filter is applied against the data in that IDC version.
The structure of the manifestPreviewResponse returned by the /cohorts/manifest/preview API endpoint is identical to the manifestResponse except that it does not have a cohort_id or user_email component.
Because the /cohorts/manifest/preview API endpoint is always applied against the current IDC version, the idc_data_version in the cohort_def is always that of the current IDC version.
The next_page value is described in the next section.
We use the term group to indicate the set of all instances in the cohort having the values of some row in the manifest. Thus the values of the first row above:
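{"Modality": "CT", "SliceThickness": "1.0", "age_at_diagnosis": 65, "aws_bucket": "idc-open-data", "crdc_series_uuid": "..."}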
implicitly define a group of instances in the cohort, each of which has those values.
When the group_size parameter in the manifestBody or manifestPreviewBody is true, the resulting manifest includes the total size in bytes of the instances in the corresponding group. Following is a fragment of the manifest for the same cohort above, but with group_size enabled:
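(Reconstructed; only the group_size value is taken from the discussion below.)
{"Modality": "CT", "SliceThickness": "1.0", "age_at_diagnosis": 65, "aws_bucket": "idc-open-data", "crdc_series_uuid": "...", "group_size": 2690320}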
Here we see that the instances in the group corresponding to the first result row have a total size of 2,690,320 bytes.
The totalFound value at the end of the manifest tells us that there are 626 rows in the manifest, meaning the manifest contains 626 different combinations of Modality, SliceThickness, age_at_diagnosis, aws_bucket, and crdc_series_uuid. (The group size does not add to the combinatorics.) The rowsReturned value indicates that all the rows in the manifest were returned in the first "page". If not all the rows had been returned, we could ask for additional "pages" as described in the next section.
The group_size parameter is optional and defaults to false.
If the counts parameter is true, the resulting manifest will selectively include counts of the instances, series, studies, patients and collections in each group. Which counts are included in a manifest is determined by the manifest's granularity, which, in turn, is determined by certain of the possible fields in the fields parameter list of the manifestBody or manifestPreviewBody.
For example, if the fields parameter list includes the SOPInstanceUID field, there will be one group per instance in the manifest. Thus the manifest has instance granularity. A manifest has one of instance, series, study, patient, collection or version granularity.
For a given manifest granularity, and when counts is True, counts of the "lower level" objects are reported in the manifest. Thus, if a cohort has series granularity, then the count of all instances in each group is reported. If a cohort has study granularity, then the count of all instances in each group and of all series in each group is reported. And so on. This is described in detail in the remainder of this section.
In the following, manifest examples are based on this filterSet:
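(A reconstruction with illustrative values, consistent with the manifest fragments shown later in this section.)
{
  "collection_id": ["tcga_luad"],
  "Modality": ["CT", "MR"]
}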
A manifest will have instance granularity if the fields parameter list includes one or both of the fields:
SOPInstanceUID
crdc_instance_uuid
Both of these fields are unique to each instance. Therefore the resulting manifest will include one row for each instance in the specified cohort. For example, the following fields list will result in a manifest having a row per instance:
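["SOPInstanceUID", "Modality", "SliceThickness"]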
Each row will include the SOPInstanceUID, Modality and SliceThickness of the corresponding instance.
The counts parameter is ignored because there are no objects at a 'lower level' than instances.
A manifest will have series granularity if it does not have instance granularity and the fields parameter list includes one or more of these fields:
SeriesInstanceUID
crdc_series_uuid
Both of these fields are unique to each series, and therefore the resulting manifest will include at least one row per series in the specified cohort. For example, the following fields list will result in a manifest having one or more rows per series:
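["SeriesInstanceUID", "Modality", "SliceThickness"]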
Because the SeriesInstanceUID is unique to each series in a cohort (more accurately, all instances in a series have the same SeriesInstanceUID), there will be at least one row per series in the resulting manifest. However, SliceThickness is not necessarily unique across all instances in a series. Therefore, the resulting manifest may have multiple rows for a given series: rows in which the SeriesInstanceUID is the same but the SliceThickness values differ. DICOM Modality should always be the same for all instances in a series; therefore it is not expected to result in multiple rows per series.
If the counts parameter is true, each row of the manifest will have:
an instance_count value that is the count of instances in the group corresponding to the row
If the fields list is as above, then this is a fragment of the series granularity manifest of our example cohort:
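(Reconstructed; only the instance_count value is taken from the discussion below.)
{"SeriesInstanceUID": "...", "Modality": "CT", "SliceThickness": "1.0", "instance_count": 151}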
This tells us that the group of instances corresponding to the first row of the manifest results has 151 members.
A manifest will have study granularity if it does not have series or instance granularity and the fields list includes one or more of the fields:
StudyInstanceUID
crdc_study_uuid
Both of these fields are unique to each study, and therefore the resulting manifest will include at least one row per study in the specified cohort. For example, the following fields list will result in a manifest having one or more rows per study:
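["StudyInstanceUID", "Modality", "SliceThickness"]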
Similarly, SliceThickness can vary not only among the instances in a series, but among the series in a study. Therefore, the resulting manifest may have multiple rows for a study, which differ from each other in both SliceThickness and Modality.
If the counts parameter is true, each row of the manifest will have:
an instance_count value that is the count of instances in the group corresponding to the row
a series_count value that is the count of series in the group corresponding to the row
If the fields list is as above, then this is a fragment of the study granularity manifest of our example cohort:
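(Reconstructed; only the counts are taken from the discussion below.)
{"StudyInstanceUID": "...", "Modality": "CT", "SliceThickness": "1.0", "instance_count": 212, "series_count": 2}
...
{"StudyInstanceUID": "...", "Modality": "MR", "SliceThickness": null, "instance_count": 2, "series_count": 1}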
This tells us that the group of instances corresponding to the first row of the manifest results has 212 members, divided among two series. The group of instances corresponding to the third row of the manifest results has two members in a single series.
A manifest will have patient granularity if it does not have study, series or instance granularity and the fields list includes the field PatientID. This field is unique to each patient, and therefore the resulting manifest will include at least one row per patient in the specified cohort. For example, the following fields list will result in a manifest having one or more rows per patient:
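(An illustrative choice of fields:)
["PatientID", "Modality", "SliceThickness"]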
Because the PatientID is unique to each patient in a cohort (more accurately, all instances associated with a patient have the same PatientID), there will be at least one row per patient in the resulting manifest. It is common for a patient's series to examine different body parts. Therefore, the resulting manifest may well have more than one row per patient.
If the counts parameter is true, each row of the manifest will have:
an instance_count value that is the count of instances in the group corresponding to the row
a series_count value that is the count of series in the group corresponding to the row
a study_count value that is the count of studies in the group corresponding to the row
If the fields list is as above, then this is a fragment of the patient granularity manifest of our example cohort:
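(Reconstructed; only the counts are taken from the discussion below.)
{"PatientID": "...", "Modality": "CT", "SliceThickness": "1.0", "instance_count": 212, "series_count": 2, "study_count": 1}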
This tells us that the group of instances corresponding to the first row of the manifest results has 212 members divided among two series, and both in a single study.
A manifest will have collection granularity if it does not have patient, study, series or instance granularity and the fields parameter list includes the field collection_id. This field is unique to each collection, and therefore the resulting manifest will include at least one row per collection in the specified cohort. For example, the following fields list will result in a manifest having one or more rows per collection:
Because the collection_id is unique to each collection in a cohort (more accurately, all instances in a collection have the same collection_id), there will be at least one row per collection in the resulting manifest. It is common for a collection to have patients of different ages. Therefore, the resulting manifest may well have more than one row per collection.
If the fields list is as follows:
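(An illustrative choice of fields, consistent with the discussion above:)
["collection_id", "age_at_diagnosis"]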
then this is a fragment of the collection granularity manifest of our example cohort:
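(Reconstructed with illustrative values and counts.)
{"collection_id": "tcga_luad", "age_at_diagnosis": 65, "instance_count": 1320, "series_count": 42, "study_count": 18, "patient_count": 9}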
A manifest will have version granularity if it does not have collection, patient, study, series or instance granularity. At this granularity level, the rows in the manifest return the combinations of queried values across all collections, patients, studies, series and instances in the cohort.
When the fields list is as follows:
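["Modality", "SliceThickness"]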
then this is a fragment of the version granularity manifest of our example cohort:
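(Reconstructed; the instance count and totalFound value are taken from the discussion below.)
{"Modality": "CT", "SliceThickness": null, "instance_count": 212}
...
"totalFound": 87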
Row one of the results tells us that the cohort has 212 instances having a Null SliceThickness and modality="CT". Also, there are apparently 87 different combinations of Modality and SliceThickness in the cohort as shown by the totalFound value.
The IDC API is based on several IDC Data Model Concepts.
In IDC, a cohort is a set of subjects (DICOM patients) that are identified by applying a Filter Set to the Data Sources of some IDC data version. Because a cohort is defined with respect to an IDC data version, the set of subjects in the cohort, as well as all metadata associated with those subjects, is exactly and repeatably defined.
Over time, the set of data hosted by the IDC will change. For the most part, such changes will be due to new data having been added. The totality of IDC hosted data resulting from any such change is represented by a unique IDC data version ID. That is, each time that the set of publicly available data changes, a new IDC version is created that exactly defines the revised data set.
The IDC data version is intended to enable the reproducibility of research results. For example, consider a patient in the DICOM data model. Over time, new studies might be performed on a patient and become associated with that patient, and the corresponding DICOM instances will then be added to the IDC hosted data. Moreover, additional patients might well be added to the IDC data set over time. This means that the set of subjects defined by some filtering operation will change over time. Thus, for purposes of reproducibility, we define a cohort in terms of a set of filter groups and an IDC data version.
Note that on occasion some data might be removed from a collection, though this is expected to be rare. Such a removal will result in a new IDC data version which excludes that data. Such removed data will, however, continue to be available in any previous IDC data version in which it was available. There is one exception: data that is found to contain Personally Identifiable Information (PII) or Protected Health Information (PHI) will be removed from all IDC data versions.
Note: currently a cohort is always defined in terms of a single filter group and an IDC Data Version. In the future we may add support for multiple filter groups.
A filter group selects some set of subjects in the IDC hosted data, and is a set of conditions, where each condition is defined by an attribute and an array of values. An attribute identifies a field (column) in some data source (BQ table). Each filter group also specifies the IDC data version upon which it operates.
A filter group selects a subject if, for every attribute in the filter group, some datum associated with the subject satisfies one or more of the values in the associated array of values. A datum satisfies a value if it is equal to, less than, less than or equal to, between, greater than or equal to, or greater than, as required by the attribute. This is explained further below.
For example, the (attribute, [values]) pair (Modality, [MR, CT]) is satisfied if a subject "has" a Modality of MR or CT in any data associated with that subject. Thus, this (attribute, [values]) pair would be satisfied, for example, by a subject who has one or more MR series but no CT series.
Note that if a filter group includes more than one (attribute, [values]) pair having the same attribute, then only the last such (attribute, [values]) pair is used. Thus if a filter group includes the (attribute, [values]) pairs (Modality, [MR]) and (Modality, [CT]), in that order, only (Modality, [CT]) is used.
Here is an example filter group:
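(A reconstruction consistent with the description below; the _btw suffix is one plausible encoding of the age range.)
{
  "collection_id": ["TCGA-LUAD", "TCGA-KIRC"],
  "Modality": ["CT", "MR"],
  "race": ["WHITE"],
  "age_at_diagnosis_btw": [53, 69]
}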
This filter group will select any subject in the TCGA-LUAD or TCGA-KIRC collections, if the subject has any DICOM instances having a modality of CT or MR, the subject's race is WHITE, and the subject's age at diagnosis is between 53 and 69.
A collection is a set of DICOM data provided by a single source. Collections are further categorized as Original collections or Analysis collections. Original collections are comprised primarily of DICOM image data that was obtained from some set of patients. Typically, the patients in an Original collection are related by a common disease.
Analysis collections are comprised of DICOM data that was generated by analyzing other (typically Original) collections. Typically such analysis is performed by a different entity than that which provided the original collection(s) on which the analysis is based. Examples of data in analysis collections include segmentations, annotations and further processing of original images. Note that some Original collections include such data, though most of the data in Original collections are original images.
A data source is a BQ table that contains some part of the IDC metadata complement. API queries are performed against one or more such tables that are joined (in the relational database model sense). Data sources are classified as being of type Original, Derived or Related. Original data sources contain DICOM metadata from the DICOM objects in TCIA Original and TCIA Analysis collections. Derived data sources contain processed data: in general this is analytical data that has been processed to enable easier SQL searches. Related data sources contain ancillary data that may be specific to some set of collections. For example, TCGA biospecimen and clinical data are maintained in such tables.
Data sources are versioned. That is, when the data in a data source changes, a new version of that set of data is defined. An IDC data version is defined in terms of a specific version of each data source. Note that over time, new data sources may be added (or, less likely, removed). Thus two IDC data versions may have a different number of data sources.
Both the IDC Web App and API expose selected fields in the various data sources against which queries can be performed. Each attribute has a data type, one of:
Continuous Numeric An attribute with data type Continuous Numeric will have a numeric (float) value. For example, age_at_diagnosis is an attribute of data type Continuous Numeric. In order to enable relative numeric queries, the API exposes nine variations of each Continuous Numeric attribute as filter set attribute names. These variations are the base attribute name with no suffix, as well as the base attribute name with one of the suffixes: _gt, _gte, _btw, _btwe, _ebtw, _ebtwe, _lte, _lt. The value array of the _btw, _btwe, _ebtw, and _ebtwe variations must contain exactly two numeric values, in numeric order (least value first). The value array of the other variations must contain exactly one numeric value. The (attribute, value array) pair for a Continuous Numeric attribute is satisfied according to the suffix as follows:
<no suffix>: if the attribute is equal to the value in the value array
_gt: if the attribute is greater than the value in the value array
_gte: if the attribute is greater than or equal to the value in the value array
_btw: if the attribute is greater than the first value and less than the second value in the value array
_ebtw: if the attribute is greater than or equal to the first value and less than the second value
_btwe: if the attribute is greater than the first value and less than or equal to the second value
_ebtwe: if the attribute is greater than or equal to the first value and less than or equal to the second value
_lte: if the attribute is less than or equal to the value in the value array
_lt: if the attribute is less than the value in the value array
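For example, some hypothetical filter fragments using the age_at_diagnosis attribute mentioned above:

```python
# Each (attribute, [values]) pair below is a hypothetical filter fragment.
{"age_at_diagnosis": [65]}            # equal to 65
{"age_at_diagnosis_gte": [53]}        # greater than or equal to 53
{"age_at_diagnosis_btw": [53, 69]}    # greater than 53 and less than 69
{"age_at_diagnosis_ebtwe": [53, 69]}  # between 53 and 69, inclusive
```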
Categorical Numeric An attribute with data type Categorical Numeric has one of a defined set of numeric values. The corresponding value array must have a single numeric value.
A manifest is a list of access methods, and other metadata, for the data objects in some cohort. There are two types of access methods:
GUID
>> curl https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/bd68332e-521f-4c45-9a88-e9cc426f5a8d
{ "access_methods":[{ "access_id":"gs", "access_url":{ "url":"gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm" }, "region":"", "type":"gs" } ], "aliases":[ ], "checksums":[ { "checksum":"9a63c81a4b3b4bc3950678a4e9acc930", "type":"md5" } ], "contents":[ ], "created_time":"2021-08-27T21:15:02.385181", "description":null, "form":"object", "id":"dg.4DFC/bd68332e-521f-4c45-9a88-e9cc426f5a8d", "mime_type":"application/json", "name":"", "self_uri":"drs://nci-crdc.datacommons.io/dg.4DFC/bd68332e-521f-4c45-9a88-e9cc426f5a8d", "size":528622, "updated_time":"2021-08-27T21:15:02.385185", "version":"faf7385b" }
Resolving such a GUID returns a DrsObject. The access methods in the returned DrsObject include one or more URLs at which corresponding DICOM entities can be accessed. GUID manifests are recommended for long term archival and reference.
In the above, we can see that the returned DrsObject includes the GCS URL gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm.
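The same resolution can be scripted; a minimal sketch using Python's requests package (the GUID is the one from the example above):

```python
import requests

# Resolve a CRDC GUID to a DrsObject and extract its GCS URL.
DRS_SERVER = "https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/"
guid = "bd68332e-521f-4c45-9a88-e9cc426f5a8d"

drs_object = requests.get(DRS_SERVER + guid).json()
gcs_url = next(
    m["access_url"]["url"]
    for m in drs_object["access_methods"]
    if m["type"] == "gs"
)
print(gcs_url)  # gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm
```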
URL
The URLs in a URL based manifest can be used to directly access a DICOM instance in Google Cloud Storage. URLs are structured as follows:
gs://<GCS bucket>/<GUID>.dcm
This is a typical URL:
gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm
Though this is rare, the URL of an object can change over time. In such a case, the corresponding DrsObject will be updated with the new URL. However, the original URL will then be "stale".
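For example, here is a minimal sketch that fetches a single instance from such a URL using the google-cloud-storage client (this assumes, as described elsewhere in this documentation, that the bucket is publicly readable):

```python
from google.cloud import storage

# IDC buckets are public, so an anonymous client can read them
# without credentials.
client = storage.Client.create_anonymous_client()
blob = client.bucket("idc-open").blob("bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm")
blob.download_to_filename("bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm")
```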
Additional values can optionally be included in the returned manifest. See the manifest API descriptions for more details.
Some of the API calls require authentication. This is denoted by a small lock symbol. Authentication can be performed by clicking on the ‘Authorize’ button at the top right of the page.
The API will return collection metadata for the current IDC data version. The request can be run by selecting ‘Execute’.
The Swagger UI submits the request and shows the curl code that was submitted. The ‘Response body’ section will display the response to the request. The expected format of the response to this API request is shown below:
The actual JSON formatted response can be downloaded by selecting the ‘Download’ button.
The syntax for all of the API data structures is detailed at the bottom of the UI page.
This section describes v1 of the IDC REST API. This API is designed for use by developers of image analysis and data mining tools to directly query the public resources of the IDC and retrieve information into their applications. The API complements the IDC web application but eliminates the need for users to visit the IDC web pages to perform cohort creation, manifest export, and transfer of image data to a local file system.
The API is a RESTful interface, accessed through web URLs. There is no software that an application developer needs to download in order to use the API. The application developer can build their own access routines using just the API documentation provided. The interface employs a set of predefined query functions that access IDC data sources.
The IDC API is intended to enable exploration of IDC hosted data without the need to understand and use the Structured Query Language (SQL). To this end, data exploration capabilities through the IDC API are limited. However, IDC data is hosted using the standard capabilities of the Google Cloud Platform (GCP) Storage (GCS) and BigQuery (BQ) components. Therefore, all of the capabilities provided by GCP to access GCS storage buckets and BQ tables are available for more advanced interaction with that data.
This page provides details on each of the IDC API endpoints.
The following characteristics apply to all IDC APIs:
You access a resource by sending an HTTP request to the IDC API server. The server replies with a response that either contains the data you requested, or a status indicator.
An API request URL has the following structure: <BaseURL><API version><QueryEndpoint>?<QueryParameters>. For example, this curl command is a request for metadata on all IDC collections:
curl -X GET "https://api.imaging.datacommons.cancer.gov/v1/collections" -H "accept: application/json"
Authorization
Some of the APIs, such as /collections and /cohorts/preview, can be accessed without authorization. APIs that access user specific data, such as cohorts, necessarily require account authorization.
To access these APIs that require IDC authorization, you will need to generate a credentials file. To obtain your credentials:
Execute the idc_auth.py script, either from the command line or from within Python. Refer to the idc_auth.py file for detailed instructions.
Several IDC APIs, specifically /cohorts/manifest/preview, /cohorts/manifest/{cohort_id}, /cohorts/query/preview, /cohorts/query/{cohort_id}, and /dicomMetadata, are paged. That is, several calls of the API may be required to return all the data resulting from such a query. Each accepts a page_size query parameter that specifies the maximum number of objects the client wants the server to return. The returned data from each of these APIs includes a next_page value. next_page is null if there is no more data to be returned; if next_page is non-null, then more data is available.
There are corresponding /cohorts/manifest/nextPage, /cohorts/query/nextPage, and /dicomMetadata/nextPage endpoints that each accept two query parameters: next_page and page_size. When the returned next_page value is not null, the corresponding ../nextPage endpoint is accessed, passing the next_page token returned by the previous call.
The manifest and query endpoints may return an HTTP 202 status. This indicates that the request was accepted but processing timed out before it was completed. In this case the client should resubmit the request, including the next_page token that was returned with the 202 response.
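As a sketch of this paging pattern in Python (the endpoint paths and next_page semantics follow the description above; the placeholder token and cohort ID, and the exact shape of the response payload, are assumptions):

```python
import requests

BASE_URL = "https://api.imaging.datacommons.cancer.gov/v1"
token = "<ID token obtained via idc_auth.py>"  # placeholder
headers = {"Authorization": f"Bearer {token}"}
cohort_id = 123  # hypothetical cohort ID

# First page: request the cohort manifest with a page size of 1000.
page = requests.get(
    f"{BASE_URL}/cohorts/manifest/{cohort_id}",
    params={"page_size": 1000},
    headers=headers,
).json()
pages = [page]

# Keep calling the corresponding nextPage endpoint until next_page is null.
# (On an HTTP 202, resubmit the request with the returned next_page token.)
while page.get("next_page"):
    page = requests.get(
        f"{BASE_URL}/cohorts/manifest/nextPage",
        params={"next_page": page["next_page"], "page_size": 1000},
        headers=headers,
    ).json()
    pages.append(page)
```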
Use the IDC-provided Looker Studio template to build a custom dashboard for your cohort
Step 1: Prepare the manifest BigQuery table
Export the cohort manifest as a BigQuery table, and take note of the location of the resulting table.
Step 2: Duplicate the template
When prompted, do not change the default options, and click "Copy Report".
Step 3: Configure data source
Select "Resource > Manage added data sources"
Select "Edit" action:
Update the custom query as instructed. This will select all of the DICOM metadata available for the instances in your cohort.
For example, if the location of your manifest table is canceridc-user-data.user_manifests.manifest_cohort_101_20210127_213746, the custom query that will join your manifest with the DICOM metadata will be similar to the following:
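The exact query text is provided with the template's instructions; as a rough sketch of its shape (the SOPInstanceUID join column and the bigquery-public-data.idc_current.dicom_all table path are assumptions), shown here as a Python string, though in Looker Studio you paste the SQL itself into the custom query field:

```python
# Hypothetical custom query joining the manifest table with IDC DICOM
# metadata; follow the template's instructions for the exact text.
CUSTOM_QUERY = """
SELECT dicom_all.*
FROM `bigquery-public-data.idc_current.dicom_all` AS dicom_all
JOIN `canceridc-user-data.user_manifests.manifest_cohort_101_20210127_213746` AS manifest
  ON dicom_all.SOPInstanceUID = manifest.SOPInstanceUID
"""
```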
Once you have updated the query, click "Reconnect" in the upper right corner.
Make sure you select a valid Billing Project that you can use to support the queries!
Accept the prompt below, if shown (you may also be notified about changes to the schema of the table, so the message may differ).
Click "Done" on the next screen:
Click "Close" on the next screen:
You are done! The dashboard for your cohort is now live: you can "View" it to interact with the content, you can edit it to explore additional attributes of the cohort, and you can choose to keep it private or share it with a link!
Topic-specific dashboards
Program- and Collection-specific dashboards
In this section you can learn how to quickly make a custom Looker Studio dashboard to explore the content of your cohort, and find some additional examples of using Looker Studio for analyzing the content of IDC.
Follow these steps:
Once you have created a VM and your setup is complete, it’s very easy to connect to your VMs through SSH or the web desktop interface.
It is free for academics!
You can do a lot with the basic credit allocation! Entry-level allocations can be on the order of 100,000 SUs, while the burn rate is, for example, 8 SUs/hour for a medium-sized VM (8 CPUs/30 GB RAM). As a reference:
it takes about 1 hour to build the Slicer application from scratch on a medium-sized VM using 7 threads
Geared to help you save! Unlike the VMs you get from the commercial providers, JetStream VMs can be shelved. Once a VM is shelved, you spend zero SUs for keeping it around (in comparison, you will keep paying for the disk storage of your GCP VMs even when they are turned off).
Customer support is excellent! We received responses within 1-2 days. On some occasions, we observed glitches with Web Desktop, but those could often be resolved by restarting the VM.
Google Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources including GPUs.
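For example, a minimal sketch of getting started with GCP services from inside a Colab notebook (the project ID is a placeholder you would replace with your own billing project):

```python
# Inside a Colab notebook: authenticate once, after which GCP client
# libraries (e.g., BigQuery) can be used with your own billing project.
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
client = bigquery.Client(project="my-project")  # replace with your project
```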
Potentially interesting sources of example notebooks:
A GCP VM you want to use for code development is up and running
Run the gcloud compute config-ssh command to populate SSH config files with host entries for each VM instance you have running
If the previous step completed successfully, you should see the running VMs in the Remote Explorer of VS Code, as in the screenshot below, and should be able to open a new session to those remote VMs.
Note that the SSH configuration may change if you restart your VM. In this case you will need to re-run the configuration step above.
"In statistics, marketing and demography, a cohort is a group of who share a defining characteristic (typically subjects who experienced a common event in a selected time period, such as birth or graduation)." ()
String An attribute with data type String may have an arbitrary string value. For example, the possible values of a StudyDescription attribute are arbitrary. When the values array of a (String attribute, [values]) pair contains a single value, an SQL LIKE operator is used, and standard SQL syntax and semantics are supported. Thus ("StudyDescription", ["%SKULL%"]) will match any StudyDescription that contains "SKULL". When the values array of a (String attribute, [values]) pair contains more than one value, an SQL UNNEST operator is used, and standard SQL syntax and semantics are supported. See the documentation for details.
Categorical String An attribute with data type Categorical String will have one of a defined set of string values. For example, Modality is an attribute, and has possible values 'CT', 'MR', 'SR', etc. In this case, the values are defined by the DICOM specification. The defined values of other Categorical String attributes may be established by other entities. When the values array of a (Categorical String attribute, [values]) pair contains a single value, an SQL LIKE operator is used, and standard SQL syntax and semantics are supported. When the values array contains more than one value, an SQL UNNEST operator is used, and standard SQL syntax and semantics are supported. See the documentation for details.
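For example, using the attributes and values discussed above:

```python
# Single value: matched with SQL LIKE, so % wildcards are supported.
{"StudyDescription": ["%SKULL%"]}

# Multiple values: matched as a set of alternatives (SQL UNNEST).
{"Modality": ["CT", "MR"]}
```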
A GUID is a persistent identifier that can be resolved to the current location of the corresponding data. GUID persistence ensures that the data which the GUID represents can continue to be located and accessed even if it has been moved to a different hosting site. A GUID identifies a particular version of an IDC data object, and there is a GUID for every version of every DICOM instance and series in IDC hosted data. GUIDs are issued by the NCI Cancer Research Data Commons. This is a typical CRDC GUID: dg.4DFC/83fdfb25-ad87-4879-b0f3-b9850ef0b216. A GUID can be resolved at the CRDC DRS server (https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/) by appending the UUID to that URL, e.g. (formatting added to the curl response for clarity):
The Swagger UI can be used to see details about the syntax of each call, and it also provides an interface to test requests.
For a quick demonstration of the syntax of an API call, test a simple request such as /collections. You can experiment with this endpoint by clicking the ‘Try it out’ button.
The IDC API conforms to the OpenAPI specification, which "defines a standard, language-agnostic interface to RESTful APIs which allows both humans and computers to discover and understand the capabilities of the service without access to source code, documentation, or through network traffic inspection."
If you have feedback about the desired features of the IDC API, please let us know via the IDC support forum.
Swagger UI is a web-based interface that allows users to try out APIs and easily view their documentation. The IDC API Swagger UI is available online.
This notebook serves as an interactive tutorial for accessing the IDC API using Python.
Clone the repository to your local machine.
Example usage of the generated authorization is demonstrated by code in the Google Colab notebook.
You can use Looker Studio to build a custom dashboard for your own cohort, which will look like the screenshot below, in three relatively simple steps.
Open the dashboard template link, and click "Use template" to make a copy of the dashboard.
(see details in the documentation)
Looker Studio is a free tool that turns your data into informative, easy to read, easy to share, and fully customizable dashboards and reports.
ACCESS is a program supported by the US National Science Foundation (NSF) to provide educators with free and convenient access to advanced computational resources.
If you have a university email account, you can complete a relatively easy application process to receive an allocation of free credits, which you can then use to create pre-configured, GPU-enabled, cloud-based Linux virtual machines with a desktop interface available via the browser. You can use those machines, for example, for convenient access to an interactive application instance for experimenting with AI models, or for training DL networks.
Create an account and request an ACCESS allocation at this page: . There are 4 different levels, with each giving you a different number of “credits” that you can use to create your VM instances. Each of these levels requires you to submit a different application. For the Explore ACCESS allocation (lowest tier), you need to write a simple abstract to justify why you need these resources. Other tiers require more lengthy descriptions of what you’ll do with the ACCESS resources. In our experience, applications can be approved within a few days of submission. You can be a PI and have multiple Co-PIs with you on the project, so you can all access the Jetstream2 resources.
Once you get approved, your allocation is valid for a 12 month period, and you get half of the credits to start. To start using these credits you exchange them for Service Units (SUs) on different platforms. We experimented with the one called JetStream2, which provides an easy interface to cloud-based computing resources. If you want to use JetStream2, you will need to exchange your ACCESS credit allocation for JetStream2 SUs here: . Usually this exchange is approved within a few days, if not less.
Once you get the SUs, you can access the JetStream interface to configure and create VMs here: (you can learn more about available configurations from this documentation page: ).
Very easy to set up. As of writing, there is no similar product available from Google Cloud that would provide desktop access to a VM with comparable ease. AWS provides a similar offering, but we have yet to evaluate it.
it took ~7 days and ~5000 SUs to train the model (see summary in the slides) using a g3.large VM configuration
JetStream2:
ACCESS:
This section contains various pointers that may be helpful when working with Google Colab.
IDC Colab example notebooks are maintained in this repository:
Notebook demonstrating deployment and application of an abdominal structures segmentation tool to IDC data, developed for the course:
, contributed by , Mayo Clinic
, contributed by , Mayo Clinic
Notebooks contributed by , ISB-CGC, demonstrating the utility of BigQuery in correlative analysis of radiomics and genomics data:
Colab limitations:
Transferring data between Colab and Google Drive:
Google Colab Tips for Power Users:
Mounting GCS bucket using gcsfuse:
Almost-free Jupyter Notebooks on Google Cloud:
Get started with using BigQuery with IDC data following our tutorial.
This repository contains various examples of using BigQuery for searching DICOM metadata in IDC BQ tables (see also the example after this list).
Temporary tables:
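As a minimal sketch of such a metadata search from Python (the google-cloud-bigquery client is assumed to be installed and authenticated, and "my-project" is a placeholder for your billing project):

```python
from google.cloud import bigquery

# Search IDC DICOM metadata in the public dicom_all table.
client = bigquery.Client(project="my-project")
query = """
SELECT SeriesInstanceUID, Modality, collection_id
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality = 'MR'
LIMIT 10
"""
for row in client.query(query).result():
    print(row.SeriesInstanceUID, row.collection_id)
```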
Visual Studio Code has a useful feature that allows you to develop code on a remote VM from the convenience of your desktop. You can follow the steps below to configure your development environment for this task.
Visual Studio Code (with the Remote - SSH extension) installed on your computer
Google Cloud SDK (gcloud) installed on your computer
IDC API v1 has been released with the IDC Production release (v4).
Our current experience in using NCI Cloud Resources for cancer image analysis is summarized in the following preprint:
A motivation for using desktop applications like 3D Slicer on a VM is that they put computing power close to the data, so heavy network operations such as storage bucket or DICOM store access may be significantly faster than accessing the same resources from a remote machine. VMs are also highly configurable, so you can easily allocate the number of cores or the amount of memory needed for a given task. Note that you can even change these configurations so that, for example, you can shut down the machine, add a GPU and more memory, and then boot the same instance and pick up where you left off.
In addition, these desktops are persistent in the sense that you can start a task such as labeling data for a machine learning task, disconnect your ssh session, and reconnect later to pick up where you left off without needing to restart applications or reload data. This can be convenient when tending long-running computations, accessing your work from different computers, or working on a network that sometimes disconnects.
The instructions here are just a starting point. There are many cloud options available to manage access scopes for the service accounts, allocate disks, and configure other options.
You can launch a VM with a GPU in your project with a command like this in your local terminal (replace vm-name with a name for your machine):
Once it boots in about 90 seconds you can type:
You can launch a VM without a GPU in your project with a command like this in your local terminal (replace vm-name with a name for your machine):
Once it boots in about 90 seconds you can type:
On the remote machine run:
Each time you reboot the machine, run this:
This section contains various recipes that might be useful in utilizing GCP Compute Engine (GCE).
You are also encouraged to review the slides in the following presentation, which provides an introduction to GCE and shares some best practices for its usage.
Most of the same Linux commands, scripts, pipelines/workflows, imaging software packages and Docker containers that you run on your local machine can be executed on virtual machines on Google Cloud with some experimentation and fine-tuning.
A good way to estimate costs for running a workflow/pipeline on large data sets is to test them first on a small subset of data.
Example use-cases:
Broad’s popular variant calling pipeline, GATK, was also designed to be able to run on preemptible VMs.
By default, each virtual machine instance has a single boot persistent disk that contains the operating system. The default size is 10GB but can be adjusted up to 64TB in size. (Be careful! High costs here, spend wisely!)
Persistent disks are restricted to the zone where your instance is located.
Use persistent disks if you are running analyses that require low latency and high-throughput.
Unlike persistent disks, Cloud Storage buckets are not restricted to the zone where your instance is located.
Additionally, you can read and write data to a bucket from multiple instances simultaneously.
You can mount a GCS bucket to your VM instance when latency is not a priority or when you need to share data easily between multiple instances or zones. An example use-case: You want to slice thousands of bam files and save the resulting slices to share with a collaborator who has instances in another zone to use for downstream statistical analyses.
Once a manifest has been created, typically the next step is to load the files onto a VM for analysis. The easiest way to do this is to create your manifest in a BigQuery table and then use that table to direct the file loading onto a VM. This guide shows how this can be done.
You also need to ensure the machine has enough disk space. One of the checks in the script provided below is to calculate the total file load size. You might want to run that portion of the script and resize the disk as needed before actually doing the load.
Performs a query on the specified BigQuery manifest table and creates a local manifest file on your VM.
Performs a query that maps the GCS URLs of each file into DICOM hierarchical directory paths, and writes this out as a local TSV file on your VM.
Performs a query that calculates the total size of all the downloads, and reports back if there is sufficient space on the filesystem to continue.
Uses a multi-threaded bucket reader to pull the files from the GCS buckets and places them in the appropriate DICOM hierarchical directory.
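The script itself lives in the referenced repository; as a minimal sketch of the final download step (the manifest file name and the flat output layout here are illustrative, not the script's actual behavior):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from google.cloud import storage

client = storage.Client()  # uses the VM's service account credentials

def download(gcs_url: str, dest_dir: str = "dicom") -> None:
    # gs://<bucket>/<GUID>.dcm -> local file; the real script would map
    # each file into a DICOM hierarchical directory instead.
    bucket_name, blob_name = gcs_url[len("gs://"):].split("/", 1)
    dest = Path(dest_dir) / blob_name
    dest.parent.mkdir(parents=True, exist_ok=True)
    client.bucket(bucket_name).blob(blob_name).download_to_filename(str(dest))

# manifest.txt: one gs:// URL per line (produced by the manifest query step)
with open("manifest.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Multi-threaded bucket reads, as described in the steps above.
with ThreadPoolExecutor(max_workers=16) as pool:
    pool.map(download, urls)
```

Alternatively, gsutil -m cp can perform a similar multi-threaded copy from a list of URLs.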
To install the code on your VM and then set up the environment:
You then need to customize the settings in the script:
Finally, run the script:
The NCI Cloud Resources are components of the Cancer Research Data Commons (CRDC) that bring data and computational power together to enable cancer research and discovery.
Thiriveedhi, V. K., Krishnaswamy, D., Clunie, D., Pieper, S., Kikinis, R. & Fedorov, A. Cloud-based large-scale curation of medical imaging data using AI segmentation. Research Square (2024).
These instructions provide a reference example of how you can start up a traditional workstation desktop on a VM instance to run interactive applications like 3D Slicer and access the desktop via a conventional web browser. Two options are shown, either with or without a GPU. Note that GPUs are significantly more expensive, so only enable one if needed. For 3D Slicer the main benefit of a GPU is for rendering, so operations like DICOM processing and image segmentation are quite usable without a GPU. Even volume rendering is fairly usable if you choose the CPU rendering option. Other operations, such as training machine learning models, may benefit from an appropriate GPU.
Then you can open the desktop URL in your browser to get to your desktop.
Then you can open the desktop URL in your browser to get to your desktop.
This effort is a work in progress with a minimal desktop environment. Further refinement is expected and community contributions would be welcome! A description of the background and possible evolution of this work is available.
See the IDC API endpoint details in the Swagger UI.
W. Longabaugh. Introduction to Google Cloud Platform. Presented at MICCAI 2021.
The basics and best practices on how to launch virtual machines (VMs) are described in our documentation. NOTE: When launching VMs, please maintain the default firewall settings.
Compute Engine instances can run the public images for Linux and Windows Server that Google provides, as well as private custom images that you can create or import. Be careful as you spin up a machine, as larger machines cost you more. If you are not using a machine, shut it down. You can always restart it easily when you need it. Example use-case: You would like to run a Windows-only genomics software package on the TCGA data. You can create a Windows-based VM instance.
More details on how to deploy Docker containers on VMs are described in Google’s documentation:
There are different VM types depending on the sort of jobs you wish to execute. By default, when you create a VM instance, it remains active until you either stop it or delete it. The costs associated with VM instances are detailed here:
If you plan on running many short compute-intensive jobs (for example indexing and sorting thousands of large bam files), you can execute your jobs on preemptible virtual machines. They are 80% cheaper than regular instances.
Using preemptible VMs, researchers were able to quantify transcript levels on over 11K TCGA RNAseq samples for a total cost of $1,065.49. Tatlow PJ, Piccolo SR. A cloud-based workflow to quantify transcript-expression levels in public cancer compendia. Scientific Reports 6, 39259
The costs of Google Cloud computing can be estimated using the Google Cloud pricing calculator.
Because it is possible to see the history of GitHub postings, if a password or bearer token is part of software code (e.g., a notebook or Colaboratory), it will be permanently available on GitHub. This is a security risk! Do not put bearer tokens or other passwords into workbooks; instead, refer to them in the code and place them in a location not posted to GitHub (if you do post a secret to GitHub, it immediately becomes public, usable, and able to be stolen and used maliciously by others). If you do accidentally post one to GitHub: 1) immediately change passwords on your systems to remove the exposure created by the exposed password, 2) let those involved in the security of your system and data know, and 3) remedy your code base so future saves to GitHub do not include passwords or tokens.
The Google Cloud Platform offers a number of different storage options for your virtual machine instances:
Google Cloud Storage (GCS) buckets are the most flexible and economical storage option.
You can save objects to GCS buckets including images, videos, blobs and unstructured data. A comparison table detailing the current pricing of Google’s storage options can be found here:
The first step is to export your cohort manifest to a BigQuery table. You will want to copy this table into the project where you are going to run your VM. Do this using the Google BQ console, since the exported table can be accessed only using your personal credentials provided by your browser. The table copy living in the VM project will be readable by the service account running your VM.
Start up your VM. If you have many files, you will want to speed up the loading process by using a VM with multiple CPUs. Google describes the various machine types, but is not very specific about ingress bandwidth. However, in terms of published egress bandwidth, the larger machines certainly have more. Experimentation showed that an n2-standard-8 (8 vCPUs, 32 GB memory) machine could load 20,000 DICOM files in 2 minutes and 32 seconds, using 16 threads on 8 CPUs. That configuration reached a peak throughput of 68 MiB/s.
The script performs the following steps:
dicom_all_view is a BQ view, as indicated by the icon to the left of the table name. The dicom_all table is the result of running the query that defines the dicom_all_view.
Get links to the IDC API Swagger UI and IDC documentation
Returns a list of IDC data versions and activation dates
Returns a list of collections, and associated metadata, in the current IDC data version.
Returns a list of the analysis results, and associated metadata, in the current IDC data version
Returns a list of 'filters', organized by data source (BQ table), for the current IDC data version. An IDC cohort is defined by a 'filterset', a set of (filter,[values]) pairs, and the IDC version against which the filterset is applied. The returned data is grouped by source (the BQ table that contains the corresponding filter values). For each filter, its data type and units, when available, are reported.
Return a list of the values accepted for a 'categorical filter'. A categorical filter is a filter having a data type of 'Categorical String' or 'Categorical Number'.
Categorical filter whose values are to be returned
Return a list of queryable manifest fields.
IDC data version whose data is to be returned. If the version is 'current', the fields of the current IDC version are returned.
Returns a list of the user's cohorts and associated metadata. Authorization is required in order to access this endpoint.
Delete a specified cohort. Authorization is required in order to access this endpoint.
ID of cohort to be deleted.
Returns the next page of a /cohorts/manifest/preview request, when additional data is available.
The next_page token returned by a previous access of the /cohorts/manifest/preview endpoint. The token identifies the next page to be retrieved
The maximum number of rows to be returned. If the manifest contains additional rows, another 'next_page' token is returned.
Returns the next page of a /cohorts/manifest request, when additional data is available. Authorization is required in order to access this endpoint.
The next_page token returned by a previous access of the /cohorts/manifest endpoint. The token identifies the next page to be retrieved
The maximum number of rows to be returned. If the manifest contains additional rows, another 'next_page' token is returned.
Retrieve user's account information. Authorization is required in order to access this endpoint.
Create a cohort as defined by a specified 'filterset' and IDC version. Authorization is required in order to access this endpoint.
""
""
Delete a list of the user's cohorts. Authorization is required in order to access this endpoint.
Returns a manifest of a 'previewed cohort' as defined by a specified filterset. The filterset is always applied to the current IDC version. The metadata to be returned in the manifest is configurable. A previewed cohort is not saved in the user's IDC account.
If True, return counts of DICOM objects. Default: False.
If True, return size in bytes of instances in group. Default: False.
If True, return the BQ SQL for this query. Default: False.
Maximum number of rows to return. Default: 1000.
Returns a manifest of a cohort that was previously saved under the user's IDC account. The metadata to be returned in the manifest is configurable. Authorization is required in order to access this endpoint.
IDC Cohort ID
If True, return counts of DICOM objects. Default: False.
If True, return size in bytes of instances in group. Default: False.
If True, return the BQ SQL for this query. Default: False.
Maximum number of rows to return. Default: 1000.