Only this pageAll pages
Powered by GitBook
1 of 85

prod

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Data

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Tutorials

Loading...

Loading...

Loading...

DICOM

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Portal

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

API

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Cookbook

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Support

If you need support with IDC or have any questions, please open a new topic in IDC User Forum (preferred) or send email to [email protected].

Would you rather discuss your questions in an meeting with an expert from the IDC team? Book a 1-on-1 support session here: https://tinyurl.com/idc-help-request

Discounted use and training materials for NIH-funded investigators

If you are an NIH-funded investigator, you can join the that offers significant discounts on the use of cloud resources, and free training courses and materials on the use of the cloud.

Welcome!

If you need support with IDC or have any questions, please open a new topic in (preferred) or send email to [email protected].

Would you rather discuss your questions in an meeting with an expert from the IDC team? Book a 1-on-1 support session here:

is a cloud-based environment containing publicly available cancer imaging data co-located with analysis and exploration tools. IDC is a node within the broader NCI infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data.

NIH STRIDES Initiative
Highlights
  • >95 TB of data: IDC contains radiology, brightfield (H&E) and fluorescence slide microscopy images, along with image-derived data (annotations, segmentations, quantitative measurements) and accompanying clinical data

  • free: all of the data in IDC is publicly available: no registration, no access requests

  • commercial-friendly: >95% of the data in IDC is covered by the permissive CC-BY license, which allows commercial reuse (small subset of data is covered by the CC-NC license); each file in IDC is tagged with the license to make it easier for you to understand and follow the rules

  • cloud-based: all of the data in IDC is available from both Google and AWS public buckets: fast and free to download, no out-of-cloud egress fees

  • harmonized: all of the images and image-derived data in IDC is harmonized into standard DICOM representation

Functionality

IDC is as much about data as it is about what you can do with the data! We maintain and actively develop a variety of tools that are designed to help you efficiently navigate, access and analyze IDC data:

  • exploration: start with the IDC Portal to get an idea of the data available

  • visualization: examine images and image-derived annotations and analysis results from the convenience of your browser using integrated OHIF, VolView and Slim open source viewers

  • programmatic access: use idc-index python package to perform search, download and other operations programmatically

  • cohort building: use rich and extensive metadata to build subsets of data programmatically using idc-index or BigQuery SQL

  • download: use your favorite S3 API client or idc-index to efficiently fetch any of the IDC files from our public buckets

  • analysis: conveniently access IDC files and metadata from the tools that are cloud-native, such as Google Colab or Looker; fetch IDC data directly into 3D Slicer using

The overview of IDC is available in this open access publication. If you use IDC, please acknowledge us by citing it!

Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180

IDC User Forum
https://tinyurl.com/idc-help-request
NCI Imaging Data Commons (IDC)
Cancer Research Data Commons (CRDC)

Getting started

We want Imaging Data Commons to be your companion in your cancer imaging research activities - from discovering relevant data to sharing your analysis results and showcasing the tools you developed!

Explore the data available

Check out quick instructions on how to access and use IDC Portal web application that will help you search, subset and visualize data available in IDC.

IDC Portal is integrated with powerful visualization tools: just with your web browser you will be able to see IDC images and annotations using OHIF Viewer, Slim viewer and VolView!

Subset the content you need

We have many tools to help you search data in IDC, so that you download only what you need!

  • you can do basic filtering/subsetting of the data using IDC Portal, but if you are developer, you will want to learn how to use for programmatic access. will introduce you to the basics of idc-index for interaction with IDC content.

  • search clinical data: many of the IDC collections are accompanied by clinical data, which we parsed for you into searchable tabular representation - no need to download or parse CSV/Excel/PDF files! Dive into searching clinical data using .

  • if advanced content does not scare you, check out to learn how to search all of the metadata accompanying IDC using SQL and Google BigQuery.

Download the data you liked

We provide various tools for downloading data from IDC, as discussed in the . Access to all data in IDC is free! No registration. No access request forms. No logins.

  • once you have idc-index python package installed, download from the command line is as easy as running idc download <manifest_file>, or idc download <collection_id>.

  • looking for an interactive "point-and-click" application? is for you (note that you will only be able to visualize radiology - not microscopy - images in 3D Slicer)

Experiment with analysis tools

We want to make it easier to understand performance of the latest advances in AI on real-world cancer imaging data!

  • if you have a Google account, you have free access to Google Colab, which allows you to run python notebooks on cloud VMs equipped with GPU - for free! Combined with idc-index for data access, this makes it rather easy to experiment with the latest AI tools! As an example, take a look at that allows you to apply MedSAM model to IDC data. You will find a growing number of notebooks to help you use IDC in .

  • use IDC to develop HuggingFace spaces that demonstrate the power of your models on real data: see we developed for SegVol

  • growing number of AI medical imaging models is being curated on the platform; see to learn how to apply those models on data from IDC

How about accompanying your next publication by a working demonstration notebook on relevant samples from IDC? You can see an example how we did this in .

Scale the analysis to thousands of cloud VMs

With the cloud, you can do things that are simply impossible to do with your local resources.

  • read to learn how we applied TotalSegmentator+pyradiomics to >126,000 of CT scans of the NLST collection using Terra platform, completing the analysis in ~8 hours with the total cost ~$1000

  • contains the code we used in the above (this is really advanced content!)

Share analysis results or annotations

If you have an algorithm, that you evaluated/published, that can enrich data in IDC with analysis results and you want to contribute those, or if you are a domain expert and would like to publish results of manual annotations you prepared - we want to hear from you!

  • IDC maintains a where we curate contributions of analysis results and other datasets produced by IDC (see the as one example of such contribution)

  • through a dedicated Zenodo record you will have a citation and DOI to get credit for your work; your data is ingested from Zenodo into IDC, and citation will be generated for the users of your data in IDC

  • once your data is in IDC, it should be easier to discover it, combine with other datasets, visualize and use from analysis workflows (as an example, see accompanying the RMS annotations)

Questions?

If you need support with IDC or have any questions, please open a new topic in (preferred) or send email to [email protected].

Would you rather discuss your questions in an meeting with an expert from the IDC team? Book a 1-on-1 support session here:

Key pointers

Resources maintained by the IDC team

  • Portal: https://imaging.datacommons.cancer.gov

  • Discourse (community forum):

  • Documentation:

  • GitHub organization:

  • Tutorials:

Other locations for accessing public imaging data

If you did not find the images you need in IDC, you can consider the following resources:

  • : while most of the public DICOM collections from TCIA are available in IDC, we do not replicate limited access TCIA collections

  • : list curated by Stephen Aylward

  • : list curated by University College London

  • : list curated by New York Univestity Health Sciences Library

Core functions

Easy and efficient access to public cancer imaging data

We ingest and distribute datasets from variety of sources and contributors, primarily focusing on large data collection initiatives sponsored by US National Cancer Institute.

At this time, we do not have resources to prioritize receipt of the imaging data from individual PIs (but we are encouraging submissions of annotations/analysis results for existing IDC data!). Nevertheless, if you feel you might have a compelling dataset, please email us at [email protected].

On ingestion, we harmonize images and image-derived data into DICOM format for interoperability, whenever data is represented in a non-DICOM format.

Upon conversion, the data undergoes Extract-Transform-Load (ETL), which extracts DICOM metadata to make the data searchable, ingests the DICOM files into public S3 storage buckets and a DICOMweb store. Once the data is released, we provide various interfaces to access data and metadata.

Tools to simplify the use of the data

We are actively developing a variety of capabilities to make it easier for the users to work with the data in IDC. Some of the examples of those tools include

  • provides interactive browser-based interface for exploration of IDC data

  • we are the maintainers of - an open-source viewer of DICOM digital pathology images; Slim is integrated with IDC Portal for visualizing pathology images and image-derived data available in IDC

  • we are actively contributing to the , and rely on it for visualizing radiology images and image-derived data

  • is a python package that provides convenience functions for accessing IDC data, including efficient download from IDC public S3 buckets

Support of continuous enrichment of data

We welcome you to apply to contribute analysis results and annotations of the images available in IDC! These can be expert manual annotations, analysis results generated using AI tools, segmentations, contours, metadata attributes describing the data (e.g., annotation of the scan type), expert evaluation of the quality of existing AI-generated annotations in IDC.

If you would like your annotations/analysis results to be considered, you must establish the value of your contribution (e.g., describe the qualifications of the experts performing manual annotations, demonstrate robustness of the AI tool you are applying to images with a peer-reviewed publication or other type of evidence), and be willing to share your contribution under a permissive Creative Commons Attribution .

See more details on our curation policy , and reach out by sending email to with any questions or inquries. Every application will be reviewed by IDC stakeholders.

If your contribution is accepted by the IDC stakeholders:

  • we will work with you to choose the appropriate DICOM object type for your data and convert it into DICOM representation

  • upon conversion, we will create a Zenodo entry under the for your contribution so that you get the Digital Object Identifier (DOI), citation and recognition of your contribution

  • once published in IDC

    • your data will become searchable and viewable in IDC Portal, so it is easier for the users of your data to discover and work with your data

Integration of cancer imaging data with other components of CRDC

IDC is a component of the broader NCI , giving you access to the following:

  • can be used to find data related to the images in IDC in , and

  • Broad and (SB-CGC) can be used to apply analysis tools to the data in IDC (you can read more about how this can be done in from the IDC team)

  • platform curates a growing number of cancer imaging AI models that can be applied directly to the DICOM data available in IDC

SlicerIDCBrowser extension
https://discourse.canceridc.dev/
https://learn.canceridc.dev/
https://github.com/ImagingDataCommons
https://github.com/ImagingDataCommons/IDC-Tutorials/tree/master/notebooks/getting_started
Publications
The Cancer Imaging Archive (TCIA)
Open-Access Medical Imaging Repositories
Medical Image Repositories
Finding Medical Images
email us at [email protected] to inquire about contributing your annotations/analysis results to IDC!
idc-index python package
This python notebook
this notebook
this notebook
Download documentation page
3D Slicer IDC Browser extension
this notebook
this repository
this space
MHub.ai
this notebook
this recent publication
this preprint
this repository
Zenodo community
expert annotations of the RMS-Mutations-Prediction microscopy images collection
this notebook
IDC User Forum
https://tinyurl.com/idc-help-request

Organization of data, v2 through V13 (deprecated)

IDC V14 introduced important enhancements to IDC data organization. The discussion of the organization of data in earlier versions is preserved here.

Jobs

IDC does not currently have open positions

IDC DICOM white papers

The following white papers are intended to provide explanation and clarification into applying DICOM to encoding specific types of data.

Comments and questions regarding those white papers are welcomed from the community! Please ask any related questions on IDC Discourse, or by adding comments directly in the documents referenced below:

  • DICOM SR for communicating planar annotations, 2020

Deprecated functionality

Items in this section capture documentation relevant to organization of data in prior versions of IDC. Those are no longer relevant for the current data organization, and are preserved since the prior versions of data are still available to IDC users.

Release Notes

IDC API v1 has been released with the IDC Production release (v4).

  • 3D Slicer extensions SlicerIDCBrowser can be used for interactive download of IDC data

  • we are contributing to a variety of tools that aim to simplify the use of DICOM in cancer imaging research; these include OpenSlide and BioFormats bfconvert library that can be used for conversion between DICOM Whole Slide Imaging (WSI) format and other slide microscopy formats, dcmqi library for converting image analysis results to and from DICOM representation

  • files can be downloaded very efficiently using S3 interface and idc-index

  • IDC Portal
    Slim
    OHIF Viewer
    idc-index
    CC BY 4.0 license
    here
    [email protected]
    NCI Imaging Data Commons Zenodo community
    Cancer Research Data Commons (CRDC)
    Cancer Data Aggregator (CDA)
    Genomics Data Commons
    Proteomics Data Commons
    Integrated Canine Data Commons
    FireCloud
    Seven Bridges Cancer Genimics Cloud
    this preprint
    MHub.AI

    IDC Portal

    You can download data at the patient/case, DICOM study or series levels directly from the IDC Portal interface, as demonstrated below!

    Portal tutorial

    The Imaging Data Commons Portal provides a web-based interactive interface to browse the data hosted by IDC, visualize images, build manifests describing selected cohorts, and download images defined by the manifests.

    The slides below give a quick guided overview of how you can use IDC Portal.

    No login is required to use the portal, to visualize images, or to download data from IDC!

    Acknowledgments

    • Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.

    • We gratefully acknowledge and the that support public hosting of IDC-curated content, and cover out-of-cloud egress fees!

    • Several of the members of the IDC team utilize compute resources supported via the

    Data selection and download

    IDC Portal offers lots of flexibility in selecting items to download. In all cases, download of data from IDC Portal is a two step process:

    1. Select items and export a manifest corresponding to your selection.

    2. Use command-line python tool or 3D Slicer IDC browser extension to download the files for your selection, as discussed in .

    Downloading data

    If you have questions or feedback about the download tools provided by IDC, please reach out via our - we are very interested in hearing your feedback and suggestions!

    IDC supports a variety of interfaces for fetching individual images, cohorts (groups of images), or portions of images, using desktop application, command-line interface, or programmatic API. These interfaces are covered in the subsequent pages. You should select the specific approach to accessing IDC data depending on your requirements.

    • Download directly from : no prerequisites other than a Chrome web browser!

    DICOM data model

    DICOM defines its own model to map relevant entities from the real world. That model, as , is shown in the figure below.

    The DICOM data model is implicit, and is not defined in a machine-readable structured form by the standard!

    DICOM data model entities do not always map to DICOM objects! In fact, every DICOM object you will ever encounter in IDC will contain attributes describing various properties of the entries at different levels of the real world data model. Such objects are called Composite Information Objects. The of the Composite Information Object Definitions is shown below, and covers all of the composite objects defined by the standard.

    Configuring your search

    Noteworthy details about the search scope and configuration

    • Numbers in the grayed ovals next to the search filters indicate the total number of cases (patients) that have the specific attribute

    DICOM-TIFF dual personality files

    DICOM and TIFF are two different image file formats that share many similar characteristics, and are capable of encoding exactly the same pixel data, whether uncompressed, or compressed with common lossy schemes (including JPEG and JPEG 2000). This allow the pixel data to be losslessly transformed from one format to the other and back.

    The DICOM file format was also deliberately designed to allow the two formats (TIFF and DICOM) to peacefully co-exist in the same file, sharing the same pixel data without expanding the file size significantly. This is achieved by leaving some unused space at the front of the DICOM file ("preamble"), which allows for the presence of a TIFF format recognition code ("magic number") and a pointer to its Image File Directory (IFD), which in turn contains pointers into the shared DICOM Pixel Data element.

    The dual-personality mechanism supports both traditional strip-based TIFF organization, such as might be used to encode a single frame image, as well as the tile-based format, which is commonly used for Whole Slide Images (WSI), and which is encoded in DICOM with each tile as a frame of a "multi-frame" image.

    Unlike TIFF files, which allow multiple different sized images to be encoded in the same file, DICOM does not, so there are limits to this approach. For example, though an entire WSI pyramid can be encoded in a TIFF file, the DICOM WSI definition requires each pyramid layer to be in a separate file, and all frames (tiles) within the same file to be the same size.

    Most of the structural metadata that describes the organization and encoding of the pixel data is similar in DICOM and TIFF. It is copied into the tags (data elements) encoded in the respective format "headers". Biomedical-specific information, such as patient, specimen and anatomical identifiers and descriptions, as well as acquisition technique, is generally only encoded in the DICOM data elements, their being no corresponding standard TIFF tags for it. Limited spatial information (such as physical pixel size) can be encoded in TIFF tags, but more complex multi-dimensional spatial location is standardized only in the DICOM data elements.

    More dashboard examples

    Topic-specific dashboards

    Proxy policy

    TL;DR: if you want to download images from IDC, you can do it without charge, limits or sign-ins from our cloud storage buckets. See instructions in .

    The primary mechanism for accessing data from IDC is by searching the metadata using the idc-index python package or BigQuery tables, and downloading the binary files from public cloud buckets, as discussed in . There is no limit, quota or fees associated with downloading IDC files from the buckets.

    Effective March 2024, as a pilot project, IDC also provides access to the DICOM data via the DICOMweb interface available at this endpoint: . This endpoint is read-only. It will route the requests to the Google Healthcare API DICOM store containing IDC data.

    Our DICOMWeb endpoint should only be used when data access needs cannot be satisfied using other mechanisms (e.g., when accessing individual frames of the microscopy images without having to download the entire binary file).

    Egress of data via the DICOMweb interface is capped at a non-disclosed limit that is tracked per IP. It is not acceptable to “IP hop” in an attempt to circumvent individual daily quotas, since there is also a global daily cap as well to prevent full egress of the imaging collection. Note that if this global cap is hit, all other users of the site would be unable to use the viewers for the rest of the day (using the UTC clock). Thus, IP hopping against the proxy that causes the global quota to be hit will be considered a denial-of-service attack.

    Getting Started

    As of 15-July-2025, support for user defined cohorts has been removed from the both the IDC API V1 and the IDC API V2. The API documentation has been revised accordingly.

    This section describes version 2 of the IDC REST API . The documentation for the version 1 API will be found .

    This API is designed for use by developers of image analysis and data mining tools to directly query the public resources of the IDC and retrieve information into their applications. The API complements the IDC web application but eliminates the need for users to visit the IDC web pages to perform manifest export, and transfer of image data to some local file system.

    The IDC API conforms to the specification which "defines a standard, language-agnostic interface to RESTful APIs which allows both humans and computers to discover and understand the capabilities of the service without access to source code, documentation, or through network traffic inspection."

    BigQuery

    Get started with using BigQuery with IDC data following our

    contains various examples of using BigQuery for searching DICOM metadata in IDC BQ tables.

    Learning resources

    Getting Started

    Note: As of 15-July-2025, support for user defined cohorts has been removed from the both the IDC API V1 and the IDC API V2. The API documentation has been revised accordingly.

    This section describes v1 of the IDC REST API . This API is designed for use by developers of image analysis and data mining tools to directly query the public resources of the IDC and retrieve information into their applications. The API complements the IDC web application but eliminates the need for users to visit the IDC web pages to perform cohort creation, manifest export, and transfer of image data to some local file system.

    The IDC API conforms to the specification which "defines a standard, language-agnostic interface to RESTful APIs which allows both humans and computers to discover and understand the capabilities of the service without access to source code, documentation, or through network traffic inspection."

    If you have feedback about the desired features of the IDC API, please let us know via the IDC

    Slide microscopy

    This section of the documentation complements the tutorials available in our notebooks repository:

    • : all of the pathology images in IDC are in DICOM Slide Microscopy format; this notebook will help you get started with using this representation and also searching IDC pathology images.

    • : introduction to the key metadata accompanying IDC slide microscopy images that can be used for subsetting data and building cohorts.

    In the following subsections you will find notebooks that don't require python programming, or have dependencies that make them not suitable for the python notebook format.

    Exploring search results

    Search results are updated dynamically based on the search configuration. At any time you can expand the items on the right to explore the selected collections, cases, studies and series.

    Noteworthy details about navigating search results

    • Studies and series tables include the button to open those in the browser-based image viewer.

    Endpoint Details

    See IDC API endpoint details at .

    Looker dashboards

    is a free tool that turns your data into informative, easy to read, easy to share, and fully customizable dashboards and reports.

    If you would like to share an interesting Looker Studio dashboard that uses IDC/cloud for imaging research, please let us know and we would be happy to review and reference it from the IDC documentation!

    In this section you can learn how to very quickly make a custom Looker Studio dashboard to explore the content of your cohort, and find some additional examples of using Looker Studio for analyzing content of IDC.

    NCI Cloud Resources

    are components of the that bring data and computational power together to enable cancer research and discovery.

    Our current experience in using NCI Cloud Resources for cancer image analysis is summarized in the following preprint:

    Thiriveedhi, V. K., Krishnaswamy, D., Clunie, D., Pieper, S., Kikinis, R. & Fedorov, A. Cloud-based large-scale curation of medical imaging data using AI segmentation. Research Square (2024). doi:

    Additional tools

    Since IDC data is available via standard interfaces, you can use any of the tool supporting those interfaces to access the data. This page provides pointers to some of such tools that you might find useful.

    If you are aware of any other tool that is not listed here, but is helpful for accessing IDC data, please let us know on the , and we will be happy to add it here!

    S3 API

    • : open-source Python interface to ,

    Compute engine

    This section contains various recipes that might be useful in utilizing GCP Compute Engine (GCE).

    You are also encouraged to review the slides in the following presentation that provides an introduction into GCE, and shares some best practices the its usage.

    W. Longabaugh. Introduction to Google Cloud Platform. Presented at MICCAI 2021. ()

    The dictionary of TIFF tags can be extended with application-specific entries. This has been done for various non-medical and medical applications (e.g., GeoTIFF, DNG, DEFF). Other tools have used alternative mechanisms, such as defining text string (Leica/Aperio SVS) or structured metadata in other formats (such as XML for OME) buried within a TIFF string tag (e.g, ImageDescription). This approach can be used with DICOM-TIFF dual-personality files as well, since DICOM does not restrict the content of the TIFF tags; it does require updating or crafting of the textual metadata to actually reflect the characteristics of the encoded pixel data.

    It is hoped that the dual-personality approach may serve to mitigate the impact of limited support of one format or the other in different clinical and research tools for acquisition, analysis, storage, indexing, distribution, viewing and annotation.

    For further information and an example open source implementation, see

    Clunie, D. A. Dual-Personality DICOM-TIFF for whole slide images: A migration technique for legacy software. J. Pathol. Inform. 10, 12 (2019).

    idc-index interface: command-line and Python API interface to download images corresponding to the specific patient/study/series, or a cohort defined by a manifest

  • 3D Slicer interface: desktop application to download images corresponding to the specific patient/study/series, or a cohort defined by a manifest

  • s5cmd: command-line interface to download images for a cohort defined by a manifest (unlike idc-index, does not organize downloaded images into folders corresponding to IDC data model hierarchy)

  • DICOMweb interface: REST API interface to access both metadata and pixel data at the granularity of image frames/tiles

  • Directly loading DICOM objects from Google Cloud or AWS in Python: Python API interface to access both metadata and pixel data at the granularity of image frames/tiles

  • forum
    IDC Portal

    MRI dashboard

    Program- and Collection-specific dashboards

    • HTAN program dashboard

    • RMS-Mutation-Prediction collection dashboard

    • LIDC-IDRI collection dashboard (see details in this paper)

    • TotalSegmentator-CT-Segmentations

    Digital pathology / slide microscopy dashboard
    Preclinical images dashboard
    CT contrast images exploration

    If you reach your daily quota, but feel you have a compelling cancer imaging research use case to request an exception to the policy and an increase in your daily quota, please reach out to us at [email protected] to discuss the situation.

    We are continuously monitoring the usage of the proxy. Depending on the actual costs and usage, this policy may be revisited in the future to restrict access via the DICOMweb interface for any uses other than IDC viewers.

    this article
    this article
    https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb

    If you have feedback about the desired features of the IDC API, please let us know via the IDC support forum.

    The API is a RESTful interface, accessed through web URLs. There is no software that an application developer needs to download in order to use the API. The application developer can build their own access routines using just the API documentation provided. The interface employs a set of predefined query functions that access IDC data sources.

    The IDC API is intended to enable exploration of IDC hosted data without the need to understand and use the Structure Query Language (SQL). To this end, data exploration capabilities through the IDC API are limited. However, IDC data is hosted using the standard capabilities of the the Google Cloud Platform (GCP) Storage (GCS) and BigQuery (BQ) components. Therefore, all of the capabilities provided by GCP to access GCS storage buckets and BQ tables are available for more advanced interaction with that data.

    Other API Documentation

    SwaggerUI is a web based interface that allows users to try out APIs and easily view their documentation. You can access the IDC API SwaggerUI here.

    This Google Colab notebook serves as an interactive tutorial to accessing the IDC API using Python.

    here
    OpenAPI 2.0

    Query performance considerations

    • https://cloud.google.com/bigquery/docs/best-practices-performance-overview

    • https://cloud.google.com/bigquery/docs/best-practices-performance-compute

    • Temporary tables: https://cloud.google.com/bigquery/docs/multi-statement-queries#temporary_tables

    Getting Started tutorial
    This document
    Intro to SQL course on Kaggle
    digital pathology self-guided python notebooks
    Getting started with pathology images in IDC
    Exploring IDC slide microscopy images metadata
    https://api.imaging.datacommons.cancer.gov/v1/swagger
    Google Looker Studio
    NCI Cloud Resources
    NCI Cancer Research Data Commons
    10.21203/rs.3.rs-4351526/v1
    slides in Google Slides

    V1 API

    Allocations program, which is in turn funded by the US National Science Foundation. You can see instructions on how to get your allocation at
    .
    Google Public Data Program
    AWS Open Data Sponsorship Program
    JetStream2
    ACCESS
    this documentation page
    .

    The API is a RESTful interface, accessed through web URLs. There is no software that an application developer needs to download in order to use the API. The application developer can build their own access routines using just the API documentation provided. The interface employs a set of predefined query functions that access IDC data sources.

    The IDC API is intended to enable exploration of IDC hosted data without the need to understand and use the Structure Query Language (SQL). To this end, data exploration capabilities through the IDC API are limited. However, IDC data is hosted using the standard capabilities of the the Google Cloud Platform (GCP) Storage (GCS) and BigQuery (BQ) components. Therefore, all of the capabilities provided by GCP to access GCS storage buckets and BQ tables are available for more advanced interaction with that data.

    Other API Documentation

    SwaggerUI is a web based interface that allows users to try out APIs and easily view their documentation. You can access the IDC API SwaggerUI here.

    This Google Colab notebook serves as an interactive tutorial to accessing the IDC API using Python.

    OpenAPI 2.0
    support forum
    ,
    , & other S3-compliant APIs, powered by Rust

    IDC forum
    obstore
    Amazon S3
    Google Cloud Storage
    Azure Storage
    "IDC manifest" is a text file that contains URLs to the files in cloud buckets that correspond to your selection. It will contain one line for each DICOM series, as IDC files are organized in series-level folders in the cloud storage.

    Downloading content using Cart

    You will see "Cart" icon in the search results collections/cases/studies/series tables. Any of the items in these tables can be added to the cart for subsequent downloading of the corresponding files.

    Get the manifest for the cart content using "Manifest" button in the Cart panel.

    Downloading all of the files for the current search configuration

    Clicking "Manifest" button in the "Cohort Filters" panel will given you the manifest for all of the studies that match your current selection criteria.

    Download individual studies or series

    Studies table contains a button for downloading manifest that will contain references to the files in the given study. To download a single series, no manifest is needed. You will see the command line to run to do the download.

    Downloading images in the viewers

    If you would like to download the entire study, or the specific image you see in the image viewer, you can use the download button in the viewer interface.

    Button to toggle download instructions in IDC radiology (OHIF v3) viewer
    Button to toggle download instructions in IDC microscopy (Slim) viewer
    Downloading data
    DICOM Composite Instance IOD Information Model

    As can be observed from this diagram, "each Composite Instance IOD [Entity-Relationship] Model requires that all Composite Instances that are part of a specific Study shall share the same context. That is, all Composite Instances within a specific Patient Study share the same Patient and Study information; all Composite Instances within the same Series share the same Series information; etc." (ref).

    Each of the boxes in the diagram above corresponds to Information Entities (IEs), which in turn are composed from Information Modules. Information Modules group attributes that are related. As an example, Patient IE included in the MR object will include Patient Information Module, which in turn will include such attributes as PatientID, PatientName, and PatientSex.

    To recap, individual files in IDC buckets are instances of DICOM composite objects. The specific type of object is defined by theSOPClassUID.Metadata included in a given object is captured by the Information Object Definition as a list of Information Entities. Object definition also contains the list of specific Information Modules for the individual IEs. Note that modules included in a given IE are object-specific, while list of attributes included in a given Information Module is fixed!

    defined in Part 3 of the standard
    information model
    DICOM model of the real world

    Click on the "i" button to toggle information panel about the individual items in the search panels

  • Cohort filters panel: get the shareable URL for the current selection by clicking "URL" button in the Cohort Filters panel

  • Get the manifest for downloading all of the matching studies by clicking "Manifest" button in the Cohort Filters panel

  • Note how Cohort Filters section is updated as your selection changes
    • You can copy identifiers of the individual collections, cases, studies or series to the clipboard - those can be used to download corresponding files as discussed in the Downloading data section - using command-line download tool or 3D Slicer IDC extension

    Drill down the results of the selection by expanding the content of the individual collections, cases and studies

    Frequently Asked Questions

    What is the difference between IDC and TCIA?

    IDC and TCIA are partners in providing FAIR data for cancer imaging researchers.

    TCIA provides unique service to work with data submitters to de-identify cancer imaging data and make it available for download.

    The mission of IDC is to support efficient access and use of the cancer imaging data, after it was de-identified and released.

    Here are some of the highlights that make IDC unique:

    • Unique datasets: while all of the public TCIA DICOM collections are available in IDC, there is a growing amount of data in IDC that is not available anywhere else:

      • DICOM digital pathology collections from prominent initiatives: Childhood Cancer Data Initiative (CCDI), GTEx, TCGA, CPTAC, HTAN, CMB

      • image analysis results available only from IDC, such as TotalSegmentator segmentations and radiomics features for most of the CT images in the NLST collection

    • Cloud-native: IDC makes the data available in public cloud buckets, the egress is free (TCIA provides download from on-premises servers at a single institution): chances are your will be able to download data from IDC much faster than from TCIA

    • Partnerships with cloud vendors: IDC collaborates with Public Datasets Programs of Amazon Web Services and Google Cloud to support hosting and free out-of-cloud egress, contributing to improved accessibility, sustainability and longevity of the resource

    • State of the art tools: IDC maintains superior community recognized tools to support the use of the data:

      • modern OHIF Viewer v3 for radiology data, with support of visualization of annotations and segmentations;

      • Slim viewer for digital pathology and annotations

      • highly capable IDC Portal

    • Standard access interfaces: IDC offers standard interfaces for data access: S3 API for file download, DICOMweb for interoperability with DICOM tools, SQL for searching all of the DICOM metadata (TCIA offers various non-standard, in-house interfaces and APIs for data access)

    • Harmonized data: All of the data (radiology and digital pathology images, annotations, segmentations, image-derived features) available in IDC is harmonized into DICOM representation, which means

      • interoperability: you can use IDC data with any DICOM-compatible tool

      • metadata: every single file in IDC is accompanied by metadata that follows DICOM data model, and is associated with unique identifiers, allowing you to build reproducible cohorts

      • uniform representation: you don't need to customize your processing pipelines to a specific collection, and can build cohorts combining data across collections

    • Co-location with cloud compute resources: IDC data is easier to access from cloud computing resources, allowing you to more easily experiment with the new analysis tools and scale your computation

    • Versioning: IDC data is versioned: you will be able to access the exact files you analyzed in a given verison of IDC even if there were any updates to the collection after you accessed it, helping you achieve reproducibility of your analyses

    • Open-source tool stack: all of the tools developed by IDC are shared under permissive licenses to support community contribution, reuse and sustainability

    How to download data from IDC?

    Check out the documentation page!

    How do I get my data into IDC?

    Note that currently IDC prioritizes submissions from NCI-funded driving projects and data from special selected projects.

    • If you would like to submit images, it will be your responsibility to de-identify them first, documenting the de-identification process and submitting that documentation for the review by IDC stakeholders.

    • We welcome submissions of image-derived data (expert annotations, AI-generated segmentations) for the images already in IDC, see IDC Zenodo community to learn about the requirements for such submissions!

    IDC works closely with and mirrors TCIA public collections. If you submit your DICOM data to TCIA and your data is released as a public collection, it will be automatically available in IDC in a following release.

    If you are interested in making your data available within IDC, please contact us by sending email to .

    How much does it cost to use the cloud?

    IDC data is stored in the cloud buckets, and you can search and for free and without login.

    If you would like to use the cloud for analysis of the data, we recommend you start with the free tier of to get free access to a cloud-hosted VM with GPU to experiment with analysis workflows for IDC data. If you are an NIH-funded researcher, you may be eligible for a free allocation via . US-based researchers can also access free cloud-based computing resources via .

    What is the status of IDC?

    IDC pilot release took place in Fall 2020, followed by the production release in September 2021. IDC team is continuously refining the capabilities of IDC Portal and various tools, and publishes new data releases every 3-4 months.

    What data is available?

    We host most of the public collections from . We also host HTAN and other pathology images not hosted by TCIA. You can review the complete, up-to-date list of .

    How to acknowledge IDC?

    Please cite the latest paper from the IDC team. Please also make sure you acknowledge the specific data collections you used in your analysis.

    Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National cancer institute imaging data commons: Toward transparency, reproducibility, and scalability in imaging artificial intelligence. Radiographics 43, (2023).

    Where do I learn more about other components of CRDC?

    The main website for the Cancer Research Data Commons (CRDC) is

    What about non-imaging data that accompanies IDC collections?

    Clinical data that was shared by the submitters is available for a number of imaging collections in IDC. Please see on how to search that data and how to link clinical data with imaging metadata!

    Many of the imaging collections are also accompanied by the genomics or proteomics data. CRDC provides the API to locate such related datasets.

    I want to search IDC content using an attribute not available in the portal

    IDC Portal gives you access to just a small subset of the metadata accompanying IDC images. If you want to learn more about what is available, you have several options:

    • from our Getting Started tutorial series explains how to use - a python package that aims to simplify access to IDC data

    • will help you get started with searching IDC metadata in BigQuery, which gives you access to all of the DICOM metadata extracted from IDC-hosted files

    • if you are not comfortable writing queries or coding in pyhon, you can use to search using some of the attributes that are not available through the portal. You can also to include additional attributes.

    Data model

    IDC relies on DICOM data model for organizing images and image-derived data. At the same time, IDC includes certain attributes and data types that are outside of the DICOM data model. The Entity-Relationship (E-R) diagram and examples below summarize a simplified view of the IDC data model (you will find the explanation of how to interpret the notation used in this E-R diagram in this page from Mermaid documentation).

    IDC content is organized in Collections: groups of DICOM files that were collected through certain research activity.

    Collections are organized into Programs, which group related collections, or those collections that were contributed under the same funding initiative or a consortium. Example: TCGA program contains TCGA-GBM, TCGA-BRCA and other collections. You will see Collections nested under Programs in the upper left section of the IDC Portal. You will also see the list of collections that meet the filter criteria in the top table on the right-hand side of the portal interface.

    Individual DICOM files included in the collection contain attributes that organize content according to the DICOM data model.

    Each collection will contain data for one or more case, or patient. Data for the individual patient is organized in DICOM studies, which group images corresponding to a single imaging exam/enconter, and collected in a given session. Studies are composed of DICOM series, which in turn consist of DICOM instances. Each DICOM instance correspond to a single file on disk. As an example, in radiology imaging, individual instances would correspond to image slices in multi-slice acquisitions, and in digital pathology you will see a separate file/instance for each resolution layer of the image pyramid. When using IDC Portal, you will never encounter individual instances - you will only see them if you download data to your computer.

    Analysis results collection is a very important concept in IDC. These contain analysis results that were not contributed as part of any specific collection. Such analysis results might be contributed by investigators unrelated to those that submitted the analyzed images, and may span images across multiple collections.

    Coding schemes

    One of the fundamental principles of DICOM is the use of controlled terminologies, or lexicons, or coding schemes (for the purposes of this guide, these can be used interchangeably). While using the DICOM data stored in IDC, you will encounter various situations where the data is captured using coded terms.

    Controlled terminologies define a set of codes, and sometimes their relationships, that are carefully curated to describe entities for a certain application domain. Consistent use of such terminologies helps with uniform data collection and is critical for harmonization of activities conducted by independent groups.

    When codes are used in DICOM, they are saved as triplets that consist of

    • CodeValue: unique identifier for a term

    • CodingSchemeDesignator: code for the authority that issued this code

    • CodeMeaning: human-readable code description

    DICOM relies on various sources of codes, all of which are listed in of the standard.

    As an example, if you query the view with the following query in the BQ console:

    You will see columns that contain coded attributes of the segment. In the example below, the value of AnatomicRegion corresponding to the segment is assigned the value (T-04000, SRT, Breast), where "SRT" is the coding scheme designator corresponding to the coding scheme.

    As another example, quantitative and qualitative measurements extracted from the SR-TID1500 objects are stored in the and views, respectively. If we query those views to see the individual measurements, they also show up as coded items. Each of the quantitative measurements includes a code describing the quantity being measured, the actual numeric value, and a code describing the units of measurement:

    DICOM Structured Reports

    DICOM SR uses data elements to encode a higher level abstraction that is a tree of content, where nodes of the tree and their relationships are formalized. SR-TID1500 is one of many standard templates that define constraints on the structure of the tree, and is intended for generic tasks involving image-based measurements. DICOM SR uses standard terminologies and codes to deliver structured content. These codes are used for defining both the concept names and values assigned to those concepts (name-value pairs). Measurements include coded concepts corresponding to the quantity being measured, and a numeric value accompanied by coded units. Coded categorical or qualitative values may also be present. In SR-TID1500, measurements are accompanied by additional context that helps interpret and reuse that measurement, such as finding type, location, method and derivation. Measurements computed from segmentations can reference the segmentation defining the region and the image segmented, using unique identifiers of the respective objects.

    At this time, only the measurements that accompany regions of interest defined by segmentations are exposed in the IDC Portal, and in the measurements views maintained by IDC!

    Open source DCMTK tool can be used to render the content of the DICOM SR tree in a human-readable form (you can see one example of such rendering ). Reconstructing this content using tools that operate with DICOM content at the level of individual attributes can be tedious. We recommend the tools referenced above that also provide capabilities for reading and writing SR-TID1500 content:

    • : high-level DICOM abstractions for the Python programming language

    • : open source DCMTK-based C++ library and command line converters that aim to help with the conversion between imaging research formats and the standard DICOM representation for image analysis results

    • : C++ library that provides API abstractions for reading and writing SR-TID1500 documents

    Tools referenced above can be used to 1) extract qualitative evaluations and quantitative measurements fro the SR-TID1500 document; 2) generate standard-compliant SR-TID1500 objects.

    SR-TID1500-specific metadata attributes are available in the table views maintained by IDC. See details .

    Original objects

    We differentiate between the original and derived DICOM objects in the IDC portal and discussions of the IDC-hosted data. By Original objects we mean DICOM objects that are produced by image acquisition equipment - MR, CT, or PET images fall into this category. By Derived objects we mean those objects that were generated by means of analysis or annotation of the original objects. Those objects can contain, for example, volumetric segmentations of the structures in the original images, or quantitative measurements of the objects in the image.

    Original objects

    Most of the images stored on IDC are saved as objects that store individual slices of the image in separate instances of a series, with the image stored in the PixelData attribute.

    As of production release, IDC contains both radiology and digital pathology images. The following publication can serve as a good introduction into the use of DICOM for digital pathology.

    Herrmann, M. D., Clunie, D. A., Fedorov, A., Doyle, S. W., Pieper, S., Klepeis, V., Le, L. P., Mutter, G. L., Milstone, D. S., Schultz, T. J., Kikinis, R., Kotecha, G. K., Hwang, D. H., Andriole, K. P., John Lafrate, A., Brink, J. A., Boland, G. W., Dreyer, K. J., Michalski, M., Golden, J. A., Louis, D. N. & Lennerz, J. K. Implementing the DICOM standard for digital pathology. J. Pathol. Inform. 9, 37 (2018).

    Open source libraries such as DCMTK, GDCM, ITK, and pydicom can be used to parse such files and load pixel data of the individual slices. Recovering geometry of the individual slices (spatial location and resolution) and reconstruction of the individual slices into a volume requires some extra consideration.

    You can use one of the existing tools to reconstruct image volume instead of implementing sorting of the slices on your own:

    • : command-line tool to convert neuroimaging data from the DICOM format to the NIfTI format

    • : open source software for image computation, which includes

    • : python library providing API and command-line tools for converting DICOM images into NIfTI format

    Data known issues

    1. Indexing of the collection of NSCLC-Radiomics by the Data Commons Framework is pending.

    2. QIN multi-site collection of Lung CT data with Nodule Segmentations: only items corresponding to the LIDC-IDRI original collection are included

    3. DICOM SR of clinical data and measurement for breast cancer collections to TCIA: only items corresponding to the ISPY1 original collection are included

    4. : Some of the segmentations in this collection are empty (as an example, SeriesNumber 42100 with SeriesDescription "VOI PE Segmentation thresh=70" in is empty).

    5. Due to the existing limitations of Google Healthcare API, not all of the DICOM attributes are extracted and are available in BigQuery tables. Specifically:

      • sequences that have more than 15 levels of nesting are not extracted (see ) - we believe this limitation does not affect the data stored in IDC

      • sequences that contain around 1MiB of data are dropped from BigQuery export and RetrieveMetadata output currently. 1MiB is not an exact limit, but it can be used as a rough estimate of whether or not the API will drop the tag (this limitation was not documented as of writing this) - we know that some of the instances in IDC will be affected by this limitation. The fix for this limitation is targeted for sometime in 2021, according to the communication with Google Healthcare support.

    Organization of data

    IDC provides a variety of interfaces to access both the data (as files) and metadata (to subset files and build cohorts). The flow of data and the relationship between the various components IDC uses is summarized in the following figure.

    We maintain the following resources to enable access to IDC data:

    • Cloud storage buckets: files maintained by IDC are mirrored between Google and AWS public storage buckets that provide fee-free egress without requiring login. The buckets organize files by DICOM series, each series stored in a separate folder. Given the large overall size of data in IDC, you will likely need to use one of the search interfaces to identify relevant series first.

    • BigQuery tables: collection-level metadata, DICOM metadata, clinical data tables available via SQL query interface.

    • Python API: pip-installable provides programmatic interface and command-line tools to search IDC data using most important metadata attributes, and to download files corresponding to the selected cohorts from the cloud buckets

    • : alternative language-independent API for selecting subsets of data

    • : DICOM files and metadata queries available from Google Healthcare DICOM stores

    DICOM Radiotherapy Structure Sets

    DICOM Radiotherapy Structure Sets (RTSS, or RTSTRUCT) define regions of interest by a set of planar contours.

    RTSS objects can be identified by the RTSTRUCT value assigned to the Modality attribute, or by SOPClassUID = 1.2.840.10008.5.1.4.1.1.481.3.

    If you use the IDC Portal, you can select cases that include RTSTRUCT objects by selecting "Radiotherapy Structure Set" in the "Original" tab, "Modality" section (filter link). Here is a sample study that contains an RTSS series.

    As always, you get most of the power in exploring IDC metadata when using SQL interface. As an example, the query below will select a random study that contains a RTSTRUCT series, and return a URL to open that study in the viewer:

    # get the viewer URL for a random study that 
    #  contains RTSTRUCT modality
    SELECT
      ANY_VALUE(CONCAT("https://viewer.imaging.datacommons.cancer.gov/viewer/", StudyInstanceUID)) as viewer_url
    FROM
      `bigquery-public-data.idc_current.dicom_all`
    WHERE
      StudyInstanceUID IN (
      # select a random DICOM study that includes an RTSTRUCT object
      SELECT
        StudyInstanceUID
      FROM
        `bigquery-public-data.idc_current.dicom_all`
      WHERE
        SOPClassUID = "1.2.840.10008.5.1.4.1.1.481.3"
      ORDER BY
        RAND()
      LIMIT
        1)

    Metadata

    RTSTRUCT relies on unstructured text in describing the semantics of the individual regions segmented. This information is stored in the StructureSetROISequence.ROIName attribute. The following query will return the list of all distinct values of ROIName and their frequency.

    Conversion into alternative representations

    We recommend tool for converting planar contours of the individual structure sets into volumetric representation.

    Accessing the API

    The following characteristics apply to all IDC APIs:

    • You access a resource by sending an HTTP request to the IDC API server. The server replies with a response that either contains the data you requested, or a status indicator.

    • An API request URL has the following structure: <BaseURL><API version><QueryEndpoint>?<QueryParameters>. For example, this curl command is a request for metadata on all IDC collections:

      curl -X GET "https://api.imaging.datacommons.cancer.gov/v1/collections" -H "accept: application/json"

    API Endpoints

    Authorization

    Some of the APIs, such as /collections and /cohorts/preview, can be accessed without authorization. APIs that access user specific data, such as cohorts, necessarily require account authorization.

    To access these APIs that require IDC authorization, you will need to generate a credentials file. To obtain your credentials:

    • Clone the to your local machine.

    • Execute the idc_auth.py script either through the command line or from within python. Refer to the idc_auth.py file for detailed instructions.

    Example usage of the generated authorization is demonstrated by code in the Google Colab notebook.

    Paged queries

    Several IDC APIs, specifically /cohorts/manifest/preview, /cohorts/manifest/{cohort_id}, /cohorts/query/preview, /cohorts/query/{cohort_id}, and /dicomMetadata, are paged. That is, several calls of the API may be required to return all the data resulting from such a query. Each accepts a _page_size query parameter that is the maximum number of objects that the client wants the server to return. The returned data from each of these APIs includes a next_page value. next_page is null if there is no more data to be returned. If next_page is non-null, then more data is available.

    There are corresponding queries, /cohorts/manifest/nextPage, /cohorts/query/nextPage, and /dicomMetadata/nextpage endpoints, that each accept two query parameters: next_page, and page_size. In the case that the returned next_page value is not null, the corresponding ../nextPage endpoint is accessed, passing the next_page token returned by the previous call.

    Timeouts

    The manifest and query endpoints may return an HTTP 202 error. This indicates that the request was accepted but processing timed out before it was completed. In this case the client should resubmit the request including the next_page token that was returned with the error response.

    Endpoint Details

    This page provides details on each of the IDC API endpoints.

    Getting started

    Overview

    The Imaging Data Commons Portal provides a web-based interactive interface to browse the data hosted by IDC, visualize images, build manifests describing selected cohorts, and download images defined by the manifests.

    The slides below give a quick guided overview of how you can use IDC Portal.

    No login is required to use the portal, to visualize images, or to download data from IDC!

    Exploring and subsetting data

    Overview

    Components on the left side of the page give you controls for configuring your selection:

    • Search scope allows you to limit your search to just the specific programs, collections and analysis results (as discussed in the documentation of the IDC Data model).

    • Search configuration gives you access to a small set of metadata attributes to select DICOM studies (where "DICOM studies" fit into IDC data model is also discussed in the page) that contain data that meets the search criteria.

    Panels on the right side will automatically update based on what you select on the left side!

    • Selection configuration reflects the active search scope/filters in the Cohort Filters section. You can download all of the studies that match your filters. Below you will see the Cart section. Cart is helpful when selecting data by individual filters is too imprecise, and you want to have more granular control over your selection by selecting specific collections/patients/studies/series.

    • Filtering results section consists of the tables containing matching content that you can navigate following IDC Data model: first table shows the matching collections, selecting a collection will list matching cases (patients), selection of a case will populate the next table listing matching studies for the patient, and finally selecting a study will expand the final table with the list of series included in the study.

    In the following sections of the documentation you will learn more about each of the items we just discussed.

    Introduction

    Data sources

    Most of the data in IDC is received from the data collection initiatives/projects supported by US National Cancer Institute. Whenever source images or image-derived data is not in the DICOM format, it is harmonized into DICOM as part of the ingestion.

    As of data release v21, IDC sources of data include:

    DICOM Segmentations

    DICOM Segmentation object (SEG) can be identified by SOPClassUID= 1.2.840.10008.5.1.4.1.1.66.4 Unlike most "original" image objects that you will find in IDC, SEG belongs to the family of enhanced multiframe image objects, which means that it stores all of the frames (slices) in a single object. SEG can contain multiple segments, a segment being a separate label/entity being segmented, with each segment containing one or more frames (slices). All of the frames for all of the segments are stored in the PixelData attribute of the object.

    If you use the IDC Portal, you can select cases that include SEG objects by selecting "Segmentations" in the "Modality" section () under the "Original" tab . Here is that contains a SEG series.

    You can further explore segmentations available in IDC via the "Derived" tab of the Portal by filtering those by specific types and anatomic locations. As an example, will select cases that contain segmentations of a nodule.

    Derived objects

    Make sure you complete IDC notebooks to get introduced into IDC data organization, download, visualization and other first-order topics.

    In this section we discuss derived DICOM objects, including annotations, that are stored in IDC. It is important to recognize that, in practice, annotations are often shared in non-standard formats. When IDC ingests a dataset where annotations are available in such a non-standard representation, those need to be harmonized into a suitable DICOM object to be available in IDC. Due to the complexity of this task, we are unable to perform such harmonization for all of the datasets. If you want to check if there are annotations in non-DICOM format available for a given collection, you should locate the original source of the data, and examine the accompanying documentation for available non-DICOM annotations.

    As an example, the collection is available in IDC. If you mouse over the name of that collection in the IDC Portal, the tooltip will provide the overview of the collection and the link to the source.

    Using VS Code with GCP VMs

    has a useful feature of allowing you to develop code on a remote VM from the convenience of your desktop. You can follow the steps below to configure your development environment for this task.

    Prerequisites

    • `` installed on your computer

    Downloading data
    Curation policy
    The Cancer Imaging Archive (TCIA)
    [email protected]
    download data from IDC
    Google Colab
    NIH Cloud Lab
    ACCESS program allocations
    The Cancer Imaging Archive (TCIA)
    collections included in IDC
    https://doi.org/10.1148/rg.230180
    https://datacommons.cancer.gov/
    this tutorial
    Cancer Data Aggregator (CDA)
    this notebook
    idc-index
    this more advanced notebook
    this DataStudio dashboard
    extend this dashboard
    dsr2html
    here
    highdicom
    dcmqi
    DCMTK
    here
    SimpleITK: python interface to the Insight Toolkit, includes DICOM read functionality.
    PubMed
    dcm2niix
    Plastimatch
    command-line file format converter
    dicom2nifti
    ISPY1 (ACRIN 6657)
    this study
    https://cloud.google.com/bigquery/docs/nested-repeated
    IDC-Examples git repository
    How_to_use_IDC_APIs.ipynb
    SELECT
      structureSetROISequence.ROIName AS ROIName,
      COUNT(DISTINCT(SeriesInstanceUID)) AS ROISeriesCount
    FROM
      `bigquery-public-data.idc_current.dicom_all`
    CROSS JOIN
      UNNEST (StructureSetROISequence) AS structureSetROISequence
    WHERE
      SOPClassUID = "1.2.840.10008.5.1.4.1.1.481.3"
    GROUP BY
      ROIName
    ORDER BY
      ROISeriesCount DESC
    Plastimatch convert
    Metadata

    Metadata describing the segments is contained in the SegmentSequence of the DICOM object, and is also available in the BigQuery table view maintained by IDC in the bigquery-public-data.idc_current.segmentations BigQuery table. That table contains one row per segment, and for each segment includes metadata such as algorithm type and structure segmented.

    Conversion into alternative representations

    We recommend you use one of the following tools to interpret the content of the DICOM SEG and convert it into alternative representations:

    • dcmqi: open source DCMTK-based C++ library and command line converters that aim to help with the conversion between imaging research formats and the standard DICOM representation for image analysis results

    • highdicom: high-level DICOM abstractions for the Python programming language

    • DCMTK: C++ library that provides API abstractions for reading and writing SEG objects

    Tools referenced above can be used to 1) extract volumetrically reconstructed mask images corresponding to the individual segments stored in DICOM SEG; 2) extract segment-specific metadata describing its content; 3) generate standard-compliant DICOM SEG objects from research formats.

    SEG-specific metadata attributes are available in the table views maintained by IDC. See details here.

    filter link
    a sample study
    this filter
    # get the viewer URL for a random study that 
    #  contains SEG modality
    SELECT
      ANY_VALUE(CONCAT("https://viewer.imaging.datacommons.cancer.gov/viewer/", StudyInstanceUID)) as viewer_url
    FROM
      `bigquery-public-data.idc_current.dicom_all`
    WHERE
      StudyInstanceUID IN (
      # select a random DICOM study that includes a SEG object
      SELECT
        StudyInstanceUID
      FROM
        `bigquery-public-data.idc_current.dicom_all`
      WHERE
        SOPClassUID = "1.2.840.10008.5.1.4.1.1.66.4"
      ORDER BY
        RAND()
      LIMIT
        1)

    all DICOM files from the public collections are mirrored in IDC

  • a subset of digital pathology collections and analysis results harmonized from vendor-specific representation (as available from TCIA) into DICOM Slide Microscopy (SM) format

  • Childhood Cancer Data Initiative (CCDI) (ongoing)

    • digital pathology slides harmonized into DICOM SM

  • Genomic Data Commons (GDC)

    • The Cancer Genome Atlas (TCGA) slides harmonized into DICOM SM

  • Human Tumor Atlas Network (HTAN)

    • release 1 of the HTAN data harmonized into DICOM SM

  • National Library of Medicine Visible Human Project

    • v1 of the Visible Human images harmonized into DICOM MR/CT/XC

  • Genotype-Tissue Expression Project (GTex)

    • digital pathology slides harmonized into DICOM SM

  • The list of all of the IDC collections is available in IDC Portal here: https://portal.imaging.datacommons.cancer.gov/collections/.

    Data provenance

    Whenever IDC replicates data from a publicly available source, we include the reference to the origin:

    • from the IDC Portal Explore page, click on the "i" icon next to the collection in the collections list

    • source_doi metadata column contains Digital Object Identifier (DOI) at the granularity of the individual files and is available both via python idc-index package (see this tutorial on how to access it) and BigQuery interfaces

    Whenever source data is harmonized into DICOM, the DOI will correspond to a Zenodo entry for the result of harmonization, which in turn will reference the location where data can be accessed in the native format (if available). As an example, IDC NLM-Visible-Human-Project collection refers to this DOI that describes the dataset resulting from the original dataset harmonized into DICOM https://doi.org/10.5281/zenodo.12690049, which in turn references the NLM Visible Human project page containing information on accessing the original files collected by the project.

    Check out Data release notes for information about the collections added in the individual IDC data releases.

    Data ingestion process

    Simplified workflow for IDC data ingestion is summarized in the following diagram.

    The Cancer Imaging Archive (TCIA) (ongoing)

    You will also find the link to the source in the list of collections available in IDC.

    Finally, if you select data using SQL, you can use the source_DOI and/or the source_URL column to identify the source of each file in the subset you selected (learn more about source_DOI, licenses and attribution in the part 3 of our Getting started tutorial).

    For the collection in question, the source DOI is https://doi.org/10.7937/e4wt-cd02, and on examining that page you will see a pointer to the CSV file with the coordinates of the bounding boxes defining regions containing lesions.

    Non-standard annotations are not searchable, usually are not possible to visualize in off-the-shelf tools, and require custom code to interpret and parse. The situation is different for the DICOM derived objects that we discuss in the following sections.

    DICOM derived objects

    In IDC we define "derived" DICOM objects as those that are obtained by analyzing or post-processing the "original" image objects. Examples of derived objects can be annotations of the images to define image regions, or to describe findings about those regions, or voxel-wise parametric maps calculated for the original images.

    Although DICOM standard provides a variety of mechanisms that can be used to store specific types of derived objects, most of the image-derived objects currently stored in IDC fall into the following categories:

    • voxel segmentations stored as DICOM Segmentation objects (SEG)

    • segmentations defined as a set of planar regions stored as DICOM Radiotherapy Structure Set objects (RTSTRUCT)

    • quantitative measurements and qualitative evaluations for the regions defined by DICOM Segmentations, those will be stored as a specific type of DICOM Structured Reporting (SR) objects that follows DICOM SR template TID 1500 "Measurements report" (SR-TID1500)

    The type of the object is defined by the object class unique identifier stored in the SOPClassUID attribute of each DICOM object. In the IDC Portal we allow the user to define the search filter based on the human-readable name of the class instead of the value of that identifier.

    You can find detailed descriptions of these objects applied to specific datasets in TICA in the following open access publications:

    Fedorov, A., Clunie, D., Ulrich, E., Bauer, C., Wahle, A., Brown, B., Onken, M., Riesmeier, J., Pieper, S., Kikinis, R., Buatti, J. & Beichel, R. R. DICOM for quantitative imaging biomarker development: a standards based approach to sharing clinical data and structured PET/CT analysis results in head and neck cancer research. PeerJ 4, e2057 (2016). https://peerj.com/articles/2057/

    Fedorov, A., Hancock, M., Clunie, D., Brochhausen, M., Bona, J., Kirby, J., Freymann, J., Pieper, S., J W L Aerts, H., Kikinis, R. & Prior, F. DICOM re-encoding of volumetrically annotated Lung Imaging Database Consortium (LIDC) nodules. Med. Phys. (2020). doi:10.1002/mp.14445

    "Getting started" tutorial
    Breast-Cancer-Screening-DBT

    Visual Studio Code installed on your computer

  • A GCP VM you want to use for code development is up and running

  • Step 1: Install "Remote - SSH" extension

    Step 2: Populate SSH config files

    Run the following command to populate SSH config files with host entries for each VM instance you have running

    Step 3: Connect to host

    If the previous step completed successfully, you should see the running VMs in the Remote Explorer of VS Code, as in the screenshot below, and should be able to open a new session to those remove VMs.

    Note that the SSH configuration may/will change if you restart your VM. In this case you will need to re-configure (re-run step 2 above).

    Visual Studio Code
    gcloud SDK
    PS3.16 Section 8
    canceridc-data:idc_views.segmentations
    SNOMED-CT
    canceridc-data:idc_views.quantitative_measurements
    canceridc-data:idc_views.qualitative_measurements
    idc-index package
    REST API
    DICOMweb
    IDC Data model

    DICOM stores

    If you would like to access IDC data via DICOMweb interface, you have two options:

    1. IDC-maintained DICOM store available via proxy

    2. DICOM store maintained by Google Healthcare

    In the following we provide details for each of those options.

    IDC-maintained DICOM store via proxy

    This store contains all of the data for the current IDC data release. It does not require authentication and is available via the following DICOMweb URL of the proxy (you can ignore the "viewer-only-no-downloads" part in the URL, it is a legacy constraint that is no longer applicable).

    DICOMweb URL:

    Limitations:

    • since all requests go through the proxy before reaching the DICOM store, you may experience reduced performance as compared to direct access you can achieve using the store described in the following section

    • there are per-IP and overall daily quotas, as described in IDC , that may not be sufficient for your use case

    DICOM store maintained by Google Healthcare

    This store replicates all of the data from the idc-open-data bucket, which contains most of the data in IDC (learn more about the organization of data in IDC buckets from ).

    DICOMweb URL (note the store name includes the IDC data release version that corresponds to its content: idc-store-v21):

    This DICOM store is documented in .

    Limitations:

    • most, but not all of the IDC data is available in this store

    • authentication with a Google account is required (anyone signed in with a Google account can access this interface, no whitelisting is required!)

    • since this DICOM store is not maintained directly by the IDC team, it may lag behind the latest IDC release in content in the future

    DICOMweb usage tutorials

    Check out and the accompanying Colab notebook to learn more.

    Frequently Asked Questions

    When you say most data is available in the Google Healthcare maintained DICOM store - what is "most"?

    TL;DR: as of IDC v21, it is 95.89% of all of the DICOM series available in IDC (IDC-maintained DICOM store has all of the 100%).

    Google Healthcare maintained DICOM store contains the latest versions of the DICOM series stored in the idc-open-data Google Storage bucket (see for details on buckets organization).

    You can get the exact number of DICOM series in each of the buckets with the following python code (before running it, do pip install --upgrade idc-index):

    As of IDC v21, the result of running the code above is the following, showing that 95.89% of DICOM series in IDC are available from the Google Healthcare maintained DICOM store (IDC-maintained DICOM store has all of the 100%).

    Are the two DICOM stores kept in sync?

    TL;DR: our goal is to have the two stores in sync within 1-2 weeks of each IDC data release.

    The DICOM store maintained by IDC is updated by the IDC team with each new release.

    The DICOM store maintained by Google Healthcare is populated after the release. We hope to have that done within 1-2 weeks after the IDC release. As a new release of IDC data is out, there will be a new DICOM store maintained by Google Healthcare, and the connection to the IDC release version will be indicated in the store name. I.e., when IDC v22 is released, whenever you are able to access https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/ idc-store-v22/dicomWeb , it is expected to be in sync.

    Colab notebooks

    This section contains various pointers that may be helpful when working with Google Colab.

    Google Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources including GPUs.

    If you would like to share an interesting notebook that uses IDC/cloud for imaging research, please let us know and we would be happy to review and reference it here!

    IDC maintained notebooks

    IDC Colab example notebooks are maintained in this repository:

    Notebook demonstrating deployment and application of abdominal structures segmentation tool to IDC data, developed for the course:

    Community-contributed notebooks

    • , contributed by , Mayo Clinic

    • , contributed by , Mayo Clinic

    • Notebooks contributed by , ISB-CGC, demonstrating the utility of BigQuery in correlative analysis of radiomics and genomics data:

    Relevant resources

    • Colab limitations:

    • Transferring data between Colab and Google Drive:

    • Potentially interesting sources of example notebooks:

    Introduction to DICOM

    IDC relies on DICOM for data modeling, representation and communication. Most of the data stored in IDC is in DICOM format. If you want to use IDC, you (hopefully!) do not need to become a DICOM expert, but you do need to have a basic understanding of how DICOM data is structured, and how to transform DICOM objects into alternative representations that can be used by the tools familiar to you.

    This section is not intended to be a comprehensive introduction to the standard, but rather a very brief overview of some of the concepts that you will need to understand to better use IDC data.

    If you are looking for an introduction to the DICOM standard, consider this book:

    Digital Imaging and Communications in Medicine (DICOM): A Practical Introduction and Survival Guide, 2nd Edition by Pianykh, Oleg S. published by Springer (2011). https://www.springer.com/gp/book/9783642108495

    As discussed in , the main mechanism for accessing the data stored in IDC is by using the storage buckets that contain individual files indexed through other interfaces. Each of the files in the IDC-maintained storage buckets encodes a DICOM object. Each DICOM object is a collection of data elements or attributes. Below is an example of a subset of attributes in a DICOM object, as generated by the IDC OHIF Viewer (which can be toggled by clicking the "Tag browser" icon in the IDC viewer toolbar):

    The standard defines constraints on what kind of data each of the attributes can contain. Every single attribute defined by the standard is listed in the , which defines those constraints:

    • Value Representation (VR) defines the type of the data that data element can contain. There are 27 DICOM VRs, and they are defined in .

    • Value Multiplicity (VM) defines the number of items of the prescribed VR that can be contained in a given data element.

    What attributes are included in a given object is determined by the type of object (or, to follow the DICOM nomenclature, Information Object). is dedicated to the definitions (IODs) of those objects.

    It is critical to recognize that while all of the DICOM files at the high level are structured exactly in the same way and follow the same syntax and encoding rules, interpretation of the content of an individual file is dependent on the specific type of object it encodes!

    How do you know what object is encoded in a given file (or instance of the object, using the DICOM lingo)? For this purpose there is an attribute SOPClassUID that uniquely identifies the class of the encoded object. The content of this attribute is not easy to interpret, since it is a unique identifier. To map it to the specific object class name, you can consult the complete list of object classes available in .

    When you use the IDC portal to build your cohort, unique identifiers for the object classes are mapped to their names, which are available under the "Object class" group of facets in the search interface.

    A somewhat related attribute that hints at the type of object is Modality, which is defined by the standard as "Type of equipment that originally acquired the data used to create the images in this Series", and is expected to take one of the values from . However, Modality is not equivalent to SOPClassUID, and should not be used as a substitute. As an example it is possible that data derived from the original modality could be saved as a different object class, but keep the value of modality identical.

    You can see all of the attributes of a DICOM instance using IDC Viewer by clicking the "Tag browser" button in the viewer toolbar, or you can use one of the many publicly available tools to examine the DICOM attributes of a downloaded DICOM file.

    Using a BQ Manifest to Load DICOM Files onto a VM

    Once a manifest has been created, typically the next step is to load the files onto a VM for analysis, and the easiest way to do this is to create your manifest in a BigQuery table and then use that to direct the file loading onto a VM. This guide shows how this can be done,

    Step 1: Export a file manifest for your cohort into BigQuery.

    The first step is to export a file manifest for a cohort into BigQuery. You will want to copy this table into the project where you are going to run your VM. Do this using the Google BQ console, since the exported table can be accessed only using your personal credentials provided by your browser. The table copy living in the VM project will be readable by the service account running your VM.

    Step 2: Start up a VM

    Start up your VM. If you have many files, you will want to speed the loading process by using a VM with multiple CPUs. Google describes the various , but is not very specific about ingress bandwidth. However, in terms of published egress bandwidth, the larger machines certainly have more. Experimentation showed that an n2-standard-8 (8 vCPUs, 32 GB memory) machine could load 20,000 DICOM files in 2 minutes and 32 secconds, using 16 threads on 8 CPUs. That configuration reached a peak throughput of 68 MiB/s.

    You also need to insure the machine has enough disk space. One of the checks in the script provided below is to calculate the total file load size. You might want to run that portion of the script and resize the disk as needed before actually doing the load.

    Step 3: Install the code provided

    performs the following steps:

    • Performs a query on the specified BigQuery manifest table and creates a local manifest file on your VM.

    • Performs a query that maps the GCS URLs of each file into DICOM hierarchical directory paths, and writes this out as a local TSV file on your VM.

    • Performs a query that calculates the total size of all the downloads, and reports back if there is sufficient space on the filesystem to continue.

    To install the code on your VM and then setup the environment:

    You then need to customize the settings in the script:

    Finally, run the script:

    IDC team

    Imaging Data Commons is being developed by a team of engineers and imaging scientists with decades of experience in cancer imaging informatics, cloud computing, imaging standards, security, open source tool development and data sharing.

    Our team includes the following sites and project leads:

    • Brigham and Women's Hospital, Boston, MA, USA (BWH)

      • Andrey Fedorov, PhD, and Ron Kikinis, MD - Co-PIs of the project

    idc-index and 3D Slicer

    Depending on whether you would like to download data interactively or programmatically, we provide two recommended tools to help you.

    Command-line or programmatic download: idc-index python package

    is a python package designed to simplify access to IDC data. Assuming you have Python installed on your computer (if for some reason you do not have Python, you can check out legacy download instructions ), you can get this package with pip like this:

    Once installed, you can use it to explore, search, select and download corresponding files as shown in the examples below. You can also take a look at a short tutorial on using idc-index

    Resolving CRDC Globally Unique Identifiers (GUIDs)

    An IDC manifest may include study and/or series GUIDs that can be resolved to the underlying DICOM instance files in GCS. Such use of GUIDs in a manifest enables a much shorter manifest compared to a list of per-instance GCS URLs. Also, as explained below, a GUID is expected to be resolvable even when the data which it represents has been moved.

    From the specification:

    "The Data Repository Service (DRS) API provides a generic interface to data repositories so data consumers, including workflow systems, can access data objects in a single, standard way regardless of where they are stored and how they are managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID."

    In IDC, we use the term GUID

    Accessing the API

    IDC API UI

    The can be used to see details about the syntax for each call, and also provides an interface to test requests.

    Make a Request

    $ gcloud compute config-ssh
    SELECT
      *
    FROM
      `canceridc-data.idc_views.segmentations`
    LIMIT
      10

    How to compare tumor features with gene expression data

  • How to compare tumor features with mutation data

  • SimpleITK notebooks
  • https://github.com/mdai/ml-lessons/

  • http://uwmsk.org/jupyter/

  • https://github.com/JoaoSantinha/Medical_Image_Analysis_Workshop

  • Google Colab Tips for Power Users: https://amitness.com/2020/06/google-colaboratory-tips/

  • Mounting GCS bucket using gcsfuse: https://pub.towardsai.net/connect-colab-to-gcs-bucket-using-gcsfuse-29f4f844d074

  • Almost-free Jupyter Notebooks on Google Cloud: https://www.tensorops.ai/post/almost-free-jupyter-notebooks-on-google-cloud

  • https://github.com/ImagingDataCommons/IDC-Tutorials/tree/master/notebooks
    2021 RSNA Deep Learning Lab
    https://tinyurl.com/RSNA21-IDC-TCIA
    Deep-learning Nodule Segmentation using LIDC dataset on Google Cloud Platform
    Kuan (Kevin) Zhang
    3D Brain MRI Classification Using Imaging Data Commons, Google Cloud, and NIFTI files
    Pouria Rouzokh
    Fabian Seidl
    https://research.google.com/colaboratory/faq.html
    https://gist.github.com/yt114/dc5d2fd4437f858bb73e38f0aba362c7

    Deepa Krishnaswamy, PhD

  • Katie Mastrogiacomo

  • Maria Loy

  • Institute for Systems Biology, Seattle, WA, USA (ISB)

    • David Gibbs, PhD - site PI

    • William Clifford, MS

    • Suzanne Paquette, MS

  • General Dynamics Information Technology, Bethesda, MD, USA (GDIT)

    • David Pot, PhD - site PI

    • Fabian Seidl

  • Fraunhofer MEVIS, Bremen, Germany (Fraunhofer MEVIS)

    • André Homeyer, PhD - site PI

    • Daniela Schacherer, MS

    • Henning Höfener, PhD

  • Massachusetts General Hospital, Boston, MA, USA (MGH)

    • Chris Bridge, DPhil - site PI

  • Radical Imaging LLC, Boston, MA, USA (Radical Imaging)

    • Rob Lewis, PhD - site PI

    • Igor Octaviano

  • PixelMed Publishing, Bangor, PA, USA (PixelMed)

    • David Clunie, MB, BS - site PI

  • Isomics Inc, Cambridge, MA, USA (Isomics)

    • Steve Pieper, PhD - site PI

  • Oversight:

    • Leidos Biomedical Research

      • Ulrike Wagner - project manager

      • Todd Pihl - project manager

    • National Cancer Institute

      • Erika Kim - federal lead

      • Granger Sutton - federal lead

    IDC Alumni

    We are grateful to the following individuals who contributed to IDC in the past, but are no longer directly involved in the development of IDC.

    • William Longabaugh, MS (ISB)

    • George White (ISB)

    • Ilya Shmulevich, PhD (ISB)

    • Poojitha Gundluru (GDIT)

    • Prema Venkatesun (GDIT)

    • Chris Gorman, PhD (MGH)

    • Pedro Kohler (Radical Imaging)

    • Hugo Aerts, PhD (BWH)

    • Cosmin Ciausu, MS (BWH)

    • Keyvan Farahani (NCI)

    • Markus Herrmann (MGH)

    • Davide Punzo (Radical Imaging)

    • James Petts (Radical Imaging)

    • Erik Ziegler (Radical Imaging)

    • Gitanjali Chhetri (Radical Imaging)

    • Rodrigo Basilio (Radical Imaging)

    • Jose Ulloa (Radical Imaging)

    • Madelyn Reyes (GDIT)

    • Derrick Moore (GDIT)

    • Mark Backus (GDIT)

    • Rachana Manandhar (BWH)

    • Rasmus Kiehl (Fraunhofer MEVIS)

    • Chad Osborne (GDIT)

    • Afshin Akbarzadeh (BWH)

    • Dennis Bontempi (BWH)

    • Vamsi Thiriveedhi (BWH)

    • Jessica Cienda (GDIT)

    • Bernard Larbi (GDIT)

    • Mi Tian (ISB)

    IDC team during face-to-face project meeting at ISB, May 11, 2023
    Proxy policy
    this documentation article
    https://cloud.google.com/healthcare-api/docs/resources/public-datasets/idc
    this tutorial
    this article
    Uses a multi-threaded bucket reader to pull the files from the GCS buckets and places them in the appropriate DICOM hierarchical directory.
    machine types
    This Python script
    to mean a persistent identifier that can be resolved to a GA4GH DrsObject. GUID persistence ensures that the data which the GUID represents can continue to be located and accessed even if it has been moved to a different hosting site.

    As described in the Data Versioning section, a UUID identifies a particular version of an IDC data object. There is a UUID for every version of every DICOM instance, series, and study in IDC hosted data. Each such UUID can be used to form a GUID that is registered by the NCI Cancer Research Data Commons (CRDC), and can be used to access the data that defines that object.

    This is a typical UUID: 641121f1-5ca0-42cc-9156-fb5538c14355 of a (version of a) DICOM instance, and this is the corresponding CRDC GUID: dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355

    A GUID can be resolved by appending it to this URL, which is the GUID resolution service within CRDC: https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/ . For example, the following curl command:

    >> curl https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355

    returns:

    which is a DrsObject. Because we resolved the GUID of an instance, the access_methods in the returned DrsObject includes a URL at which the corresponding DICOM entity can be accessed.

    When the GUID of a series is resolved, the DrsObject that is returned does not include access methods because there are no series file objects. Instead, the contents component of the returned DrsObject contains the URLs that can be accessed to obtain the DrsObjects of the instances in the series. Thus, we see that when we resolve dg.4DFC/cc9c8541-949d-48d9-beaf-7028aa4906dc, the GUID of the series containing the instance above:

    curl -o foo https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/cc9c8541-949d-48d9-beaf-7028aa4906dc

    we see that the contents component includes the GUID of that instance as well as the GUID of another instance:

    Similarly, the GUID of a DICOM study resolves to a DrsObject whose contents component consists of the GUIDs of the series in that study.

    At this time, most GUIDs have not been registered with the CRDC. If such a GUID is presented to the CRDC for resolution, an HTTP 404 error is returned.

    As discussed in the Organization of data section of this document, the DICOM instance file naming convention changed with IDC version 2. At this time, when an instance GUID is resolved, the returned DrsObject may method may include a URI to the V1 GCS bucket location. Those GUID will re-indexed such that in the future they point to the new GCS bucket location.

    GA4GH Data Repository Service API
    For a quick demonstration of the syntax of an API call, test the GET/collections request. You can experiment with this endpoint by clicking the ‘Try it out’ button.

    The API will return collection metadata for the current IDC data version. The request can be run by clicking on the ‘Execute’ button.

    Request Response

    The Swagger UI submits the request and shows the curl command that was submitted. The ‘Response body’ section will display the response to the request. The expected format of the response to this API request is shown below:

    The actual JSON formatted response can be downloaded by selecting the ‘Download’ button.

    The syntax for all of API data structures is detailed at the bottom of the UI page.API Endpoints

    Command line API access

    The API can be accessed from the command line using curl or wget. Here we discuss using curl for this purpose.

    Make a request

    You access an API endpoint by sending an HTTP request to the IDC API server. The server replies with a response that either contains the data you requested, or a status indicator. An API request URL has the following structure:

    <BaseURL><API version><QueryEndpoint>?<QueryParameters>.

    The <BaseURL> of the IDC API is https://api.imaging.datacommons.cancer.gov. For example, this curl command requests metadata on all IDC collections from the V2 API:

    curl -X GET "https://api.imaging.datacommons.cancer.gov/v2/collections" -H "accept: application/json" Note, also, that the HTTP method defaults to GET. However, a POST or DELETE HTTP method must be specified with the -X parameter.

    The IDC API UI displays the curl commands which it issues and thus can be a good reference when constructing your own curl commands.

    Programmed Access

    We expect that most API access will be programmed access, and, moreover, that most programmed access will be within a Python script using the Python Requests package. This usage is covered in detail (along with details on each of the IDC API endpoints) in the How_to_use_the_IDC_V2_API Google Colab notebook. Here we provide just a brief overview.

    In Python, we can issue the following request to obtain a list of the collections in the current IDC version:

    Paged queries

    The /cohorts/manifest/preview endpoints are paged. That is, several calls of the API may be required to return all the data resulting from such a query. Each endpoint accepts a page_size parameter in the manifestBody or manifestPreviewBody that is the maximum number of rows that the client wants the server to return. The returned data from each of these APIs includes a next_page value. next_page is null if there is no more data to be returned. If next_page is non-null, then more data is available.

    In the case that the returned next_page value is not null, the /cohorts/manifest/preview/nextPage endpoint can be accessed, passing the next_page token returned by the previous call.

    Timeouts

    The manifest endpoints may return an HTTP 202 error. This indicates that the request was accepted but processing timed out before it was completed. In this case, the client should resubmit the request including the next_page token that was returned with the error response.

    IDC API UI
    https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb
    https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v21/dicomWeb
    from idc_index import IDCClient
    
    c=IDCClient()
    
    query = """
    SELECT aws_bucket, COUNT(DISTINCT(SeriesInstanceUID)) AS num_series
    FROM index
    GROUP BY aws_bucket
    ORDER BY num_series DESC
    """
    
    c.sql_query(query)
    
    aws_bucket	   num_series
    idc-open-data	   911781
    idc-open-data-cr    34634
    idc-open-data-two    4473
    sudo apt-get install -y git # If you have a fresh VM and need git:
    cd ~
    git clone https://github.com/ImagingDataCommons/IDC-Examples.git
    cd IDC-Examples/scripts
    chmod u+x *.sh
    ./setupVM.sh
        TABLE = 'your-project-id.your-dataset.your-manifest-table' # BQ table with your manifest
        MANIFEST_FILE = '/path-to-your-home-dir/BQ-MANIFEST.txt' # Where will the manifest file go
        PATHS_TSV_FILE = '/path-to-your-home-dir/PATHS.tsv' # Where will the path file go
        TARG_DIR = '/path-to-your-home-dir/destination' # Has to be on a filesystem with enough space. Directory should exist.
        PAYING = 'your-project-id' # Needed for IDC requester pays buckets though it is free to crossload to a cloud VM
        THREADS = 16 # (2 * number) of cpus seems to work best
    ~/IDC-Examples/scripts/runManifestPull.sh
    {
       "access_methods":[
          {
             "access_id":"gs",
             "access_url":{
                "url":"gs://idc-open/641121f1-5ca0-42cc-9156-fb5538c14355.dcm"
             },
             "region":"",
             "type":"gs"
          }
       ],
       "aliases":[
    
       ],
       "checksums":[
          {
             "checksum":"f338e8c5e3d8955d222a04d5f3f6e2b4",
             "type":"md5"
          }
       ],
       "contents":[
    
       ],
       "created_time":"2020-09-18T02:14:02.830862",
       "description":null,
       "form":"object",
       "id":"dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355",
       "mime_type":"application/json",
       "name":null,
       "self_uri":"drs://nci-crdc.datacommons.io/dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355",
       "size":"135450",
       "updated_time":"2020-09-18T02:14:02.830868",
       "version":"9e13fb30"
    }
    {
       "aliases":[
    
       ],
       "checksums":[
          {
             "checksum":"0512207cb222fa2f085bc110c8474fa2",
             "type":"md5"
          }
       ],
       "contents":[
          {
             "drs_uri":"drs://nci-crdc.datacommons.io/dg.4DFC/ccafd781-ef39-4d39-ad74-e09de1ada476",
             "id":"dg.4DFC/ccafd781-ef39-4d39-ad74-e09de1ada476",
             "name":null
          },
          {
             "drs_uri":"drs://nci-crdc.datacommons.io/dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355",
             "id":"dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355",
             "name":null
          }
       ],
       "created_time":"2020-12-04T19:11:58.072088",
       "description":"",
       "form":"bundle",
       "id":"dg.4DFC/cc9c8541-949d-48d9-beaf-7028aa4906dc",
       "mime_type":"application/json",
       "name":"dg.4DFCcc9c8541-949d-48d9-beaf-7028aa4906dc",
       "self_uri":"drs://nci-crdc.datacommons.io/dg.4DFC/cc9c8541-949d-48d9-beaf-7028aa4906dc",
       "size":270902,
       "updated_time":"2020-12-04T19:11:58.072094",
       "version":""
    }
    {
      "collections": [
        {
          "cancer_type": "string",
          "collection_id": "string",
          "date_updated": "string",
          "description": "string",
          "doi": "string",
          "image_types": "string",
          "location": "string",
          "species": "string",
          "subject_count": 0,
          "supporting_data": "string",
        }
      ],
      "code": 200
    }
    response = requests.get("https://api.imaging.datacommons.cancer.gov/v2/collections")
    collections = response.json['collections']
    .

    Command line download interface

    With the idc-index package you get command line scripts that aim to make download simple.

    Have a .s5cmd manifest file you downloaded from IDC Portal or from the records in the IDC Zenodo community? Get the corresponding files as follows (you will also get download progress bar and the downloaded files will be organized in the collection/patient/study/series folder hierarchy!):

    You can use the same command to download files corresponding to any collection, patient, study or series, referred to by the identifiers you can copy from the portal!

    Copy collection ID from the IDC Portal interface

    Similarly, you can copy identifiers for patient/study/series and download the corresponding content!

    Programmatic download

    idc-index includes a variety of other helper functions, such as download from the manifest created using IDC portal, automatic generation of the viewer URLs, information about disk space needed for a given collection, and more. We are very interested in your feedback to define the additional functionality to add to this package! Please reach out via IDC Forum if you have any suggestions.

    Interactive download: 3D Slicer SlicerIDCBrowser extension

    3D Slicer is a free open source, cross-platform, extensible desktop application developed to support a variety of medical imaging research use cases.

    IDC maintains SlicerIDCBrowser, an extension of 3D Slicer, developed to support direct access to IDC data from your desktop. You will need to install a recent 3D Slicer 5.7.0 preview application (installers are available for Windows, Mac and Linux), and next use 3D Slicer ExtensionManager to install SlicerIDCBrowser extension. Take a look at the quick demo video in this post if you have never used 3D Slicer ExtensionManager before.

    Once installed, you can use SlicerIDCBrowser in one of the two modes:

    1. As an interface to explore IDC data: you can select individual collections, cases and DICOM studies and download items of interest directly into 3D Slicer for subsequent visualization and analysis.

    2. As download tool: download IDC content based on the manifest you created using IDC Portal, or identifiers of the individual cases, DICOM studies or series.

    Copy identifiers for the studies/series of interest from the IDC Portal
    Insert the identifiers in the appropriate fields, or download content defined by the s5cmd manifest
    idc-index
    here
    here
    Data organization
    DICOM Data Dictionary
    Part 5 of the standard
    Part 3 of the DICOM standard
    Part 4 here
    this list

    UUIDs and GUIDs

    As described in the Data Versioning section, a UUID identifies a particular version of an IDC data object. Thus, there is a UUID for every version of every DICOM instance in IDC hosted data. An IDC BigQuery manifest optionally includes the UUID (called a crdc_instance_uuid) of each instance (version) in the cohort.

    UIDs and UUIDs explained with an example

    Consider an instance in the CPTAC-CM collection that has this SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.171941254.777277241.1640849481094.35.0\

    It is in a series having this SeriesInstanceUID: 1.3.6.1.4.1.5962.99.1.171941254.777277241.1640849481094.2.0

    The instance and series were added to the IDC Data set in IDC version 7. At that point, the instance was assigned UUID: 5dce0cf0-4694-4dff-8f9e-2785bf179267 and the series was assigned this UUID: e127d258-37c2-47bb-a7d1-1faa7f47f47a

    In IDC version 10, a revision of this instance was added (keeping its original SOPInstanceUID), and assigned this UUID: 21e5e9ce-01f5-4b9b-9899-a2cbb979b542

    Because this instance was revised, the series containing it was implicitly revised. The revised series was thus issued a new UUID: ee34c840-b0ca-4400-a6c8-c605cef17630

    Thus, the initial version of this instance has this file name: e127d258-37c2-47bb-a7d1-1faa7f47f47a/5dce0cf0-4694-4dff-8f9e-2785bf179267.dcm and the revised version of the instance has the this file name: ee34c840-b0ca-4400-a6c8-c605cef17630/21e5e9ce-01f5-4b9b-9899-a2cbb979b542.dcm

    Both versions of the instance are in both AWS and GCS buckets.

    Note that GCS and AWS bucket names are different. In fact, DICOM instance data is distributed across multiple buckets in both GCS and AWS. We will discuss obtaining GCS and AWS URLs more a little later.

    It is possible that a series is revised, but one or more instances in the series are not revised. For example if a single instance in a series (assume the series has a uuid <series_uuid_old>) is revised, that instance gets a new UUID, and there is implicitly a new version of the series, which gets a new UUID (call it <series_uuid_new>). If an instance that is not revised has UUID <invariant_instance_uuid>, then its corresponding file in cloud storage will the have name: <series_uuid_old>/<invariant_instance_uuid>.dcm in the "old" series. But, because that same instance version is in the revised series, there must also be a file in cloud storage named: <series_uuid_new>/<invariant_instance_uuid>.dcm The result will be two distinct but identical files.

    Utilities like gsutil, s3 and s5cmd "understand" the implied hierarchy in these file names. Thus the series UUID now acts like the name of a directory that contains all the instance versions in the series version:

    and similarly for AWS buckets, thus making it easy to transfer all instances in a series from the cloud.

    Because file names are more or less opaque, the user will not typically select files by listing the contents of a bucket. Instead, one should use either the IDC Portal or IDC BigQuery tables to identify items of interest and, then, generate a manifest of objects that can be passed to a utility like s5cmd.

    Resolving CRDC Globally Unique Identifiers (GUIDs)

    From the specification:

    "The Data Repository Service (DRS) API provides a generic interface to data repositories so data consumers, including workflow systems, can access data objects in a single, standard way regardless of where they are stored and how they are managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID."

    Each such UUID can be used to form a that has been indexed by the (DCF), and can be used to access data that defines that object. In particular this data includes the GCS and AWS URLs of the DICOM instance file. Though the GCS or AWS URL of an instance might change over time, the UUID of an instance can always be resolved to obtain its current URLs. Thus, for long term curation of data, it is recommended to record instance UUIDs.

    The data object returned by the server is a GA4GH DRS :

    This is a typical IDC instance UUID: 641121f1-5ca0-42cc-9156-fb5538c14355 of a (version of a) DICOM instance, and this is the corresponding DRS ID: dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355

    A DRS ID can be resolved by appending it to the following URL, which is the resolution service within CRDC: https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/ . For example, the following curl command:

    >> curl https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355

    returns this DrsObject:

    AS can be seen, the access_methods component in the returned DrsObject includes a URL for each of the corresponding files in Google GCS and AWS S3.

    Publications

    Publications by the IDC team

    1. Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180

    2. Weiss, J., Bernatz, S., Johnson, J., Thiriveedhi, V., Mak, R. H., Fedorov, A., Lu, M. T. & Aerts, H. J. W. Opportunistic assessment of steatotic liver disease in lung cancer screening eligible individuals. J. Intern. Med. (2025).

    3. Thiriveedhi, V. K., Krishnaswamy, D., Clunie, D., Pieper, S., Kikinis, R. & Fedorov, A. Cloud-based large-scale curation of medical imaging data using AI segmentation. Research Square (2024).

    4. Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S., Aerts, H. J. W. L., Homeyer, A., Lewis, R., Akbarzadeh, A., Bontempi, D., Clifford, W., Herrmann, M. D., Höfener, H., Octaviano, I., Osborne, C., Paquette, S., Petts, J., Punzo, D., Reyes, M., Schacherer, D. P., Tian, M., White, G., Ziegler, E., Shmulevich, I., Pihl, T., Wagner, U., Farahani, K. & Kikinis, R. NCI Imaging Data Commons. Cancer Res. 81, 4188–4193 (2021).

    5. Gorman, C., Punzo, D., Octaviano, I., Pieper, S., Longabaugh, W. J. R., Clunie, D. A., Kikinis, R., Fedorov, A. Y. & Herrmann, M. D. Interoperable slide microscopy viewer and annotation tool for imaging data science and computational pathology. Nat. Commun. 14, 1–15 (2023).

    6. Bridge, C. P., Gorman, C., Pieper, S., Doyle, S. W., Lennerz, J. K., Kalpathy-Cramer, J., Clunie, D. A., Fedorov, A. Y. & Herrmann, M. D. Highdicom: a Python Library for Standardized Encoding of Image Annotations and Machine Learning Model Outputs in Pathology and Radiology. J. Digit. Imaging 35, 1719–1737 (2022).

    7. Schacherer, D. P., Herrmann, M. D., Clunie, D. A., Höfener, H., Clifford, W., Longabaugh, W. J. R., Pieper, S., Kikinis, R., Fedorov, A. & Homeyer, A. The NCI Imaging Data Commons as a platform for reproducible research in computational pathology. Comput. Methods Programs Biomed. 107839 (2023). doi:

    8. Krishnaswamy, D., Bontempi, D., Thiriveedhi, V., Punzo, D., Clunie, D., Bridge, C. P., Aerts, H. J., Kikinis, R. & Fedorov, A. Enrichment of the NLST and NSCLC-Radiomics computed tomography collections with AI-derived annotations. arXiv [cs.CV] (2023). at <>

    9. Bontempi, D., Nuernberg, L., Pai, S., Krishnaswamy, D., Thiriveedhi, V., Hosny, A., Mak, R. H., Farahani, K., Kikinis, R., Fedorov, A. & Aerts, H. J. W. L. End-to-end reproducible AI pipelines in radiology using the cloud. Nat. Commun. 15, 6931 (2024).

    10. Krishnaswamy, D., Bontempi, D., Thiriveedhi, V. K., Punzo, D., Clunie, D., Bridge, C. P., Aerts, H. J. W. L., Kikinis, R. & Fedorov, A. Enrichment of lung cancer computed tomography collections with AI-derived annotations. Sci. Data 11, 1–15 (2024).

    11. Murugesan, G. K., McCrumb, D., Aboian, M., Verma, T., Soni, R., Memon, F., Farahani, K., Pei, L., Wagner, U., Fedorov, A. Y., Clunie, D., Moore, S. & Van Oss, J. The AIMI Initiative: AI-Generated Annotations for Imaging Data Commons Collections. arXiv [eess.IV] (2023). at

    Publications referencing IDC (a subset)

    See the full list, as curated by Google Scholar, .

    1. Pai, S., Bontempi, D., Hadzic, I., Prudente, V., Sokač, M., Chaunzwa, T. L., Bernatz, S., Hosny, A., Mak, R. H., Birkbak, N. J. & Aerts, H. J. W. L. Foundation model for cancer imaging biomarkers. Nature Machine Intelligence 6, 354–367 (2024).

    2. Murugesan, G. K., McCrumb, D., Aboian, M., Verma, T., Soni, R., Memon, F. & Van Oss, J. The AIMI initiative: AI-generated annotations for imaging data commons collections. arXiv [eess.IV] (2023). at <>

    3. Kulkarni, P., Kanhere, A., Yi, P. H. & Parekh, V. S. Text2Cohort: Democratizing the NCI Imaging Data Commons with natural language cohort discovery. arXiv [cs.LG] (2023). at <>

    ACCESS allocations

    Advanced Cyberinfrastructure Coordination Ecosystem (ACCESS) is a program supported by the US National Science Foundation (NSF) to provide educators with free and convenient access to advanced computational resources.

    If you have a university email account, you can complete a relatively easy application process to receive an allocation of free credits that you can then use to create pre-configured GPU-enabled cloud-based linux virtual machines with desktop interface available via browser. You can use those machines, for example, to have a convenient access to an instance of 3D Slicer for experimenting with AI models, or for training DL networks.

    Example of the configurations available for the Ubuntu 22.04 base image
    GPU configurations available, from

    How to get started

    Follow these steps:

    1. Create an account and request an ACCESS allocation at this page: . There are 4 different levels, with each giving you a different number of “credits” that you use to create your VM instances. Each of these levels requires you to submit a different application. For the Explore ACCESS allocation (lowest tier), you need to write a simple abstract to justify why you needed these resources. Other tiers require more lengthy descriptions of what you’ll do with the ACCESS resources. In our experience, applications can be approved in as soon as a few days after submitting the application. You can be a PI and have multiple Co-PIs with you on the project, so you can all access the Jetstream2 resources.

    2. Once you get approved, your allocation is valid for a 12 month period, and you get half of the credits to start. To start using these credits you exchange them for Service Units (SUs) on different platforms. We experimented with the one called , which provides easy interface to cloud-based computing resources. If you want to use JetStream2, you will need to exchange your ACCESS credit allocation for JetStream2 SUs here: . Usually this exchange is approved within a few days if not less.

    Why we recommend ACCESS allocations

    • It is free for academics!

    • Very easy to set up. As of writing, there is no similar product available from Google Cloud, which would provide desktop access to a VM with a comparable ease of access. AWS provides , but we have yet to experiment to evaluate it.

    • You can do a lot with the basic credit allocation! Entry-level allocations can be on the order of 100,000s, while the burn rate is, for example, 8 SUs/hour for a medium size VM (8 CPUs/30 GB RAM). As a reference:

    Acknowledging ACCESS and JetStream2

    • JetStream2:

    • ACCESS:

    s5cmd

    Make sure you first review the Downloading data section to learn about the simpler interfaces that provide access to IDC data.

    SlicerIDCBrowser and idc-index discussed in the previous section aim to provide simple interfaces for data access. In some situations, however, you may want to build cohorts using metadata attributes that are not exposed in those tools. In such cases you will need to use BigQuery interface to form your cohort and build a file manifest that you can then use with s5cmd to download the files.

    With this approach you will follow a a 2-step process covered on this page:

    • Step 1: create a manifest - a list of the storage bucket URLs of the files to be downloaded. if you want to download the content of the cohort defined in the IDC Portal, , and proceed to Step 2. Alternatively, you can use BigQuery SQL as discussed below to generate the manifest;

    • Step 2: given the manifest, download files to your computer or to a cloud VM using s5cmd command line tool.

    To learn more about using Google BigQuery SQL with IDC, check out part 3 of our , which demonstrates how to query and download IDC data!

    Step 1: Create the manifest

    You will need to complete prerequisites described in in order to be able to execute the manifest generation queries below!

    A download manifest can be created using either the IDC Portal, or by executing a BQ query. If you have generated a manifest using the IDC Portal, as discussed , proceed to Step 2! In the remainder of this section we describe creating a manifest from a BigQuery query.

    The BigQuery table discussed in can be used to subset the files you need based on the DICOM metadata attributes as needed, utilizing the SQL query interface. The gcs_url and aws_url columns contain Google Cloud Storage and AWS S3 URLs, respectively, that can be used to retrieve the files.

    Start with the query templates provided below, modify them based on your needs, and save the result in a file query.txt. The specific values for PatientID, SeriesInstanceUID, StudyInstanceUID are chosen to serve as examples.

    You can use IDC Portal to identify items of interest, or you can use SQL queries to subset your data using any of the DICOM attributes. You are encouraged to use the to test your queries and explore the data first!

    Queries below demonstrate how to get the Google Storage URLs to download cohort files.

    If you want to download the files corresponding to the cohort from GCP instead of AWS, substitute series_aws_url for series_gcp_url in the SELECT statement of the query, such as in the following SELECT clause:

    Next, use a Google Cloud SDK bq query command (from command line) to run the query and save the result into a manifest file, which will be the list of GCP URLs that can be used to download the data.

    Make sure you adjust the --max_rows parameter in the queries above to be equal or exceed the number of rows in the result of the query, otherwise your list will be truncated!

    For any of the queries, you can get the count of rows to confirm that the --max_rows parameter is sufficiently large (use the to run these queries):

    You can also get the total disk space that will be needed for the files that you will be downloading:

    Step 2: Download the files defined by the manifest

    is a very fast S3 and local filesystem execution tool that can be used for accessing IDC buckets and downloading files both from GCS and AWS.

    Install s5cmd following the instructions in , or if you have Python pip on you system you can just do pip install s5cmd --upgrade.

    You can verify if your setup was successful by running the following command: it should successfully download one file from IDC.

    Once s5cmd is installed, you can use s5cmd run command to download the files corresponding to the manifest.

    If you defined manifest that references AWS buckets:

    If you defined manifest that references GCP buckets, you will need to specify GCS endpoint:

    3D Slicer desktop VM

    These instructions provide a reference example of how you can start up a traditional workstation desktop on a VM instance to run interactive applications like 3D Slicer and access the desktop via a conventional web browser. Two options are shown, either with or without a GPU. Note that GPUs are significantly more expensive so only enable it if needed. For 3D Slicer the main benefit of a GPU is for rendering, so operations like dicom processing and image segmentation are quite usable without a GPU. Even volume rendering is fairly usable if you choose the CPU rendering option. Other operations such as training machine learning models may benefit from an appropriate GPU.

    A motivation for using desktop applications like 3D Slicer on a VM is that their computing power close to the data, so heavy network operations such as storage bucket or dicom store access may be significantly faster than accessing the same resources from a remote machine. They are also highly configurable, so that you can easily allocate the number of cores or memory needed for a given task. Note that can even change these configurations so that, for example, you can shut down the machine, add a GPU and more memory, and then boot the same instance and pick up where you left off.

    In addition, these desktops are persistent in the sense that you can start a task such as labeling data for a machine learning task, disconnect your ssh session, and reconnect later to pick up where you left off without needing to restart applications or reload data. This can be convenient when tending long-running computations, accessing your work from different computers, or working on a network that sometimes disconnects.

    The instructions here are just a starting point. There are many cloud options available to manage access scopes for the service accounts, allocate disks, and configure other options.

    In order to follow these instructions, you will need to have a project that has billing enabled. Please follow the instructions in to set up billing.

    With a GPU

    You can launch a VM with a GPU in your project with a command like this in your local terminal (replace vm-name with a name for your machine):

    Once it boots in about 90 seconds you can type:

    Then you can open to get to your desktop.

    Without a GPU

    You can launch a VM without a GPU in your project with a command like this in your local terminal (replace vm-name with a name for your machine):

    Once it boots in about 90 seconds you can type:

    On the remote machine run:

    Each time you reboot the machine, run this:

    Then you can open to get to your desktop.

    Note

    This effort is a work in progress with a minimal desktop environment. Further refinement is expected and community contributions would be welcome! A description of the background and possible evolution of this work is .

    Dashboard for your cohort

    Use IDC-provided Looker Studio template to build a custom dashboard for your cohort

    You can use this Looker Studio template to build a custom dashboard for your own cohort, which will look like the screenshot below in three relatively simple steps.

    Screenshot of the DataStudio dashboard template you can use to explore your cohort.

    Step 1: Prepare the manifest BigQuery table

    Export the cohort manifest as a BigQuery table, and take note of the location of the resulting table.

    Name of the BQ table you will need is highlighted with the red rectangle.

    Step 2: Duplicate the template

    Open the dashboard template following this link: http://bit.ly/3jdCmON, and click "Use template" to make a copy of the dashboard.

    When prompted, do not change the default options, and click "Copy Report".

    Step 3: Configure data source

    Select "Resource > Manage added data sources"

    Select "Edit" action:

    Update the custom query as instructed. This will select all of the DICOM metadata available for the instances in your cohort.

    For example, if the location of your manifest table is canceridc-user-data.user_manifests.manifest_cohort_101_20210127_213746, the custom query that will join your manifest with the DICOM metadata will be the following:

    Once you updated the query, click "Reconnect" in the upper right corner.

    Make sure you select a valid Billing Project that you can use to support the queries!

    Accept the below, if prompted (you may also be notified about changes to the schema of the table, so the message may be different).

    Click "Done" on the next screen:

    Click "Close" on the next screen:

    You are Done! The dashboard for your cohort is now live: you can "View" it to interact with the content, you can edit it to explore additional attributes in the cohort, and you can choose to keep it private or share with a link!

    Schematic summary of the IDC data ingestion and release process.

    Data versioning

    Summary

    IDC updates its data offering at the intervals of 2-4 months, with the data releases timing driven by the availability of new data, updates of existing data, introduction of new capabilities and various priority considerations. You can see the historical summary of IDC releases in .

    When you work with IDC data at any given time, you should be aware of the data release version. If you build cohorts using filters or queries, the result of those queries will change as the IDC content is evolving. Building queries that refer to the specific data release version will ensure that the result is the same.

    Here is how you can learn what version of IDC data you are interacting with, depending on what interface to the data you are using:

    Organization of data in v1 (deprecated)

    v1 of IDC followed a different layout of data than subsequent version. Since the corresponding items are still available, we document it here for reference.

    IDC approach to storage and management of DICOM data is relying on the Google Cloud Platform . We maintain three representations of the data, which are fully synchronized and correspond to the same dataset, but are intended to serve different use cases.

    In order to access the resources listed below, it is assumed you have completed the to access Google Cloud console!

    All of the resources listed below are accessible under the .

    pip install idc-index --upgrade
    idc download manifest_file.s5cmd
    $ idc download pseudo_phi_dicom_data
    2024-09-04 17:59:50,944 - Downloading from IDC v18 index
    2024-09-04 17:59:50,952 - Identified matching collection_id: ['pseudo_phi_dicom_data']
    2024-09-04 17:59:50,959 - Total size of files to download: 1.27 GB
    2024-09-04 17:59:50,959 - Total free space on disk: 29.02233088GB
    2024-09-04 17:59:51,151 - Not using s5cmd sync as the destination folder is empty or sync or progress bar is not requested
    2024-09-04 17:59:51,156 - Initial size of the directory: 0 bytes
    2024-09-04 17:59:51,156 - Approximate size of the files that need to be downloaded: 1274140000.0 bytes
    Downloading data:   7%|█████                                                                     | 86.3M/1.27G [00:13<03:06, 6.36MB/s]
    # download all files for patient ID 100002
    $ idc download 100002
    # download all files for DICOM StudyInstanceUID 1.2.840.113654.2.55.192012426995727721871016249335309434385
    $ idc download 1.2.840.113654.2.55.192012426995727721871016249335309434385
    # download all files for DICOM SeriesInstanceUID 1.2.840.113654.2.55.305538394446738410906709753576946604022
    $ idc download 1.2.840.113654.2.55.305538394446738410906709753576946604022
    from idc_index import index
    
    client = IDCClient()
    
    # get identifiers of all collections available in IDC
    all_collection_ids = client.get_collections()
    
    # download files for the specific collection, patient, study or series
    client.download_from_selection(collection_id="rider_pilot", \
                                   downloadDir="/some/dir")
                                   
    client.download_from_selection(patientId="rider_pilot", \
                                   downloadDir="/some/dir")
    
    client.download_from_selection(studyInstanceUID= \
         "1.3.6.1.4.1.14519.5.2.1.6279.6001.175012972118199124641098335511", \
         downloadDir="/some/dir")
                                   
    client.download_from_selection(seriesInstanceUID=\
         "1.3.6.1.4.1.14519.5.2.1.6279.6001.141365756818074696859567662357", \
         downloadDir="/some/dir")
                                   
    

    Jiang, P., Sinha, S., Aldape, K., Hannenhalli, S., Sahinalp, C. & Ruppin, E. Big data in basic and translational cancer research. Nat. Rev. Cancer 22, 625–639 (2022). http://dx.doi.org/10.1038/s41568-022-00502-0

  • Schapiro, D., Yapp, C., Sokolov, A., Reynolds, S. M., Chen, Y.-A., Sudar, D., Xie, Y., Muhlich, J., Arias-Camison, R., Arena, S., Taylor, A. J., Nikolov, M., Tyler, M., Lin, J.-R., Burlingame, E. A., Human Tumor Atlas Network, Chang, Y. H., Farhi, S. L., Thorsson, V., Venkatamohan, N., Drewes, J. L., Pe’er, D., Gutman, D. A., Herrmann, M. D., Gehlenborg, N., Bankhead, P., Roland, J. T., Herndon, J. M., Snyder, M. P., Angelo, M., Nolan, G., Swedlow, J. R., Schultz, N., Merrick, D. T., Mazzili, S. A., Cerami, E., Rodig, S. J., Santagata, S. & Sorger, P. K. MITI minimum information guidelines for highly multiplexed tissue images. Nat. Methods 19, 262–267 (2022). http://dx.doi.org/10.1038/s41592-022-01415-4

  • Wahid, K. A., Glerean, E., Sahlsten, J., Jaskari, J., Kaski, K., Naser, M. A., He, R., Mohamed, A. S. R. & Fuller, C. D. Artificial intelligence for radiation oncology applications using public datasets. Semin. Radiat. Oncol. 32, 400–414 (2022). http://dx.doi.org/10.1016/j.semradonc.2022.06.009

  • Hartley, M., Kleywegt, G. J., Patwardhan, A., Sarkans, U., Swedlow, J. R. & Brazma, A. The BioImage Archive - Building a Home for Life-Sciences Microscopy Data. J. Mol. Biol. 167505 (2022). doi:10.1016/j.jmb.2022.167505 http://dx.doi.org/10.1016/j.jmb.2022.167505

  • Diaz-Pinto, A., Alle, S., Nath, V., Tang, Y., Ihsani, A., Asad, M., Pérez-García, F., Mehta, P., Li, W., Flores, M., Roth, H. R., Vercauteren, T., Xu, D., Dogra, P., Ourselin, S., Feng, A. & Cardoso, M. J. MONAI Label: A framework for AI-assisted interactive labeling of 3D medical images. arXiv [cs.HC] (2022). at <http://arxiv.org/abs/2203.12362>

  • https://doi.org/10.1111/joim.20053
    https://doi.org/10.21203/rs.3.rs-4351526/v1
    http://dx.doi.org/10.1158/0008-5472.CAN-21-0950
    http://dx.doi.org/10.1038/s41467-023-37224-2
    http://dx.doi.org/10.1007/s10278-022-00683-y
    10.1016/j.cmpb.2023.107839
    http://arxiv.org/abs/2306.00150
    http://dx.doi.org/10.1038/s41467-024-51202-2
    https://www.nature.com/articles/s41597-023-02864-y
    http://arxiv.org/abs/2310.14897
    here
    https://www.nature.com/articles/s42256-024-00807-9
    http://arxiv.org/abs/2310.14897
    http://arxiv.org/abs/2305.07637

    Once you get the SU’s you can access JetStream interface to configure and create VMs here: https://jetstream2.exosphere.app/ (you can lean more about available configurations from this documentation page: https://docs.jetstream-cloud.org/general/vmsizes/).

  • Once you created a VM and your setup is complete, it’s very easy to connect to your VMs through ssh or Web desktop interface.

  • it takes about 1 hour to build Slicer application from scratch on a medium-sized VM using 7 threads

  • it took ~7 days and ~5000 SUs to train the model in this repository (see summary in the slides here) using g3.large VM configuration

  • Geared to help you save! Unlike the VMs you get from the commercial providers, JetStream VMs can be shelved. Once a VM is shelved, you spend zero SUs for keeping it around (in comparison, you will keep paying for the disk storage of your GCP VMs even when they are turned off).

  • Customer support is excellent! We received responses within 1-2 days. On some occasions, we observed glitches with Web Desktop, but those could often be resolved by restarting the VM.

  • https://docs.jetstream-cloud.org/alloc/overview/
    JetStream2
    https://docs.jetstream-cloud.org/general/access/
    AppStream2
    https://jetstream-cloud.org/research/index.html#cite-jetstream
    https://access-ci.org/about/acknowledging-access/
    https://docs.jetstream-cloud.org/general/vmsizes/#jetstream2-gpu
    Although IDC data is stored in DICOM format, it can be converted into alternative research representations using open-source tools.
    GA4GH Data Repository Service API
    DRS ID
    NCI CRDC Data Commons Framework
    DrsObject
    export the s5cmd manifest fist
    "Getting started" tutorial series
    Getting started with GCP
    here
    dicom_all
    this documentation article
    BigQuery console
    BigQuery console
    s5cmd
    https://github.com/peak/s5cmd#installation
    Getting started with GCP
    http://localhost:6080/vnc.html?autoconnect=true
    http://localhost:6080/vnc.html?autoconnect=true
    in this document
    AWS bucket example
    s5cmd --no-sign-request ls s3://idc-open-data/e127d258-37c2-47bb-a7d1-1faa7f47f47a/5dce0cf0-4694-4dff-8f9e-2785bf179267.dcm
    2023-04-09 11:49:55    3308170 5dce0cf0-4694-4dff-8f9e-2785bf179267.dcm
    GCS bucket example
    s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com ls s3://public-datasets-idc/e127d258-37c2-47bb-a7d1-1faa7f47f47a/5dce0cf0-4694-4dff-8f9e-2785bf179267.dcm
       3308170  2023-04-01T01:21:31Z  gs://public-datasets-idc/e127d258-37c2-47bb-a7d1-1faa7f47f47a/5dce0cf0-4694-4dff-8f9e-2785bf179267.dcm
    TOTAL: 1 objects, 3308402 bytes (3.16 MiB)
    s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com ls s3://public-datasets-idc/ee34c840-b0ca-4400-a6c8-c605cef17630/
    2023/04/01 03:00:34           1719696 18c206a6-2db4-45cd-89a2-e83273a38f42.dcm
    2023/04/01 03:00:36           3308402 21e5e9ce-01f5-4b9b-9899-a2cbb979b542.dcm
    2023/04/01 01:50:29          29477804 3cfc3da3-8389-49f6-a6ee-6ba6406f639e.dcm
    2023/04/01 01:50:27         214715792 428590a0-816c-4041-a3ae-676a68411794.dcm
    2023/04/01 03:00:30           2301902 57ff4432-c29d-4ccf-964c-0b421302add3.dcm
    2023/04/01 03:00:33           3540080 77ff406a-a236-4846-83dd-ae3bd7a6bc71.dcm
    {
      "access_methods": [
        {
          "access_id": "gs",
          "access_url": {
            "url": "gs://public-datasets-idc/cc9c8541-949d-48d9-beaf-7028aa4906dc/641121f1-5ca0-42cc-9156-fb5538c14355.dcm"
          },
          "region": "",
          "type": "gs"
        },
        {
          "access_id": "s3",
          "access_url": {
            "url": "s3://idc-open-data/cc9c8541-949d-48d9-beaf-7028aa4906dc/641121f1-5ca0-42cc-9156-fb5538c14355.dcm"
          },
          "region": "",
          "type": "s3"
        }
      ],
      "aliases": [],
      "checksums": [
        {
          "checksum": "f338e8c5e3d8955d222a04d5f3f6e2b4",
          "type": "md5"
        }
      ],
      "created_time": "2020-06-01T00:00:00",
      "description": "DICOM instance",
      "form": "object",
      "id": "dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355",
      "index_created_time": "2023-06-26T18:27:45.810110",
      "index_updated_time": "2023-06-26T18:27:45.810110",
      "mime_type": "application/json",
      "name": "1.3.6.1.4.1.14519.5.2.1.7695.1700.277743171070833720282648319465",
      "self_uri": "drs://dg.4DFC:641121f1-5ca0-42cc-9156-fb5538c14355",
      "size": 135450,
      "updated_time": "2020-06-01T00:00:00",
      "version": "IDC version: 1"
    }
    # Select all files for a given PatientID
    SELECT DISTINCT(CONCAT(series_aws_url, "* .")) 
    FROM `bigquery-public-data.idc_current.dicom_all`
    WHERE PatientID = "LUNG1-001"
    # Select all files for a given collection
    SELECT DISTINCT(CONCAT(series_aws_url, "* .")) 
    FROM `bigquery-public-data.idc_current.dicom_all`
    WHERE collection_id = "nsclc_radiomics"
    # Select all files for a given DICOM series
    SELECT DISTINCT(CONCAT(series_aws_url, "* .")) 
    FROM `bigquery-public-data.idc_current.dicom_all`
    WHERE SeriesInstanceUID = "1.3.6.1.4.1.32722.99.99.298991776521342375010861296712563382046"
    # Select all files for a given DICOM study
    SELECT DISTINCT(CONCAT(series_aws_url, "* .")) 
    FROM `bigquery-public-data.idc_current.dicom_all`
    WHERE StudyInstanceUID = "1.3.6.1.4.1.32722.99.99.239341353911714368772597187099978969331"
    SELECT DISTINCT(CONCAT(series_gcp_url, "* ."))
    bq query --use_legacy_sql=false --format=csv --max_rows=20000000 < query.txt > manifest.txt
    # count the number of rows
    SELECT COUNT(DISTINCT(crdc_series_uuid)) 
    FROM bigquery-public-data.idc_current.dicom_all 
    WHERE collection_id = "nsclc_radiomics"
    # calculate the disk size in GB needed for the files to be downloaded
    SELECT ROUND(SUM(instance_size)/POW(1024,3),2) as size_GB 
    FROM bigquery-public-data.idc_current.dicom_all 
    WHERE collection_id = "nsclc_radiomics"
    s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com cp s3://public-datasets-idc/cdac3f73-4fc9-4e0d-913b-b64aa3100977/902b4588-6f10-4342-9c80-f1054e67ee83.dcm .
    s5cmd --no-sign-request --endpoint-url=https://s3.amazonaws.com run manifest_file_name
    s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run manifest_file_name
    export VMNAME=vm-name
    gcloud compute instances create ${VMNAME} \
      --machine-type=n1-standard-8 \
      --accelerator=type=nvidia-tesla-k80,count=1 \
      --image-family=slicer \
      --image-project=idc-sandbox-000 \
      --boot-disk-size=200GB \
      --boot-disk-type=pd-balanced \
      --maintenance-policy=TERMINATE
    gcloud compute ssh ${VMNAME} -- -L 6080:localhost:6080
    export VMNAME=vm-name
    gcloud compute instances create ${VMNAME} \
      --machine-type=n1-standard-8 \
      --image-family=slicer \
      --image-project=idc-sandbox-000 \
      --boot-disk-size=200GB \
      --boot-disk-type=pd-balanced \
      --maintenance-policy=TERMINATE
    gcloud compute ssh ${VMNAME} -- -L 6080:localhost:6080
    # these are on-time installs
    sudo systemctl stop novnc
    sudo apt-get update
    sudo apt-get -y install tigervnc-standalone-server websockify
    vncserver -xstartup xfce4-session 
    # here you will be prompted for a password for vnc if you haven't already
    sudo systemctl stop novnc
    nohup websockify --web /opt/novnc/noVNC/ 6080 localhost:5901 &

    IDC Portal: data version and release date are displayed in the summary strip

  • idc-index: use get_idc_version()function

    • BigQuery: within bigquery-public-dataproject, idc_currentdataset contains table "views" to effectively provide an alias for the latest IDC data release. To find the actual IDC data release number, expand the list of datasets under bigquery-public-dataproject, and search for the ones that follow the pattern `idc_v<number>`. The one with the largest number corresponds to the latest released version, and will match the content in idc_current (related Google bug here).

    • 3D Slicer / SlicerIDCBrowser: version information is provided in the SlicerIDCBrowser module top panel, and in the pop-up window title.

    Implementation details

    The IDC obtains curated DICOM radiology, pathology and microscopy image and analysis data from The Cancer Imaging Archive (TCIA) and additional sources. Data from all these sources evolves over time as new data is added (common), existing files are corrected (rare), or data is removed (extremely rare).

    Users interact with IDC using one of the following interfaces to define cohorts, and then perform analyses on these cohorts:

    • IDC Portal directly or using IDC API: while this approach is most convenient, it allows searching using a small subset of attributes, defines cohorts only in terms of cases that meet the defined criteria, and has very limited options for combining multiple search criteria

    • IDC BigQuery tables via SQL interface: this approach is most powerful, as it allows the use of any of the DICOM metadata attributes to define the cohort, while leveraging the expressiveness of SQL in defining the selection logic, and allows to define cohort at any level of the data model hierarchy (i.e., instances, series, studies or cases)

    The goal of IDC versioning is to create a series of "snapshots” over time of the entirety of the evolving IDC imaging dataset, such that searching an IDC version according to some criteria (creating a cohort) will always identify exactly the same set of objects. Here “identify” particularly means providing URLs or other access methods to the corresponding physical data objects.

    In order to reproduce the result of such analysis, it must be possible to precisely recreate a cohort. For this purpose an IDC cohort as defined in the Portal is specified and saved as a filter applied against a specified IDC data version. Alternatively, the cohort can be defined as an SQL query, or as a list of unique identifiers selecting specific files within a defined data release version.

    Because an IDC version exactly defines the set of data against which the filter/query is applied, and because all versions of all data, except data removed due to PHI/PII concerns, should continue to be available, a cohort is therefore persistent over the course of the evolution of IDC data.

    There are various reasons that can cause modification of the existing collections in IDC:

    • images for new patients can be added to an existing collections;

    • additional DICOM series are sometimes added to a DICOM study over time (i.e., those that contain new annotations or analysis results);

    • a series may be added or removed from an existing study;

    • metadata of an existing instance might be corrected (which may or may not lead to an update of the DICOM SOPInstanceUID corresponding to the instance).

    These and other possible changes mean that DICOM instances, series and studies can change from one IDC data version to the next, while their DICOM UIDs remain unchanged. This motivates the need for maintaining versioning of the DICOM entities.

    Because DICOM SOPInstanceUIDs, SeriesInstanceUIDs or StudyInstanceUIDs can remain invariant even when the composition of an instance, series or study changes, IDC assigns each version of each instance, series or study a UUID to uniquely identify it and differentiate it from other versions of the same DICOM object.

    It is very important to appreciate the difference between DICOM Unique Identifiers (UIDs) and CRDC Universally Unique Identifiers (UUIDs) assigned at the various levels of the DICOM hierarchy:

    • DICOM UIDs are available as DICOM metadata attributes within the DICOM files for each DICOM Study, Series and Instance. Those UIDs follow the conventions of the DICOM UI Value Representation. DICOM UIDs are not versioned. I.e., if a DICOM study is augmented with a new DICOM series, DICOM StudyInstanceUID will not change. If an instance within an existing DICOM series is modified, DICOM SeriesInstanceUID or the SOPInstanceUID of the modified instance may or may not change.

    • IDC UUIDs are not available as DICOM metadata attributes - they are generated for the DICOM studies, series and instances at the time of data ingestion, and are available in the IDC BigQuery tables. IDC UUIDs are tied to the content of the entity they correspond to. I.e., if anything within a DICOM study/series/instance is changed in a given IDC data release, a new UUID at the corresponding level of data hierarchy will be generated, while the previous version will be indexed and available via the prior UUID.

    The data in each IDC version, then, can be thought of as some set of versioned DICOM instances, series and studies. This set is defined in terms of the corresponding set of instance UUIDs, series UUIDs and study UUIDs. This means that if, e.g., some version of an instance having UUID UUIDx that was in IDC version Vm is changed, a new UUID, UUIDy, will be assigned to the new instance version. Subsequent IDC versions, Vm+1, Vm+2, ... will include that new instance version identified by UUIDy unless and until that instance is again changed. Similarly if the composition of some series changes, either because an instance in the series is changed, or an instance is added or removed from that series, a new UUID is assigned to the new version of that series and identifies that version of the series in subsequent IDC versions. Similarly, a study is assigned a new UUID when its composition changes.

    A corollary is that only a single version of an instance, series or study is in an IDC version.

    Note that instances, series and studies do not have an explicit version number in their metadata. Versioning of an object is implicit in the associated UUIDs.

    As we will see in Organization of data, the UUID of a (version of an) instance, and the UUID of the (version of a) series to which it belongs, are used in forming the object (file) name of the corresponding GCS and AWS objects. In addition, each instance version has a corresponding GA4GH DRS object, identified by a GUID based on the instance version's UUID. Refer to the GA4GH DRS Objects section for details.

    this page
    Storage Buckets

    Storage Buckets are basic containers in Google Cloud that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage documentation here).

    Storage buckets are named using the format idc-tcia-<TCIA_COLLECTION_NAME>, where TCIA_COLLECTION_NAME corresponds to the collection name in the collections table here.

    Within the bucket, DICOM files are organized using the following directory naming conventions:

    dicom/<StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm

    where *InstanceUIDs correspond to the respective value of the DICOM attributes in the stored DICOM files.

    You can read about accessing GCP storage buckets from a Compute VM here.

    Egress of IDC data out of the cloud is free, since IDC data is participating in Google Public Datasets Program!

    Assuming you have a list of GCS URLs in gcs_paths.txt, you can download the corresponding items using the command below, substituting $PROJECT_ID with the valid GCP Project ID (see the complete example in this notebook):

    BigQuery Tables

    Google BigQuery (BQ) is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using standard SQL queries.

    IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions of how DICOM attributes of various types are converted into BQ form are covered in the Understanding the BigQuery DICOM schema Healthcare API documentation article.

    Due to the existing limitations of Google Healthcare API, not all of the DICOM attributes are extracted and are available in BigQuery tables. Specifically:

    • sequences that have more than 15 levels of nesting are not extracted (see https://cloud.google.com/bigquery/docs/nested-repeated) - we believe this limitation does not affect the data stored in IDC

    • sequences that contain around 1MiB of data are dropped from BigQuery export and RetrieveMetadata output currently. 1MiB is not an exact limit, but it can be used as a rough estimate of whether or not the API will drop the tag (this limitation was not documented as of writing this) - we know that some of the instances in IDC will be affected by this limitation. The fix for this limitation is targeted for sometime in 2021, according to the communication with Google Healthcare support.

    IDC users can access this table to conduct detailed exploration of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal.

    In addition to the DICOM metadata tables, we maintain several additional tables that curate metadata non-DICOM metadata (e.g., attribution of a given item to a specific collection and DOI, collection-level metadata, etc).

    • canceridc-data.idc.dicom_metadata: DICOM metadata for all of the data hosted by IDC

    • canceridc-data.idc.data_collections_metadata : collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in this table at TCIA

    • ``canceridc-data.idc.analysis_collections_metadata : collection-level metadata for the TCIA analysis collections hosted by IDC, for the most part corresponding to the content available in this table at TCIA

    In addition to the tables above, we provide the following BigQuery views (virtual tables defined by queries) that extract specific subsets of metadata, or combine attributes across different tables, for convenience of the users

    • canceridc-data.idc_views.dicom_all: DICOM metadata together with the collection-level metadata

    • ``canceridc-data.idc_views.segmentations: attributes of the segments stored in DICOM Segmentation object

    • canceridc-data.idc_views.measurement_groups: measurement group sequences extracted from the DICOM SR TID1500 objects

    • : coded evaluation results extracted from the DICOM SR TID1500 objects

    • : quantitative evaluation results extracted from the DICOM SR TID1500 objects

    DICOM Stores

    IDC MVP utilizes a single Google Healthcare DICOM store to host all of the collections. That store, however, is primarily intended to support visualization of the data using OHIF Viewer. At this time, we do not support access of the hosted data via DICOMWeb interface by the IDC users. See more details in the discussion here, and please comment about your use case if you have a need to access data via the DICOMweb interface.

    BigQuery tables external to IDC

    In addition to the DICOM data, some of the image-related data hosted by IDC is stored in additional tables. These include the following:

    • BigQuery TCGA clinical data: isb-cgc:TCGA_bioclin_v0.clinical_v1 . Note that this table is hosted under the ISB-CGC Google project, as documented here, and its location may change in the future!

    Healthcare API
    "getting started" steps
    canceridc-data GCP project

    Files and metadata

    We gratefully acknowledge Google Public Data Program and the AWS Open Data Sponsorship Program that support public hosting of IDC-curated content, and cover out-of-cloud egress fees!

    Let's start with the overall principles of how we organize data in IDC.

    IDC brings you (as of v21) over 85 TB of publicly available DICOM images and image-derived content. We share those with you as DICOM files, and those DICOM files are available in cloud-based storage buckets - both in Google and AWS.

    Sharing just the files, however, is not particularly helpful. With that much data, it is no longer practical to just download all of those files to later sort through them to select those you need.

    Think of IDC as a library, where each file is a book. With that many books, it is not feasible to read them all, or even open each one to understand what is inside. Libraries are of little use without a catalog!

    To provide you with a catalog of our data, along with the files, we maintain metadata that makes it possible to understand what is contained within files, and select the files that are of interest for your project, so that you can download just the files you need.

    In the following we describe organization of both the storage buckets containing the files, and the metadata catalog that you can use to select files that meet your needs. As you go over this documentation, please consider completing our - it will give you the opportunity to apply the knowledge you gain by reading this article while interacting with the data, and should help better understand this content.

    Storage Buckets

    Storage Buckets are basic containers in Google Cloud Storage and AWS S3 that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage documentation and in S3 ).

    All IDC DICOM file data for all IDC data versions across all of the are mirrored between Google Cloud Storage (GCS) and AWS S3 buckets.

    Currently all DICOM files are maintained in buckets that allow for free egress within or out of the cloud. This is enabled through the partnership of IDC with and the .

    Data category
    Cloud provider and bucket name

    Within each bucket files are organized in folders, each folder containing files corresponding to a single DICOM series. On ingestion, we assign each DICOM series and each DICOM instance a UUID, in order to be able to support (when needed). These UUIDs are available in our metadata indices, and are used to organized the content of the buckets: for each version of a DICOM instance having instance UUID instance_uuid in a version of a series version having UUID series_uuid, the file name is:

    <series_uuid>/<instance_uuid>.dcm

    Corresponding files have the same object name in GCS and S3, though the name of the containing buckets will be different.

    Metadata

    IDC metadata tables are provided to help you navigate IDC content and narrow down to the specific files that meet your research interests.

    As a step in data ingestion process (summarized ), IDC extracts all of the DICOM metadata, merges it with collection-level and some other metadata attributes not available from DICOM, ingests collection-level clinical tables and stores the result in Google BigQuery tables. Google is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using queries. We talk more about those in the subsequent sections of the documentation!

    Searching BigQuery tables requires you to sign in with a Google Account! If this poses a problem for you, there are several alternatives.

    idc-index provides access to the metadata aggregated at the DICOM series level. BigQuery and Parquet files provide metadata at the granularity of individual DICOM instances (files).

    Python idc-index package

    A small subset of most critical metadata attributes available in IDC BigQuery tables is extracted and made available via .

    If you are just starting with IDC, you can skip the details covering the content of BigQuery tables, and proceed to that will help you learn basics of searching IDC metadata using idc-index. But for the sake of example, you would select and download MR DICOM series available in IDC as follows.

    Parquet files available via a cloud bucket

    We export all the content available via BigQuery into Parquet () files available from our public AWS bucket! Using open-source tools such as DuckDB () you can query those files using SQL queries, without relying on BigQuery (although, running complex queries may require significant resources from your runtime environment!).

    The exported Parquet files are located in the IDC-maintained AWS idc-open-metadata bucket, which is updated every time IDC has a new data release. The exported tables are organized under the folder bigquery_export in that bucket, with each sub-folder corresponding to a BigQuery dataset.

    Assuming you have s5cmd installed, you can list the exported datasets as follows.

    As an example, the dicom_all table for the latest (current) IDC release will be in s3://idc-open-metadata/bigquery_export/idc_current/dicom_all (since the table is quite large, the export result is not a single file, but a folder containing thousands of Parquet files.

    You can query those tables/parquet files without downloading them, as shown in the following snippet. Depending on the query you are trying to execute, you may need a lot of patience!

    Visualizing images

    IDC integrates two different viewers, which will be used depending on the type of images being opened. Visualization of radiology images uses the open-source Open Health Imaging Foundation (OHIF) Viewer v3. The SliM Viewer is used for visualization of pathology and slide microscopy images. We customized both of those viewers slightly to add features specific to IDC. You can find all of those modifications in the respective forks under the IDC GitHub organization for OHIF and SliM viewers: OHIF Viewer fork and SliM Viewer fork. IDC Viewer is opened every time you click the "eye" icon in the study or series table of the IDC Portal.

    The OHIF and SliM viewers do not support 32 bit browsers.

    IDC Viewer is a "zero-footprint" client-side viewer: before you can see the image in the viewer, it has to be downloaded to your browser from the IDC DICOM stores. IDC Viewer communicates the data it receives through a proxy via the DICOMweb interface implemented in GCP Cloud Healthcare API.

    Currently, IDC Viewer proxy limits the amount of data that can be downloaded in one day to 137 GB per IP address, and enforces a total quota per day over all of the IP addresses. If the quota is exhausted, you will not be able to see any images in IDC Viewer until the limit is reset and instead will be redirected to ! We may adjust the current proxy limits in the future, and you are welcome to provide your feedback on the appropriateness of the current quota in .

    IDC radiology viewer functionality

    The main functions of the viewer are available via the toolbar controls shown below.

    The functionality supported by those tools should be self-explanatory, or can be discovered via quick experimentation.

    If you want to report a problem related to visualization of a specific study in the IDC Viewer, please use the "Debug Info" tool to collect debugging information. Please report the issue on the , including the entire content of the debugging information to help us investigate the issue.

    Visualizing radiology annotations

    IDC Viewer supports visualization of annotations stored as DICOM Segmentation objects (SEG), DICOM Radiotherapy Structure Sets (RTSTRUCT), and certain annotations stored in DICOM TID1500 Structured Reports. When available in a given study, you will see those modalities labeled as such in the left-hand panel of the viewer, as shown below. To load, double-click on the corresponding thumbnail in the series list in the left panel. After that you can open the navigation panel in the upper right corner to jump to the locations of the specific structure sets or segments, and to control their individual visibility.

    Note that certain modalities, such as Segmentation (SEG) and Real World Value Mapping (RWVM) objects, cannot be selected for visualization from the IDC Portal. SEG can only be viewed in the context of the image series segmented, and RWVM series are not viewable and will not show up in the left panel of the viewer.

    Below is an example of series objects that are not viewable at the series level.

    IDC pathology viewer functionality

    The IDC pathology viewer allows for interactive visualization of digital slide microscopy (SM) images. Left panel will show all digital slides available in a given study. Click on the thumbnail to open a specific slide. Right panel will summarize the information about slide image channels, and will list annotations, analysis results, and presentation states when available.

    IDC viewer support visualization of DICOM Segmentations (binary and fractional), Parametric Maps, planar annotations stored as DICOM TID1500 Structured Reports (SR modality) or bulk annotations (ANN modality).

    Visualizing slide microscopy annotations

    Whenever annotations or segmentations are available for the slide you opened, you will see the corresponding sections populated in the bottom-right portion of the window. Expand those to see what is available and to toggle visualization.

    Note that sometime only small regions of the image are annotated. You can double-click on the specific annotation group in the "Annotation Groups" section to zoom into the section of the image where that group is located!

    Configuring the IDC Viewer URL

    You can use IDC Viewer to visualize any of the suitable data in IDC. To configure the IDC Viewer URL, simply append StudyInstanceUID of a study available in IDC to the following prefix: (for the radiology viewer) and/ (for the digital pathology viewer). This will open the entire study in the viewer. You can also configure the URL to open specific series of the study, as defined by the list of SeriesInstanceUID items. When you open the IDC Viewer from the IDC Portal, the URLs of the pages will be populated following those conventions.

    Here are some specific examples, taken from the IDC Portal dashboard:

    • open entire study with the StudyInstanceUID1.3.6.1.4.1.14519.5.2.1.6279.6001.224985459390356936417021464571: .

    • open the specified subset of series from the study above:

    Digital pathology viewer uses a slightly different convention, as should be evident from this example URL:

    Deploying your own viewer

    You can share the viewer URLs if you want to refer to visualizations of the specific items from IDC. You can also use this functionality if you want to visualize specific items from your notebook or a custom dashboard (e.g., a Google DataStudio dashboard).

    If you want to visualize your own images, or if you would like to combine IDC images with the analysis results or annotations you generated, you do have several options:

    • You can use Google FireCloud to deploy v2 radiology or microscopy viewers as web applications, without having to use virtual machines or docker, and for free!

    • If you want to visualize images inside a Colab/Jupyter notebook - you can use

    Security considerations

    Computing on the Cloud

    Most of the same linux commands, scripts, pipelines/workflows, imaging software packages and docker containers that you run on your local machine can be executed on virtual machines on Google Cloud with experimentation and fine tuning.

    1. The basics and best practices on how to launch virtual machines (VMs) are described here in our documentation. NOTE: When launching VMs, please maintain the default firewall settings.

    2. Compute Engine instances can run the public images for Linux and Windows Server that Google provides as well as private custom images that you can or . Be careful as you spin up a machine, as larger machines cost you more. If you are not using a machine, shut it down. You can always restart it easily when you need it. Example use-case: You would like to run Windows-only genomics software package on the TCGA data. You can create a Windows based VM instance.

    3. More details on how to deploy docker containers on VMs are described here in Google’s documentation:

    4. A good way to estimate costs for running a workflow/pipeline on large data sets is to test them first on a small subset of data.

    5. There are different VM types depending on the sort of jobs you wish to execute. By default, when you create a VM instance, it remains active until you either stop it or delete it. The costs associated with VM instances are detailed here:

    6. If you plan on running many short compute-intensive jobs (for example indexing and sorting thousands of large bam files), you can execute your jobs on preemptible virtual machines. They are 80% cheaper than regular instances.

    Example use-cases:

    • Using preemptible VMs, researchers were able to quantify transcript levels on over 11K TGCA RNAseq samples for a total cost of $1,065.49. Tatlow PJ, Piccolo SR. . Scientific Reports 6, 39259

    • Also Broad’s popular variant caller pipeline, GATK, was designed to be able to run on preemptible VMs.

    • Google cloud computing can be estimated .

    Be Very Careful with Tokens containing passwords. They should NOT be moved to Github

    Because of the ability to see a of Github postings, if a password or bearer token is part of software code (e.g. notebook or colaboratory) it will be permanently available on Github. This is a security risk!! Do not put bearer tokens or other passwords into workbooks, instead refer to them in the code and place those in a location not posted into Github (if you do post it to GitHub, it then immediately becomes public, usable, and able to be stolen and used maliciously by others). If you do accidentally post one to Github: 1) immediately change passwords on your systems to remove the exposure provided by the exposed password, 2) let those who involved in the security of your system and data know, and 3) remedy your code-base so future saves to Github do not include passwords or tokens in your codebase.

    Storage on the Cloud

    The Google Cloud Platform offers a number of different storage options for your virtual machine instances:

    • By default, each virtual machine instance has a single boot persistent disk that contains the operating system. The default size is 10GB but can be adjusted up to 64TB in size. (Be careful! High costs here, spend wisely!)

    • Persistent disks are restricted to the zone where your instance is located.

    • Use persistent disks if you are running analyses that require low latency and high-throughput.

    1. Google Cloud Storage (GCS) buckets are the most flexible and economical storage option.

    • Unlike persistent disks, Cloud Storage buckets are not restricted to the zone where your instance is located.

    • Additionally, you can read and write data to a bucket from multiple instances simultaneously.

    • You can mount a GCS bucket to your VM instance when latency is not a priority or when you need to share data easily between multiple instances or zones. An example use-case: You want to slice thousands of bam files and save the resulting slices to share with a collaborator who has instances in another zone to use for downstream statistical analyses.

    You can save objects to GCS buckets including images, videos, blobs and unstructured data. A comparison table detailing the current pricing of Google’s storage options can be found here:

    Clinical data

    Check out our tutorial for a brief hands-on introduction into IDC clinical data! You can also see the high-level summary of the clinical data attributes accompanying IDC data in .

    Background

    By clinical data we refer to the broad spectrum of image-related data that may accompany images. Such data may include demographics of the patients, observations related to their clinical history (therapies, diagnoses, findings), lab tests, surgeries.

    Clinical data

    Check out our tutorial for a brief hands-on introduction into IDC clinical data!

    Background

    By clinical data we refer to the broad spectrum of image-related data that may accompany images. Such data may include demographics of the patients, observations related to their clinical history (therapies, diagnoses, findings), lab tests, surgeries.

    Using QuPath for visualization

    is a popular open-source desktop application for visualizing and annotating slide microscopy images. It is integrated with both OpenSlide and BioFormats libraries, and as of the current QuPath 0.5.1 version supports direct loading of DICOM Slide Microscopy images. In this tutorial you will learn how to use DICOM SM images from IDC with QuPath.

    Load a brightfield (RGB) DICOM slide

    First you will need to download a sample SM image from IDC to your desktop. To identify a sample image, you can navigate to the IDC Portal, copy SeriesInstanceUID value for a sample SM series you want to download. Given that UID, you can download the corresponding files using idc-index python package (see details in the documentation section describing data d).

    get

    Get links to the IDC API Swagger UI and IDC documentation

    Responses
    200

    API Description and link to SwaggerUI interface.

    500

    Server error message

    get
    /about
    get

    Returns a list of IDC data versions and activation dates

    Responses
    200

    A list of IDC data versions and metadata

    500

    Server error message

    get
    /versions
    get

    Returns a list of collections, and associated metadata, in the current IDC data version.

    Responses
    200

    A list of collections

    500

    Server error message

    get
    /collections
    get

    Returns a list of the analysis results, and associated metadata, in the current IDC data version

    Responses
    200

    A list of analysis results

    500

    Server error message

    get
    /analysis_results
    get

    Returns a list of 'filters', organized by data source (BQ table), for the current IDC data version. An IDC cohort is defined by a 'filterset', a set of (filter,[values]) pairs, and the IDC version against which the filterset is applied. The returned data is grouped by source (the BQ table that contains the corresponding filter values). For each filter, its data type and units, when available, are reported.

    Responses
    200

    A list of filters

    500

    Server error message

    get
    /filters
    get

    Return a list of the values accepted for a 'categorical filter'. A categorical filter is a filter having a data type of 'Categorical String' or 'Categorical Number'.

    Path parameters
    filterstringRequired

    Categorical filter whose values are to be returned

    Responses
    200

    A list of accepted values

    500

    Server error message

    get
    /filters/values/{filter}
    get

    Return a list of queryable manifest fields.

    Path parameters
    versionstringRequired

    IDC data version whose data is to be returned. If the version is 'current', the fields of the current IDC version are returned.

    Responses
    200

    A list of fields

    500

    Server error message

    get
    /fields/{version}
    post

    Returns a manifest of a 'previewed cohort' as defined by a specified filterset. The filterset is always applied to the current IDC version. The metadata to be returned in the manifest is configurable. A previewed cohort is not saved in the user's IDC account.

    Body
    get

    Returns the next page of a /cohorts/manifest/preview request, when additional data is available.

    Query parameters
    next_pagestringRequired

    The next_page token returned by a previous access of the /cohorts/manifest/preview endpoint. The token identifies the next page to be retrieved

    page_sizeintegerOptional

    The maximum number of rows to be returned. If the manifest contains additional rows, another 'next_page' token is returned.

    Responses
    200

    Preview manifest page

    202

    Timeout waiting for BQ job to complete

    500

    Server error message

    from idc_index import IDCClient
    
    idc_version = IDCClient.get_idc_version()
    $ cat gcs_paths.txt | gsutil -u $PROJECT_ID -m cp -I .
    SELECT
      all_of_idc.*
    FROM
      `canceridc-user-data.user_manifests.manifest_cohort_101_20210127_213746` AS my_cohort
    JOIN
      `bigquery-public-data.idc_current.dicom_all` AS all_of_idc
    ON
      all_of_idc.SOPInstanceUID = my_cohort.SOPInstanceUID
    canceridc-data.idc_views.qualitative_measurements
    canceridc-data.idc_views.quantitative_measurements
    create
    import from your existing systems
    deploying containers
    compute pricing
    preemptible vms
    A cloud-based workflow to quantify transcript-expression levels in public cancer compendia
    here
    history
    disks
    Block Storage:
    Object Storage:
    storage features

    Clinical data is often critical in understanding imaging data, and is essential for the development and validation of imaging biomarkers. However, such data is most often stored in spreadsheets that follow conventions specific to the site that collected the data, may not be accompanied by the dictionary defining the terms used in describing clinical data, and is rarely harmonized. This can be observed on the example of various collections ingested into IDC from The Cancer Imaging Archive (TCIA), such as the ACRIN 6698 collection.

    Not only are the terms used in the clinical data accompanying individual collection not harmonized, but the format of the spreadsheets is also collection-specific. In order to search and navigate clinical data, one has to parse those collection specific tables, and there is no interface to support searching across collections.

    With the release v11 of IDC, we make the attempt to lower the barriers for accessing clinical data accompanying IDC imaging collections. We parse collection-specific tables, and organize the underlying data into BigQuery tables that can be accessed using standard SQL queries. You can also see the summary of clinical data available for IDC collections in this dashboard.

    Clinical data BigQuery tables

    As of Version 11 IDC provides a public BigQuery dataset with clinical data associated with several of its imaging collections. The clinical data tables associated with a particular version are in the dataset bigquery-public-data.idc_<idc_version_number>_clinical. In addition the dataset bigquery-public-data.idc_current_clinical has an identically named view for each table in the BQ clinical dataset corresponding to the current IDC release.

    There are currently 130 tables with clinical data representing 70 different collections. Most of this data was curated from Excel and CSV files downloaded from The Cancer Imaging Archive (TCIA) wiki. For most collections, data is placed in a single table named <collection_id>_clinical, where <collection_id> is the name of the collection in a standardized format (i.e. the idc_webapp_collection_id column in the dicom_all view in the idc_current dataset).

    Collections from the ACRIN project have different types of clinical data spread across CSV files, and so this data is represented by several BigQuery tables. The clinical data for collections in the CPTAC program program is not curated from TCIA but instead is copied from a BigQuery table in the ISB-CGC project, which in turn was sourced from the Genomics Data Commons (GDC) api. Similarly clinical data for collections in the TCGA program is copied from the table tcga_clinical_rel9 in the idc_current dataset, which was also created using the Genomics Data Commons (GDC) api. Every clinical data table contains two fields we have introduced, dicom_patient_id and source_batch. dicom_patient_id is identical to the PatientID field in the DICOM files that correspond to the given patient. The dicom_patient_id value is determined by inspecting the patient column in the clinical data file. In some of the collections' clinical data, the patients are separated into different 'batches' i.e. different source files, or different sheets in the same Excel file. The source_batch field is an integer indicating the 'batch' for the given patient. For most collections, in which all patients data is found in the same location, the source_batch value is zero.

    Most of the clinical tables are legible by themselves. Tables from the ACRIN collection are an exception as the column names and some of the column values are coded. To provide for clarity and ease of use of all clinical data, we have created two metadata tables, table_metadata and column_metadata that provide information about the structure and provenance of all data in this dataset. table_metadata has table-level metadata about each clinical collection, while column_metadata has column-level metadata.

    Structure of thetable_metadata table:

    • collection_id (STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id field in the dicom_all view in the idc_current dataset.

    • table_name (STRING,NULLABLE) - name of the table

    • table_description (STRING,NULLABLE) - description of the type of data found in the table. Usually this is set to 'clinical data', unless a description is provided in the source files

    • idc_version_table_added (STRING, NULLABLE) - the IDC data version for which this table was first added

    • idc_table_added_datetime (STRING,NULLABLE) - the date/time this particular table was first generated

    • post_process_src (STRING, NULLABLE) - except for the CPTAC and TCGA collections the tables are curated from ZIP, Excel, and CSV files downloaded from the TCIA wiki. These files do not have a consistent structure and were not meant to be machine readable or to translate directly into BigQuery. A semi-manual curation process results in either a CSV of JSON file that can be directly written into a BigQuery table. post_process_src is the name of the JSON or CSV file that results from this process and is used to create the BigQuery table. This field is not used for the CPTAC- and TCGA-related tables

    • post_process_src_add_md5 (STRING, NULLABLE) - the md5 hash of post_process_src when the table was first added

    • idc_version_table_prior (STRING, NULLABLE) - the idc version the second most recent time the table was updated

    • post_process_src_prior_md5 (STRING, NULLABLE) - the md5 hash of post_process_src the second most recent time the table was updated

    • idc_version_table_updated (STRING, NULLABLE) - the idc version when the table was last updated

    • table_update_datetime (STRING, NULLABLE) - date and time an update of the table was last recorded

    • post_process_src_updated_md5 (STRING, NULLABLE) - the md5 hash of post_process_source when the table was last updated

    • number_batches (INTEGER, NULLABLE) - records the number of batches. Within the source data patients are sometimes grouped into different 'batches' (i.e. training vs test, responder vs non-responder etc.) and the batches are placed in different locations (i.e. different files or different sheets in the same Excel file)

    • source_info (RECORD, REPEATED) - an array of records with information about the table sources. These sources are either files downloaded from the TCIA wiki or another BigQuery table (as is the case for CPTAC and TCGA collections). There is a source_info record for each source 'batch' described above

    • source_info.srcs (STRING, REPEATED) - a source file downloaded from the TCIA wiki may be a ZIP file, and CSV file, or an Excel file. Sometimes the ZIP files contain other ZIP files that must be opened to extract the clinical data. In the source_info.src array the first string is the file that is downloaded from TCIA for this particular source batch. The final string is the CSV or Excel file that contains the clinical data. Any intermediate strings are the names of ZIP files 'in between' the downloaded file and the clinical file. For CPTAC and TCGA collections this field contains the source BigQuery table

    • source_info.md5 (STRING, NULLABLE) - md5 hash of the downloaded file from TCIA the most recent time the table was updated

    • source_info.table_last_modified (STRING, NULLABLE) - CPTAC and TCGA collections only. The date and time the source BigQuery table was most recently modified, as recorded when last copied

    • source_info.table_size (STRING, NULLABLE) - CPTAC and TCGA collections only. The size of the source BigQuery table as recorded when last copied

    Structure of column_metadata table:

    • collection_id (STRING,NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id field in the dicom_all view in the idc_current dataset.

    • case_col (BOOLEAN, NULLABLE) - true if the BigQuery column contains the patient or case id, i.e. if this column is used to determine the value of the dicom_patient_id column

    • table_name (STRING, NULLABLE) - table name

    • column (STRING, NULLABLE) - the actual column name in the table. For ACRIN collections the column_name is the variable_name from the provided data dictionary. For other collections it is a name constructed by 'normalizing' the column_label (see next) in a format that can be used as a BigQuery field name

    • column_label (STRING, NULLABLE) - a 'free form' label for the column that does not need to conform to the BigQuery column format requirements. For ACRIN collections this is the variable_label given by a data dictionary that accompanies the collection. For other collections it is the name or label of the clinical attribute as inferred from the source document during the curation process

    • data_type (STRING, NULLABLE) - the type of data in this column. Again for ACRIN collections this is provided in the data dictionary. For other collections it is inferred by analyzing the data during curation

    • original_column_headers (STRING, REPEATED) - the name(s) or label(s) in the source document that were used to construct the column_label field. In most cases there is one column label in the source document that perscribes the column_label. In some cases, multiple columns are concantenated and reformated to form the column_label

    • values (RECORD, REPEATED) - a structure that is borrowed from the ACRIN data model. This is an array that contains observerd attribute values for this given column. For ACRIN collections these values are reported in the data dictionary. For most other collections these values are determined by analyzing the source data. For simplicity this field is left blank when the number of unique values is greater than 20

    • values.option_code (STRING, NULLABLE) - a unique attribute value found in this column

    • values.option_description (STRING, NULLABLE) - a description of the option_code as provided by a data dictionary. For collections that do not have a data dictionary this is null.

    • values_source (STRING, NULLABLE) - indicates the source of the values records. The text 'provided dictionary' indicates that the records were obtained from a provided data dictionary. The text 'derived from inspection of values' indicates that the records were determined by automated analysis of the source materials during the ETL process that generated the BigQuery tables.

    • files (STRING, REPEATED) - names of the files that contain the source data for each batch. These are the Excel or CSV files directly downloaded from TCIA, or the files extracted from downloaded ZIP files

    • sheet_names (STRING, REPEATED) - for Excel-sourced files, the sheet names containing this column's values for each batch

    • batch (INTEGER, REPEATED) - source batches that contain this particular column. Some columns or attributes may be missing from some batches

    • column_numbers (STRING, REPEATED) - for each source batch, the column in the original source corresponding to this column in the BigQuery table

    IDC clinical data exploration Colab notebook
    this dashboard

    Clinical data is often critical in understanding imaging data, and is essential for the development and validation of imaging biomarkers. However, such data is most often stored in spreadsheets that follow conventions specific to the site that collected the data, may not be accompanied by the dictionary defining the terms used in describing clinical data, and is rarely harmonized. This can be observed on the example of various collections ingested into IDC from The Cancer Imaging Archive (TCIA), such as the ACRIN 6698 collection.

    Not only the terms used in the clinical data accompanying individual collection are not harmonized, but the format of the spreadsheets is also collection-specific. In order to search and navigate clinical data, one has to parse those collection specific tables, and there is no interface to support searching across collections.

    With the release v11 of IDC, we make the attempt to lower the barriers for accessing clinical data accompanying IDC imaging collections. We parse collection-specific tables, and organize the underlying data into BigQuery tables that can be accessed using standard SQL queries. You can also see the summary of clinical data available for IDC collections in this dashboard.

    Clinical data BigQuery tables

    As of Version 11, IDC has provided a public BigQuery dataset with clinical data associated with several of its imaging collections. The clinical data tables associated with a particular version are in the dataset bigquery-public-data.idc_<idc_version_number>_clinical. In addition the dataset bigquery-public-data.idc_current_clinical has an identically named view for each table in the BQ clinical dataset corresponding to the current IDC release.

    There are currently 130 tables with clinical data representing 70 different collections. Most of this data was curated from Excel and CSV files downloaded from The Cancer Imaging Archive (TCIA) wiki. For most collections data is placed in a single table named <collection_id>_clinical, where <collection_id> is the name of the collection in a standardized format (i.e. the idc_webapp_collection_id column in the dicom_all view in the idc_current dataset).

    Collections from the ACRIN project have different types of clinical data spread across CSV files, and so this data is represented by several BigQuery tables. The clinical data for collections in the CPTAC program program is not curated from TCIA but instead is copied from a BigQuery table in the ISB-CGC project, which in turn was sourced from the Genomics Data Commons (GDC) api. Similarly clinical data for collections in the TCGA program is copied from the table tcga_clinical_rel9 in the idc_current dataset, which was also created using the Genomics Data Commons (GDC) api. Every clinical data table contains two fields we have introduced, dicom_patient_id and source_batch. dicom_patient_id is identical to the PatientID field in the DICOM files that correspond to the given patient. The dicom_patient_id value is determined by inspecting the patient column in the clinical data file. In some of the collections' clinical data, the patients are separated into different 'batches' i.e. different source files, or different sheets in the same Excel file. The source_batch field is an integer indicating the 'batch' for the given patient. For most collections, in which all patients data is found in the same location, the source_batch value is zero.

    Most of the clinical tables are legible by themselves. Tables from the ACRIN collection are an exception as the column names and some of the column values are coded. To provide for clarity and ease of use of all clinical data, we have created two metadata tables, table_metadata and column_metadata that provide information about the structure and provenance of all data in this dataset. table_metadata has table-level metadata about each clinical collection, while column_metadata has column-level metadata.

    Structure of thetable_metadata table:

    • collection_id (STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id field in the dicom_all view in the idc_current dataset.

    • table_name (STRING,NULLABLE) - name of the table

    • table_description (STRING,NULLABLE) - description of the type of data found in the table. Usually this is set to 'clinical data', unless a description is provided in the source files

    • idc_version_table_added (STRING, NULLABLE) - the IDC data version for which this table was first added

    • idc_table_added_datetime (STRING,NULLABLE) - the date/time this particular table was first generated

    • post_process_src (STRING, NULLABLE) - except for the CPTAC and TCGA collections the tables are curated from ZIP, Excel, and CSV files downloaded from the TCIA wiki. These files do not have a consistent structure and were not meant to be machine readable or to translate directly into BigQuery. A semi-manual curation process results in either a CSV of JSON file that can be directly written into a BigQuery table. post_process_src is the name of the JSON or CSV file that results from this process and is used to create the BigQuery table. This field is not used for the CPTAC- and TCGA-related tables

    • post_process_src_add_md5 (STRING, NULLABLE) - the md5 hash of post_process_src when the table was first added

    • idc_version_table_prior (STRING, NULLABLE) - the idc version the second most recent time the table was updated

    • post_process_src_prior_md5 (STRING, NULLABLE) - the md5 hash of post_process_src the second most recent time the table was updated

    • idc_version_table_updated (STRING, NULLABLE) - the idc version when the table was last updated

    • table_update_datetime (STRING, NULLABLE) - date and time an update of the table was last recorded

    • post_process_src_updated_md5 (STRING, NULLABLE) - the md5 hash of post_process_source when the table was last updated

    • number_batches (INTEGER, NULLABLE) - records the number of batches. Within the source data patients are sometimes grouped into different 'batches' (i.e. training vs test, responder vs non-responder etc.) and the batches are placed in different locations (i.e. different files or different sheets in the same Excel file)

    • source_info (RECORD, REPEATED) - an array of records with information about the table sources. These sources are either files downloaded from the TCIA wiki or another BigQuery table (as is the case for CPTAC and TCGA collections). There is a source_info record for each source 'batch' described above

    • source_info.srcs (STRING, REPEATED) - a source file downloaded from the TCIA wiki may be a ZIP file, and CSV file, or an Excel file. Sometimes the ZIP files contain other ZIP files that must be opened to extract the clinical data. In the source_info.src array the first string is the file that is downloaded from TCIA for this particular source batch. The final string is the CSV or Excel file that contains the clinical data. Any intermediate strings are the names of ZIP files 'in between' the downloaded file and the clinical file. For CPTAC and TCGA collections this field contains the source BigQuery table

    • source_info.md5 (STRING, NULLABLE) - md5 hash of the downloaded file from TCIA the most recent time the table was updated

    • source_info.table_last_modified (STRING, NULLABLE) - CPTAC and TCGA collections only. The date and time the source BigQuery table was most recently modified, as recorded when last copied

    • source_info.table_size (STRING, NULLABLE) - CPTAC and TCGA collections only. The size of the source BigQuery table as recorded when last copied

    Structure of column_metadata table:

    • collection_id (STRING,NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id field in the dicom_all view in the idc_current dataset.

    • case_col (BOOLEAN, NULLABLE) - true if the BigQuery column contains the patient or case id, i.e. if this column is used to determine the value of the dicom_patient_id column

    • table_name (STRING, NULLABLE) - table name

    • column (STRING, NULLABLE) - the actual column name in the table. For ACRIN collections the column_name is the variable_name from the provided data dictionary. For other collections it is a name constructed by 'normalizing' the column_label (see next) in a format that can be used as a BigQuery field name

    • column_label (STRING, NULLABLE) - a 'free form' label for the column that does not need to conform to the BigQuery column format requirements. For ACRIN collections this is the variable_label given by a data dictionary that accompanies the collection. For other collections it is the name or label of the clinical attribute as inferred from the source document during the curation process

    • data_type (STRING, NULLABLE) - the type of data in this column. Again for ACRIN collections this is provided in the data dictionary. For other collections it is inferred by analyzing the data during curation

    • original_column_headers (STRING, REPEATED) - the name(s) or label(s) in the source document that were used to construct the column_label field. In most cases there is one column label in the source document that perscribes the column_label. In some cases, multiple columns are concantenated and reformated to form the column_label

    • values (RECORD, REPEATED) - a structure that is borrowed from the ACRIN data model. This is an array that contains observerd attribute values for this given column. For ACRIN collections these values are reported in the data dictionary. For most other collections these values are determined by analyzing the source data. For simplicity this field is left blank when the number of unique values is greater than 20

    • values.option_code (STRING, NULLABLE) - a unique attribute value found in this column

    • values.option_description (STRING, NULLABLE) - a description of the option_code as provided by a data dictionary. For collections that do not have a data dictionary this is null.

    • values_source (STRING, NULLABLE) - indicates the source of the values records. The text 'provided dictionary' indicates that the records were obtained from a provided data dictionary. The text 'derived from inspection of values' indicates that the records were determined by automated analysis of the source materials during the ETL process that generated the BigQuery tables.

    • files (STRING, REPEATED) - names of the files that contain the source data for each batch. These are the Excel or CSV files directly downloaded from TCIA, or the files extracted from downloaded ZIP files

    • sheet_names (STRING, REPEATED) - for Excel-sourced files, the sheet names containing this column's values for each batch

    • batch (INTEGER, REPEATED) - source batches that contain this particular column. Some columns or attributes may be missing from some batches

    • column_numbers (STRING, REPEATED) - for each source batch, the column in the original source corresponding to this column in the BigQuery table

    IDC clinical data exploration Colab notebook
    IDC data ingestion workflow

    Data covered by a non-restrictive license (CC-BY or like) and not labeled as such that may contain head scans. This category contains >90% of the data in IDC.

    AWS: idc-open-data GCS: idc-open-data (until IDC v19, we utilized GCS bucket public-datasets-idc before it was superseded by idc-open-data)

    Collections that may contain head scans. This is done for the collections that were labeled as such by TCIA, in case there is a change in policy and we need to treat such images in any special way in the future.

    AWS: idc-open-data-two GCS: idc-open-idc1

    Data that is covered by a license that restricts commercial use (CC-NC). Note that the license information is available programmatically at the granularity of the individual files, as explained in this tutorial - you do not need to check the bucket name to get the license information!

    AWS: idc-open-data-cr GCS: idc-open-cr

    "Getting started" tutorial
    here
    here
    collections hosted by IDC
    Google Public Data Program
    AWS Open Data Sponsorship Program
    data versioning
    earlier
    BigQuery (BQ)
    standard SQL
    idc-index python package
    this tutorial
    https://parquet.apache.org/
    https://duckdb.org/

    API Description and link to SwaggerUI interface.

    countsstringOptional

    If True, return counts of DICOM objects

    Default: False
    group_sizestringOptional

    If True, return size in bytes of instances in group

    Default: False
    sqlstringOptional

    If True, return the BQ SQL for this query.

    Default: False
    page_sizeintegerOptional

    Maximum number of rows to return

    Default: 1000
    Responses
    200

    Preview cohort spec and manifest

    202

    Timeout waiting for BQ job to complete

    500

    Server error message

    post
    /cohorts/manifest/preview
    get
    /cohorts/manifest/preview/nextPage
    pip install --upgrade idc-index
    from idc_index import IDCClient
    
    # instantiate the client
    client = IDCClient()
    
    # define and execute the query
    selection_query = """
    SELECT SeriesInstanceUID
    FROM index
    WHERE Modality = 'MR'
    """
    selection_result = client.sql_query(selection_query)
    
    # download the first series from the list
    client.download_dicom_series(seriesInstanceUID=selection_result["SeriesInstanceUID"].values[0],downloadDir=".")
    $ s5cmd --no-sign-request ls s3://idc-open-metadata/bigquery_export/
                                      DIR  idc_current/
                                      DIR  idc_current_clinical/
                                      DIR  idc_v1/
                                      DIR  idc_v10/
                                      DIR  idc_v11/
                                      DIR  idc_v11_clinical/
                                      DIR  idc_v12/
                                      DIR  idc_v12_clinical/
                                      DIR  idc_v13/
                                      DIR  idc_v13_clinical/
                                      DIR  idc_v14/
                                      DIR  idc_v14_clinical/
                                      DIR  idc_v15/
                                      DIR  idc_v15_clinical/
                                      DIR  idc_v16/
                                      DIR  idc_v16_clinical/
                                      DIR  idc_v17/
                                      DIR  idc_v17_clinical/
                                      DIR  idc_v18/
                                      DIR  idc_v18_clinical/
                                      DIR  idc_v19/
                                      DIR  idc_v19_clinical/
                                      DIR  idc_v2/
                                      DIR  idc_v20/
                                      DIR  idc_v20_clinical/
                                      DIR  idc_v21/
                                      DIR  idc_v21_clinical/
                                      DIR  idc_v3/
                                      DIR  idc_v4/
                                      DIR  idc_v5/
                                      DIR  idc_v6/
                                      DIR  idc_v7/
                                      DIR  idc_v8/
                                      DIR  idc_v9/
    $ s5cmd --no-sign-request ls s3://idc-open-metadata/bigquery_export/idc_current/dicom_all/
    2024/11/23 18:01:07           7545045  000000000000.parquet
    2024/11/23 18:01:07           7687834  000000000001.parquet
    2024/11/23 18:01:07           7409070  000000000002.parquet
    2024/11/23 18:01:07           7527558  000000000003.parquet
    ...
    ...
    2024/11/23 18:00:14           7501451  000000004997.parquet
    2024/11/23 18:00:14           7521972  000000004998.parquet
    2024/11/23 18:00:14           7575037  000000004999.parquet
    2024/09/12 18:20:05            588723  000000005000.parquet
    import duckdb
    
    # Connect to DuckDB (in-memory)
    con = duckdb.connect()
    
    # Install and load the httpfs extension for S3 access
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    
    # No credentials needed for public buckets
    
    # Query all Parquet files in the public S3 folder
    selection_query = """
    SELECT SeriesInstanceUID
    FROM read_parquet('s3://idc-open-metadata/bigquery_export/idc_current/dicom_all/*.parquet') AS dicom_all
    WHERE Modality = 'MR'
    LIMIT 1
    """
    selection_result = con.execute(selection_query).fetchdf()
    print(selection_result['SeriesInstanceUID'].values[0])
    {
      "message": "text",
      "documentation": "text",
      "code": 1
    }
    {
      "idc_data_version": "text",
      "data_sources": [
        {
          "data_source": "text",
          "filters": [
            {
              "name": "text",
              "data_type": "CONTINUOUS_NUMERIC",
              "units": "text"
            }
          ]
        }
      ],
      "code": 1
    }
    {
      "values": [
        "text"
      ]
    }
    {
      "idc_data_version": "text",
      "data_sources": [
        {
          "data_source": "text",
          "fields": [
            "text"
          ]
        }
      ],
      "code": 1
    }
    POST /cohorts/manifest/preview HTTP/1.1
    Host: 
    Content-Type: application/json
    Accept: */*
    Content-Length: 345
    
    {
      "cohort_def": {
        "name": "mycohort",
        "description": "Example description",
        "filters": {
          "collection_id": [
            "TCGA_luad",
            "tcga_kirc"
          ],
          "Modality": [
            "CT",
            "MR"
          ],
          "Race": [
            "WHITE"
          ],
          "age_at_diagnosis_btw": [
            65,
            75
          ]
        }
      },
      "fields": [
        "Age_At_Diagnosis",
        "aws_bucket",
        "crdc_series_uuid",
        "Modality",
        "SliceThickness"
      ],
      "counts": true,
      "group_size": true,
      "sql": true,
      "page_size": 1000
    }
    {
      "cohort": {
        "name": "text",
        "description": "text",
        "filterSet": {
          "idc_data_version": "text",
          "filters": {
            "project_short_name": [
              "text"
            ],
            "sample_type": [
              "text"
            ],
            "age_at_diagnosis_btw": [
              1
            ],
            "age_at_diagnosis_ebtw": [
              1
            ],
            "age_at_diagnosis_btwe": [
              1
            ],
            "age_at_diagnosis_ebtwe": [
              1
            ],
            "age_at_diagnosis_gte": [
              1
            ],
            "age_at_diagnosis_lte": [
              1
            ],
            "age_at_diagnosis_gt": [
              1
            ],
            "age_at_diagnosis_lt": [
              1
            ],
            "age_at_diagnosis_eq": [
              1
            ],
            "bmi_btw": [
              1
            ],
            "bmi_ebtw": [
              1
            ],
            "bmi_btwe": [
              1
            ],
            "bmi_ebtwe": [
              1
            ],
            "bmi_gte": [
              1
            ],
            "bmi_lte": [
              1
            ],
            "bmi_gt": [
              1
            ],
            "bmi_lt": [
              1
            ],
            "bmi_eq": [
              1
            ],
            "country": [
              "text"
            ],
            "disease_code": [
              "text"
            ],
            "ethnicity": [
              "text"
            ],
            "gender": [
              "text"
            ],
            "histological_type": [
              "text"
            ],
            "pathologic_stage": [
              "text"
            ],
            "project_name": [
              "text"
            ],
            "race": [
              "text"
            ],
            "tumor_tissue_site": [
              "text"
            ],
            "vital_status": [
              "text"
            ],
            "analysis_results_id": [
              "text"
            ],
            "AnatomicRegionSequence": [
              "text"
            ],
            "Apparent_Diffusion_Coefficient_btw": [
              1
            ],
            "Apparent_Diffusion_Coefficient_ebtw": [
              1
            ],
            "Apparent_Diffusion_Coefficient_btwe": [
              1
            ],
            "Apparent_Diffusion_Coefficient_ebtwe": [
              1
            ],
            "Apparent_Diffusion_Coefficient_gte": [
              1
            ],
            "Apparent_Diffusion_Coefficient_lte": [
              1
            ],
            "Apparent_Diffusion_Coefficient_gt": [
              1
            ],
            "Apparent_Diffusion_Coefficient_lt": [
              1
            ],
            "Apparent_Diffusion_Coefficient_eq": [
              1
            ],
            "BodyPartExamined": [
              "text"
            ],
            "Calcification": [
              "text"
            ],
            "CancerType": [
              "text"
            ],
            "collection_id": [
              "text"
            ],
            "Diameter_btw": [
              1
            ],
            "Diameter_ebtw": [
              1
            ],
            "Diameter_btwe": [
              1
            ],
            "Diameter_ebtwe": [
              1
            ],
            "Diameter_gte": [
              1
            ],
            "Diameter_lte": [
              1
            ],
            "Diameter_gt": [
              1
            ],
            "Diameter_lt": [
              1
            ],
            "Diameter_eq": [
              1
            ],
            "illuminationType": [
              "text"
            ],
            "Internal_structure": [
              "text"
            ],
            "license_short_name": [
              "text"
            ],
            "Lobular_Pattern": [
              "text"
            ],
            "Malignancy": [
              "text"
            ],
            "Manufacturer": [
              "text"
            ],
            "ManufacturerModelName": [
              "text"
            ],
            "Margin": [
              "text"
            ],
            "max_TotalPixelMatrixColumns_btw": [
              1
            ],
            "max_TotalPixelMatrixColumns_ebtw": [
              1
            ],
            "max_TotalPixelMatrixColumns_btwe": [
              1
            ],
            "max_TotalPixelMatrixColumns_ebtwe": [
              1
            ],
            "max_TotalPixelMatrixColumns_gte": [
              1
            ],
            "max_TotalPixelMatrixColumns_lte": [
              1
            ],
            "max_TotalPixelMatrixColumns_gt": [
              1
            ],
            "max_TotalPixelMatrixColumns_lt": [
              1
            ],
            "max_TotalPixelMatrixColumns_eq": [
              1
            ],
            "max_TotalPixelMatrixRows_btw": [
              1
            ],
            "max_TotalPixelMatrixRows_ebtw": [
              1
            ],
            "max_TotalPixelMatrixRows_btwe": [
              1
            ],
            "max_TotalPixelMatrixRows_ebtwe": [
              1
            ],
            "max_TotalPixelMatrixRows_gte": [
              1
            ],
            "max_TotalPixelMatrixRows_lte": [
              1
            ],
            "max_TotalPixelMatrixRows_gt": [
              1
            ],
            "max_TotalPixelMatrixRows_lt": [
              1
            ],
            "max_TotalPixelMatrixRows_eq": [
              1
            ],
            "min_PixelSpacing_btw": [
              1
            ],
            "min_PixelSpacing_ebtw": [
              1
            ],
            "min_PixelSpacing_btwe": [
              1
            ],
            "min_PixelSpacing_ebtwe": [
              1
            ],
            "min_PixelSpacing_gte": [
              1
            ],
            "min_PixelSpacing_lte": [
              1
            ],
            "min_PixelSpacing_gt": [
              1
            ],
            "min_PixelSpacing_lt": [
              1
            ],
            "min_PixelSpacing_eq": [
              1
            ],
            "Modality": [
              "text"
            ],
            "ObjectiveLensPower": [
              1
            ],
            "PatientID": [
              "text"
            ],
            "primaryAnatomicStructure": [
              "text"
            ],
            "SamplesPerPixel": [
              "text"
            ],
            "SegmentAlgorithmName": [
              "text"
            ],
            "SegmentAlgorithmType": [
              "text"
            ],
            "SegmentedPropertyCategoryCodeSequence": [
              "text"
            ],
            "SegmentedPropertyTypeCodeSequence": [
              "text"
            ],
            "SeriesDescription": [
              "text"
            ],
            "SeriesInstanceUID": [
              "text"
            ],
            "SeriesNumber": [
              "text"
            ],
            "SliceThickness_btw": [
              1
            ],
            "SliceThickness_ebtw": [
              1
            ],
            "SliceThickness_btwe": [
              1
            ],
            "SliceThickness_ebtwe": [
              1
            ],
            "SliceThickness_gte": [
              1
            ],
            "SliceThickness_lte": [
              1
            ],
            "SliceThickness_gt": [
              1
            ],
            "SliceThickness_lt": [
              1
            ],
            "SliceThickness_eq": [
              1
            ],
            "SOPClassUID": [
              "text"
            ],
            "SOPInstanceUID": [
              "text"
            ],
            "Sphericity": [
              "text"
            ],
            "Sphericity_quant_btw": [
              1
            ],
            "Sphericity_quant_ebtw": [
              1
            ],
            "Sphericity_quant_btwe": [
              1
            ],
            "Sphericity_quant_ebtwe": [
              1
            ],
            "Sphericity_quant_gte": [
              1
            ],
            "Sphericity_quant_lte": [
              1
            ],
            "Sphericity_quant_gt": [
              1
            ],
            "Sphericity_quant_lt": [
              1
            ],
            "Sphericity_quant_eq": [
              1
            ],
            "Spiculation": [
              "text"
            ],
            "StudyDate": [
              "text"
            ],
            "StudyDescription": [
              "text"
            ],
            "StudyInstanceUID": [
              "text"
            ],
            "Subtlety_score": [
              "text"
            ],
            "Surface_area_of_mesh_btw": [
              1
            ],
            "Surface_area_of_mesh_ebtw": [
              1
            ],
            "Surface_area_of_mesh_btwe": [
              1
            ],
            "Surface_area_of_mesh_ebtwe": [
              1
            ],
            "Surface_area_of_mesh_gte": [
              1
            ],
            "Surface_area_of_mesh_lte": [
              1
            ],
            "Surface_area_of_mesh_gt": [
              1
            ],
            "Surface_area_of_mesh_lt": [
              1
            ],
            "Surface_area_of_mesh_eq": [
              1
            ],
            "tcia_species": [
              "text"
            ],
            "tcia_tumorLocation": [
              "text"
            ],
            "Texture": [
              "text"
            ],
            "Volume_btw": [
              1
            ],
            "Volume_ebtw": [
              1
            ],
            "Volume_btwe": [
              1
            ],
            "Volume_ebtwe": [
              1
            ],
            "Volume_gte": [
              1
            ],
            "Volume_lte": [
              1
            ],
            "Volume_gt": [
              1
            ],
            "Volume_lt": [
              1
            ],
            "Volume_eq": [
              1
            ],
            "Volume_of_Mesh_btw": [
              1
            ],
            "Volume_of_Mesh_ebtw": [
              1
            ],
            "Volume_of_Mesh_btwe": [
              1
            ],
            "Volume_of_Mesh_ebtwe": [
              1
            ],
            "Volume_of_Mesh_gte": [
              1
            ],
            "Volume_of_Mesh_lte": [
              1
            ],
            "Volume_of_Mesh_gt": [
              1
            ],
            "Volume_of_Mesh_lt": [
              1
            ],
            "Volume_of_Mesh_eq": [
              1
            ]
          }
        },
        "sql": "text"
      },
      "manifest": {
        "manifest_data": [
          {
            "age_at_diagnosis": 1,
            "analysis_results_id": "text",
            "AnatomicRegionSequence": "text",
            "Apparent_Diffusion_Coefficient": 1,
            "aws_bucket": "text",
            "aws_url": "text",
            "bmi": 1,
            "BodyPartExamined": "text",
            "Calcification": "text",
            "CancerType": "text",
            "collection_id": "text",
            "country": "text",
            "crdc_instance_uuid": "text",
            "crdc_series_uuid": "text",
            "crdc_study_uuid": "text",
            "Diameter": 1,
            "disease_code": "text",
            "ethnicity": "text",
            "FrameOfReferenceUID": "text",
            "gcs_bucket": "text",
            "gcs_url": "text",
            "gender": "text",
            "histological_type": "text",
            "illuminationType": "text",
            "instance_size": 1,
            "Internal_structure": "text",
            "license_short_name": "text",
            "Lobular_Pattern": "text",
            "Malignancy": "text",
            "Manufacturer": "text",
            "ManufacturerModelName": "text",
            "Margin": "text",
            "max_TotalPixelMatrixColumns": 1,
            "max_TotalPixelMatrixRows": 1,
            "min_PixelSpacing": 1,
            "Modality": "text",
            "ObjectiveLensPower": 1,
            "pathologic_stage": "text",
            "PatientID": "text",
            "primaryAnatomicStructure": "text",
            "Program": "text",
            "project_name": "text",
            "project_short_name": "text",
            "race": "text",
            "sample_type": "text",
            "SamplesPerPixel": 1,
            "SegmentAlgorithmType": "text",
            "SegmentedPropertyCategoryCodeSequence": "text",
            "SegmentedPropertyTypeCodeSequence": "text",
            "SegmentNumber": 1,
            "SeriesDescription": "text",
            "SeriesInstanceUID": "text",
            "SeriesNumber": "text",
            "SliceThickness": 1,
            "SOPClassUID": "text",
            "SOPInstanceUID": "text",
            "source_DOI": "text",
            "Sphericity": "text",
            "Sphericity_quant": 1,
            "Spiculation": "text",
            "StudyDate": "text",
            "StudyDescription": "text",
            "StudyInstanceUID": "text",
            "Subtlety_score": "text",
            "Surface_area_of_mesh": 1,
            "tcia_species": "text",
            "tcia_tumorLocation": "text",
            "Texture": "text",
            "tumor_tissue_site": "text",
            "vital_status": "text",
            "Volume": 1,
            "Volume_of_Mesh": 1,
            "instance_count": 1,
            "series_count": 1,
            "study_count": 1,
            "patient_count": 1,
            "collection_count": 1,
            "group_size": 1
          }
        ],
        "totalFound": 1,
        "rowsReturned": 1
      },
      "next_page": "text",
      "code": 1
    }
    GET /cohorts/manifest/preview/nextPage?next_page=text HTTP/1.1
    Host: 
    Accept: */*
    
    {
      "manifest": {
        "manifest_data": [
          {
            "age_at_diagnosis": 1,
            "analysis_results_id": "text",
            "AnatomicRegionSequence": "text",
            "Apparent_Diffusion_Coefficient": 1,
            "aws_bucket": "text",
            "aws_url": "text",
            "bmi": 1,
            "BodyPartExamined": "text",
            "Calcification": "text",
            "CancerType": "text",
            "collection_id": "text",
            "country": "text",
            "crdc_instance_uuid": "text",
            "crdc_series_uuid": "text",
            "crdc_study_uuid": "text",
            "Diameter": 1,
            "disease_code": "text",
            "ethnicity": "text",
            "FrameOfReferenceUID": "text",
            "gcs_bucket": "text",
            "gcs_url": "text",
            "gender": "text",
            "histological_type": "text",
            "illuminationType": "text",
            "instance_size": 1,
            "Internal_structure": "text",
            "license_short_name": "text",
            "Lobular_Pattern": "text",
            "Malignancy": "text",
            "Manufacturer": "text",
            "ManufacturerModelName": "text",
            "Margin": "text",
            "max_TotalPixelMatrixColumns": 1,
            "max_TotalPixelMatrixRows": 1,
            "min_PixelSpacing": 1,
            "Modality": "text",
            "ObjectiveLensPower": 1,
            "pathologic_stage": "text",
            "PatientID": "text",
            "primaryAnatomicStructure": "text",
            "Program": "text",
            "project_name": "text",
            "project_short_name": "text",
            "race": "text",
            "sample_type": "text",
            "SamplesPerPixel": 1,
            "SegmentAlgorithmType": "text",
            "SegmentedPropertyCategoryCodeSequence": "text",
            "SegmentedPropertyTypeCodeSequence": "text",
            "SegmentNumber": 1,
            "SeriesDescription": "text",
            "SeriesInstanceUID": "text",
            "SeriesNumber": "text",
            "SliceThickness": 1,
            "SOPClassUID": "text",
            "SOPInstanceUID": "text",
            "source_DOI": "text",
            "Sphericity": "text",
            "Sphericity_quant": 1,
            "Spiculation": "text",
            "StudyDate": "text",
            "StudyDescription": "text",
            "StudyInstanceUID": "text",
            "Subtlety_score": "text",
            "Surface_area_of_mesh": 1,
            "tcia_species": "text",
            "tcia_tumorLocation": "text",
            "Texture": "text",
            "tumor_tissue_site": "text",
            "vital_status": "text",
            "Volume": 1,
            "Volume_of_Mesh": 1,
            "instance_count": 1,
            "series_count": 1,
            "study_count": 1,
            "patient_count": 1,
            "collection_count": 1,
            "group_size": 1
          }
        ],
        "totalFound": 1,
        "rowsReturned": 1
      },
      "next_page": "text",
      "code": 1
    }
    GET /about HTTP/1.1
    Host: 
    Accept: */*
    
    GET /versions HTTP/1.1
    Host: 
    Accept: */*
    
    {
      "idc_data_versions": [
        {
          "idc_data_version": "text",
          "date_active": "text",
          "active": "text"
        }
      ],
      "code": 1
    }
    GET /collections HTTP/1.1
    Host: 
    Accept: */*
    
    {
      "collections": [
        {
          "collection_id": "text",
          "cancer_type": "text",
          "date_updated": "text",
          "description": "text",
          "source_doi": "text",
          "source_url": "text",
          "image_types": "text",
          "location": "text",
          "species": "text",
          "subject_count": 1,
          "supporting_data": "text"
        }
      ],
      "code": 1
    }
    GET /analysis_results HTTP/1.1
    Host: 
    Accept: */*
    
    {
      "analysisResults": [
        {
          "analysis_result_id": "text",
          "analysisArtifacts": "text",
          "cancer_type": "text",
          "collections": "text",
          "date_updated": "text",
          "description": "text",
          "doi": "text",
          "location": "text",
          "subjects": 1,
          "title": "text"
        }
      ],
      "code": 1
    }
    GET /fields/{version} HTTP/1.1
    Host: 
    Accept: */*
    
    GET /filters HTTP/1.1
    Host: 
    Accept: */*
    
    GET /filters/values/{filter} HTTP/1.1
    Host: 
    Accept: */*
    
    - details in
  • You can use open source VolView zero-footprint viewer to visualize and volume render any image series by simply pointing it to the cloud bucket with the files - see details in this tutorial

  • this
    page
    IDC Discourse
    IDC Discourse
    https://viewer.imaging.datacommons.cancer.gov/viewer/
    https://viewer.imaging.datacommons.cancer.gov/slim/studies
    https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.6279.6001.224985459390356936417021464571
    https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.6279.6001.224985459390356936417021464571?seriesInstanceUID=1.2.276.0.7230010.3.1.3.0.57823.1553343864.578877,1.3.6.1.4.1.14519.5.2.1.6279.6001.273525289046256012743471155680
    https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.211094631316408413440371843585977094852/series/1.3.6.1.4.1.5962.99.1.217222191.146280326.1640894762031.2.0
    OHIF
    Slim
    OHIF FireCloud deployment tutorial
    Slim FireCloud deployment tutorial
    itkWidgets
    Selected Series panel showing series objects not viewable at the series level
    this tutorial
    In this tutorial, we will use the series identified by SeriesInstanceUID from the TCGA-ACC collection 1.3.6.1.4.1.5962.99.1.3140643155.174517037.1639523215699.2.0, which you can download as follows:

    Next, open QuPath and select "File > Open".

    Choose just one of the .dcm files that belong to the desired dataset, then click Open. The remaining files will be automatically detected and should not be selected.

    When prompted for an image type, select Brightfield H&E (or whatever is appropriate for the dataset being opened), then click Apply. This is a QuPath feature intended to aid in analysis, and is further described in the QuPath documentation.

    The image should now display, and can be navigated by zooming/panning as described in the QuPath documentation.

    Zooming and panning in real time:

    The Image tab on the left side of the screen shows dimension information, and lists any associated images. In this case, a thumbnail image is present under Associated Images at the bottom of the Image tab. Double-clicking on Series 1 (THUMBNAIL) will open the thumbnail image in a separate window:

    Open a fluorescence DICOM dataset

    For this part, we will use a slide from the HTAN-OHSU collection identified by SeriesInstanceUID 1.3.6.1.4.1.5962.99.1.1999932010.1115442694.1655562373738.4.0. As before, you can download it as follows:

    As in the brightfield case, open QuPath and select File > Open.

    Choose just one of the .dcm files in the dataset, as the other files will be automatically detected. It does not matter which file is selected. When prompted, set the image type to Fluorescence, or as appropriate for the dataset:

    The image should then display, and can be navigated by zooming/panning as described in the QuPath documentation.

    The Image tab indicates the number of channels (12 in this case). By default, all channels will be displayed at once. This can be changed by selecting View > Brightness/Contrast or the "half-circles" icon in the toolbar:

    Unchecking the Show box will hide the channel's data, and update the image.

    QuPath
    ownload instructions

    IDC API Concepts

    The IDC API is based on IDC Data Model concepts. Several of these concepts have been previously introduced in the context of the IDC Portal. We discuss these concepts here with respect to the IDC API.

    IDC Versions

    As described previously, IDC data is versioned such that searching an IDC version according to some criteria (some filter set as described below) will always identify exactly the same set of DICOM objects.

    The GET /versions API endpoint returns a list of the current and previous IDC data versions.

    Original Collections

    An original collection is a set of DICOM data provided by a single source. (We usually just use collection to mean original collection.) Such collections are comprised primarily of DICOM image data that was obtained from some set of patients. However some original collections also include annotations, segmentations or other analyses of the image data in the collection. Typically, the patients in an collection are related by a common cancer type, though this is not always the case.

    The GET /collections endpoint returns a list of the original collections, in the current IDC version. Some metadata about each collection is provided.

    Analysis Results

    Analysis results are comprised of DICOM data that was generated by analyzing data in one or more original collections. Typically such analysis is performed by a different entity than that which provided the original collection(s) on which the analysis is based. Examples of data in analysis collections include segmentations, annotations and further processing of original images.

    Because a DICOM instance in an analysis result is "in" the same series and study as the DICOM instance data of which it is an analysis result, it is also "in" the same patient, and therefore is considered to be "in" the same collection.

    Specifically, each instance in IDC data has an associated collection_id. An analysis result will have the same collection_id as the original collection of which it is an analysis result.

    The GET /analysis_results endpoint returns a list of the analysis results, with some metadata, in the current IDC version.

    Filter Sets

    A filter set selects some set of DICOM objects in IDC hosted data, and is a set of conditions, where each condition is defined by an attribute and an array of values. An attribute identifies a field (column) in some data source (BQ table). Each filter set also includes the IDC data version upon which it operates.

    Filter sets are JSON encoded. Here is an example filter set:

    A filter set selects a DICOM instance if, for every attribute in the filter set, the instance's corresponding value satisfies one or more of the values in the associated array of values. This is explained further below.

    For example, the (attribute, [values]) pair ("Modality", ["MR", "CT"]) is satisfied if an instance "has" a Modality of MR or CT.

    Note that if a filter set includes more than one (attribute, [values]) pair having the same attribute, then only the last such (attribute, [values]) pair is used. Thus if a filter group includes the (attribute, [values]) pairs ("Modality", ["MR"]) and ("Modality", ["CT"]), in that order, only ("Modality", ["CT"]) is used.

    The filter set above will select any instance in the current IDC version that is in the TCGA_KIRC collection or the TCGA_LUAD' collections. To be selected by the filter, an instance must also have a Modality of CT or MR, and an age_at_diagnosis value between 65 and 75 .

    Because of the hierarchical nature of DICOM, if a filter set selects an instance, it implicitly selects the series, study, patient and collection which contain that instance. A manifest can be configured to return data about some or all of these entities.

    Note that when defining a cohort through the API, the IDC version is always the current IDC version.

    Data Sources

    IDC maintains a set of GCP BigQuery (BQ) tables containing various types of metadata that together describe IDC data.

    In the context of the API, a data source (or just source) is a BQ table that contains some portion of the metadata against which a filter set is applied. An API query to construct a manifest is performed against one or more such tables as needed.

    Attributes

    Both the IDC Web App and API expose selected fields against which queries can be performed. The /filters endpoint returns the available filter attributes The /filters/values/{filter} endpoint returns a list of the values which a specified Categorical String or Categorical Numeric filter attribute will match. Each attribute has a data type, one of:

    • String: An attribute with data type String may have an arbitrary string value. For example, the possible values of a StudyDescription attribute are arbitrary. An object is selected if its String attribute matches any of the values in the values array. Matching is insensitive to the case (upper case, lower case) of the characters in the strings. Thus ("StudyDescription",["PETCT Skull-Thigh"] will match a StudyDescription containing the substring "PETCT SKULL-THIGH", or "petct skull-thigh" etc. Pattern matching in String attributes is also supported. The ('StudyDescription",["%SKULL%", "ABDOMEN%", "%Pelvis"]) filter will match any StudyDescription that contains "SKULL", "skull", "Skull", etc., starts with "ABDOMEN", "abdomen", etc., or ends with "Pelvis", "PELVIS", etc.

    • Categorical String An attribute with data type Categorical String will have one of a defined set of string values. For example, Modality is an Categorical String attribute that has possible values 'CT', 'MR', 'PT', etc. Categorical String attributes have the same matching semantics as for Strings. The /filters/values/{filter} endpoint returns a list of the values accepted for a specified Categorical String attribute (filter).

    Cohorts

    A cohort is the set of DICOM objects in IDC hosted data selected by a filter set.

    The API no longer supports user defined cohorts. However, the POST /cohorts/manifest/preview endpoint effectively creates a cohort, queries the cohort to obtain a manifest of metadata of the objects in the cohort, and then deletes the cohort. The data in the manifest is highly configurable and can be used, with suitable tools, to obtain DICOM files from cloud storage. A manifest returned by the API can include values from a large set of fields.

    Manifests are discussed in the next section.

    IDC API UI

    The can be used to see details about the syntax of each call, and also provides an interface to test requests. Each endpoint is also documented the section.

    Make a Request

    For a quick demonstration of the syntax of an API call, test the request. You can experiment with this endpoint by clicking the ‘Try it out’ button, and then the 'Execute' button.

    The API will return collection metadata for the current IDC data version.

    Request Response

    The Swagger UI submits the request and shows the curl code that was submitted. The Response body section will display the response to the request. The expected JSON schema format of the response to this API request is shown below:

    The actual JSON formatted response can be downloaded to your local file system by clicking the ‘Download’ button.

    DICOMweb

    We welcome your questions or comments on this documentation page content! Please start a thread on , and we will be happy to help you!

    Background

    The interface is available for accessing IDC data. This interface could be especially useful for efficiently downloading small(er) parts of large digital pathology images. While the entire pathology whole-slide image (WSI) pyramid can reach gigabytes in size, the part that is needed for a specific visualization or analysis task can be rather small and localized to the specific image tiles at a given resolution.

    IDC Data Model Concepts

    The IDC API is based on several IDC Data Model Concepts.

    Cohorts

    "In statistics, marketing and demography, a cohort is a group of who share a defining characteristic (typically subjects who experienced a common event in a selected time period, such as birth or graduation)." ()

    In IDC, a cohort is a set of subjects (DICOM patients) that are identified by applying a Filter Set to the

    idc download 1.3.6.1.4.1.5962.99.1.3140643155.174517037.1639523215699.2.0
    idc download 1.3.6.1.4.1.5962.99.1.1999932010.1115442694.1655562373738.4.0
  • Categorical Numeric An attribute with data type Categorical Numeric has one of a defined set of numeric values. The corresponding value array must have a single numeric value. The (attribute, value array) pair for a Categorical Numeric is satisfied if the attribute is equal to the value in the value array. The /filters/values/{filter} endpoint returns a list of the values accepted for a Categorical Numeric attribute (filter).

  • Ranged Integer An attribute with data type Ranged Integer will have an integer value. For example, age_at_diagnosis is an attribute of data type Ranged Integer. In order to enable relative numeric queries, the API exposes eight variations of each Ranged Integer attribute as filter attribute names. These variations are the base attribute name with one of the suffixes: eq, gt, gte, btw, btwe, ebtw, ebtwe, lte, or lt, e.g. age_at_diagnosis_eq. The value array of the btw, btwe, ebtw, and ebtwe variations must contain exactly two integer values, in numeric order (least value first). The value array of the eq, gt, gte, lte, and lt variations must contain exactly one integer values. The (attribute, value array) pair for a Ranged Integer attribute is satisfied according to the suffix as follows:

    • eq: If an attribute is equal to the value in the value array

    • gt: If an attribute is greater than the value in the value array

    • gte: If an attribute is greater than or equal to the value in the value array

    • btw: if an attribute

    Ranged Number An attribute with data type Ranged Number will have a numeric (integer or float) value. For example, diameter is an attribute of data type Ranged Number. In order to enable relative numeric queries, the API exposes eight variations of each Ranged Number attribute as filter attribute names. These variations are the base attribute name with one of the suffixes: eq, gt, gte, btw, btwe, ebtw, ebtwe, lte, or lt, e.g. diameter_eq. The value array of the btw, btwe, ebtw, and ebtwe variations must contain exactly two numeric values, in numeric order (least value first). The value array of the eq, gt, gte, lte, and lt variations must contain exactly one numeric values. The (attribute, value array) pair for a Ranged Number attribute is satisfied according to the suffix as follows:

    • eq: If an attribute is equal to the value in the value array

    • gt: If an attribute is greater than the value in the value array

    • gte: If an attribute is greater than or equal to the value in the value array

  • IDC API UI
    Endpoint Details
    GET /collections

    New to DICOM WSI? Check out our introductory tutorial to learn how slide microscopy images are organized in DICOM.

    Detailed information on the DICOMweb endpoints that are available to access IDC data is provided here. In brief, there are two DICOM stores available - the IDC-maintained DICOM store and the Google-maintained DICOM store - we recommend that you familiarize yourself with the documentation to learn about the differences between the two, and select the option that is optimal for your use case.

    Code snippets included in this article are also replicated in this Google Colab tutorial notebook for your convenience.

    Unique identifiers: locating the relevant slides

    IDC uses DICOM for data organization, and every image contains metadata organized following the data model documented here. Each slide corresponds to a DICOM Series, uniquely identified by the SeriesInstanceUID, which in turn belongs to a DICOM Study identified by the StudyInstanceUID. You will need these two identifiers to access any DICOM slide using DICOMweb!

    Since IDC contains many terabytes of images, you will typically want to first select images/slides that meet your needs. IDC offers various interfaces to explore and subset the data, starting from the IDC Portal, to the Python package idc-index (covered in this tutorial) and BigQuery SQL interfaces (see this tutorial). We strongly recommend you work through the referenced tutorials, but for the purposes of this tutorial, we will demonstrate how you can locate UIDs of a slide that corresponds to pancreas tissue.

    First, install idc-index with pip install —upgrade idc-index (--upgrade part is very important to make sure you are working with the latest data release of IDC!).

    Next, the following snippet demonstrates how to select slides of pancreas tissue (you can also select by the lens magnification, stain, and many other attributes - see this tutorial for details).

    Next, we select the first slide and will use its StudyInstanceUID and SeriesInstanceUID in the subsequent sections of the code.

    ('2.25.25332367070577326639024635995523878122', '1.3.6.1.4.1.5962.99.1.3380245274.1362068963.1639762817818.2.0')

    Reading slide regions via DICOMweb

    We recommend the following two Python libraries that facilitate access to a DICOM store via DICOMweb:

    • wsidicom

    • ez-wsi-dicomweb

    Both libraries can be installed using pip:

    wsidicom is based upon the dicomweb_client Python library, while ez-wsi-dicomweb includes its own DICOMweb implementation.

    Note that you can use wsidicom with both the IDC-maintained and the Google-maintained DICOM store, while ez-wsi-dicomweb only works with the Google-maintained store.

    The following code snippets show exemplarily how to use each of the libraries to access a subregion from a DICOM slide identified by the following UIDs we selected earlier:

    • sample_study_uid = 2.25.25332367070577326639024635995523878122

    • sample_series_uid = 1.3.6.1.4.1.5962.99.1.3380245274.1362068963.1639762817818.2.0

    wsidicom

    When you work with wsidicom, the first step requires setting up dicomweb_client’s DICOMwebClient:

    If you are accessing the Google-maintained DICOM store, you need to authenticate with your Google credentials first and set up an authorized session for the DICOMwebClient.

    As discussed in the corresponding documentation page we mentioned earlier, Google-hosted DICOM store may not contain the latest version of IDC data! You will encounter access issues for slides that are not present. If this the case, you will need to use the IDC-hosted DICOM store instead!

    Otherwise, if you prefer using IDC-maintained proxied DICOM store, you can skip ahead and just set up your DICOMwebClient using the proxy URL.

    Slide access with wsidicom

    You now need to wrap the previously set-up DICOMwebClient into wsidicom’s WsiDicomWebClient. Then you can use the open_web() functionality to find, open and navigate the content of the selected slide:

    To access a certain part of a slide, wsidicom offers the read_region() functionality:

    Screenshot of slide region

    ez-wsi-dicomweb

    The following code shows how to set-up an interface for DICOMweb with ez-wsi-dicomweb. You can only use this interface for accessing data from the Google-maintained DICOM store, which means authentication with you Google account is required.

    The slide, slide level information and slide regions can be accessed as follows. To accelerate image retrieval, ez-wsi-dicomweb can be configured to fetch frames in blocks and cache them for subsequent use. For more information, check out this notebook, section “Enabling EZ-WSI DICOMweb Frame Cache”.

    Screenshot of slide region

    Iterating through tiles using DICOMweb

    To iterate over image tiles you can simply wrap the functionality presented above into your own function that iterates over the coordinates of interest to you. In case you prefer to iterate over the frames as they are stored within the DICOM file, wsidicom does also offer a read_tile() method. Iteration over a slide and accessing tiles from an area defined by a tissue mask can be quite easily achieved using ez-wsi-dicomweb’s DICOMPatchGenerator as described in this notebook in section “Generating patches from a level image”.

    Recommendations

    Both libraries — ez-wsi-dicomweb and wsidicom— can be recommended for reliable DICOMweb access to IDC data. Based on our experience, ez-wsi-dicomweb is often faster, likely due to its caching capabilities, and customizations for efficient access to image patches from a Google DICOM store for AI model training. wsidicom, on the other hand, is a more general-purpose tool offering extensive functionality for accessing DICOM files (images as well as annotation files) both from local disk or from the cloud via DICOMweb. It is important to note that when running code locally, access times may be slightly longer compared to cloud-based (such as in a Colab notebook) execution.

    IDC forum
    DICOMweb
    Data
    Sources
    of some
    IDC
    data
    version
    . Because a
    cohort
    is defined with respect to an
    IDC data version
    , the set of subjects in the
    cohort
    , as well as all metadata associated with those subjects, is exactly and repeatably defined.

    IDC Data Version

    Over time, the set of data hosted by the IDC will change. For the most part, such changes will be due to new data having been added. The totality of IDC hosted data resulting from any such change is represented by a unique IDC data version ID. That is, each time that the set of publicly available data changes, a new IDC version is created that exactly defines the revised data set.

    The IDC data version is intended to enable the reproducibility of research results. For example, consider a patient in the DICOM data model. Over time, new studies might be performed on a patient and become associated with that patient, and the corresponding DICOM instances will then be added to the IDC hosted data. Moreover, additional patients might well be added to the IDC data set over time. This means that the set of subjects defined by some filtering operation will change over time. Thus, for purposes of reproducibility, we define a cohort in terms of a set of filter groups and an IDC data version.

    IDC cohort is uniquely defined by the combination of a set of filter groups and an IDC data version.

    Note that on occasion some data might be removed from a collection, though this is expected to be rare. Such a removal will result in a new IDC data version which excludes that data. Such removed data will, however, continue to be available in any previous IDC data version in which it was available. There is one exception: data that is found to contain Personally Identifiable Information (PII) or Protected Health Information (PHI) will be removed from all IDC data versions.

    Note: currently a cohort is always defined in terms of a single filter group and an IDC Data Version. In the future we may add support for multiple filter groups.

    Filter Group

    A filter group selects some set of subjects in the IDC hosted data, and is a set of conditions, where each condition is defined by an attribute and an array of values. An attribute identifies a field (column) in some data source (BQ table). Each filter group also specifies the IDC data version upon which it operates.

    A filter group selects a subject if, for every attribute in the filter group, some datum associated with the subject satisfies one or more of the values in the associated array of values. A datum satisfies a value if it is equal to, less than, less than or equal to, between, greater than or equal to, or greater than, as required by the attribute. This is explained further below.

    For example, the (attribute, [values]) pair (Modality, [MR, CT]) is satisfied if a subject "has" a Modality of MR or CT in any data associated with that subject. Thus, this (attribute, [values]) pair would be satisfied, for example, by a subject who has one or more MR series but no CT series.

    Note that if a filter group includes more than one (attribute, [values]) pair having the same attribute, then only the last such (attribute, [values]) pair is used. Thus if a filter group includes the (attribute, [values]) pairs (Modality, [MR]) and (Modality, [CT]), in that order, only (Modality, [CT]) is used.

    Here is an example filter group:

    This filter group will select any subject in the TCGA-LUAD or TCGA-KIRC collections, if the subject has any DICOM instances having a modality of CT or MR, the subject's race is WHITE, and the subjects age at diagnosis is between 53 and 69.

    Collections

    A collection is a set of DICOM data provided by a single source. Collections are further categorized as Original collections or Analysis collections. Original collections are comprised primarily of DICOM image data that was obtained from some set of patients. Typically, the patients in an Original collection are related by a common disease.

    Analysis collections are comprised of DICOM data that was generated by analyzing other (typically Original) collections. Typically such analysis is performed by a different entity than that which provided the original collection(s) on which the analysis is based. Examples of data in analysis collections include segmentations, annotations and further processing of original images. Note that some Original collections include such data, though most of the data in Original collections are original images.

    Data Source

    A data source is a BQ table that contains some part of the IDC metadata complement. API queries are performed against one or more such tables that are joined (in the relational database model sense). Data sources are classified as being of type Original, Derived or Related. Original data sources contain DICOM metadata from the DICOM objects in TCIA Original and TCIA Analysis collections. Derived data sources contain processed data: in general this is analytical data has been processed to enable easier SQL searches. Related data sources contain ancillary data that may be specific to some set of collections. For example, TCGA biospecimen and clinical data are maintained in such tables.

    Data sources are versioned. That is, when the data in a data source changes, a new version of that set of data is defined. An IDC data version is defined in terms of a specific version of each data source. Note that over time, new data sources may be added (or, less likely, removed). Thus two IDC data versions may have a different number of data sources.

    Attribute

    Both the IDC Web App and API expose selected fields in the various data sources against which queries can be performed. Each attribute has a data type, one of:

    • String An attribute with data type String may have an arbitrary string value. For example, the possible values of a StudyDescription attribute are arbitrary. When the values array of a (String attribute, [values]) pair contains a single value, an SQL LIKE operator is used and standard SQL syntax and semantics are supported. Thus a ('StudyDescription",["%SKULL%"]) will match any StudyDescription that contains "SKULL", When the values array of a (String attribute, [values]) pair contains more that one value, an SQL UNNEST operator is used and standard SQL syntax and semantics are supported. See the Google BigQuery documentation for details.

    • Categorical String An attribute with data type Categorical String will have one of a defined set of string values. For example, Modality is an attribute, and has possible values 'CT', 'MR', 'SR', etc. In this case, the values are defined by the DICOM specification. The defined values of other Categorical String attributes may be established by other entities. When the values array of a (Categorical String attribute, [values]) pair contains a single value, an SQL LIKE operator is used and standard SQL syntax and semantics are supported. Thus a ('StudyDescription",["%SKULL%"]) will match any StudyDescription that contains "SKULL", When the values array of a (Categorical String attribute, [values]) pair contains more that one value, an SQL UNNEST operator is used and standard SQL syntax and semantics are supported. See the documentation for details.

    • Continuous Numeric An attribute with data type Continuous Number will have a numeric (float) value. For example, age_at_diagnosis is an attribute of data type Continuous Numeric. In order to enable relative numeric queries, the API exposes 6 variations of each Continuous Numeric attributes as filter set attribute names. These variations are the base attribute name with no suffix, as well as the base attribute name with one of the suffixes: _gt, _gte, _btw, _btwe, _ebtw, _ebtwe, _lte, _lt. The value array of the _*btw* variations must contain exactly two numeric values, in numeric order (least value first). The value array of the other variations must contain exactly one numeric values. The (attribute, value array) pair for a Continuous Numeric _attribute_ is satisfied according to the suffix as follows:

    • Categorical Numeric An attribute with data type Categorical Numeric has one of a defined set of numeric values. The corresponding value array must have a single numeric value.

    Manifest

    A manifest is a list of access methods and other metadata of the data objects corresponding to the objects in some cohort. There are two types of access methods:

    • GUID

      A GUID is a persistent identifier that can be resolved to a GA4GH DRS object. GUID persistence ensures that the data which the GUID represents can continue to be located and accessed even if it has been moved to a different hosting site. A GUID identifies a particular version of an IDC data object, and there is a GUID for every version of every DICOM instance and series in IDC hosted data. GUIDs are issued by the NCI Cancer Research Data Commons. This is a typical CRDC GUID: dg.4DFC/83fdfb25-ad87-4879-b0f3-b9850ef0b216 A GUID can be resolved at https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/ by appending the UUID to the that URL. E.G. (formatting added to the curl response for clarity):

      >> curl https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/bd68332e-521f-4c45-9a88-e9cc426f5a8d

      { "access_methods":[{ "access_id":"gs", "access_url":{ "url":"gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm" }, "region":"", "type":"gs" } ], "aliases":[ ], "checksums":[ { "checksum":"9a63c81a4b3b4bc3950678a4e9acc930", "type":"md5" } ], "contents":[ ], "created_time":"2021-08-27T21:15:02.385181", "description":null, "form":"object", "id":"dg.4DFC/bd68332e-521f-4c45-9a88-e9cc426f5a8d", "mime_type":"application/json", "name":"", "self_uri":"drs://nci-crdc.datacommons.io/dg.4DFC/bd68332e-521f-4c45-9a88-e9cc426f5a8d", "size":528622, "updated_time":"2021-08-27T21:15:02.385185", "version":"faf7385b" }

      Resolving such a GUID returns a DrsObject. The access methods in the returned DrsObject include one or more URLs at which corresponding DICOM entities can be accessed. GUID manifests are recommended for long term archival and reference. In the above, we can see that the returned DrsObject includes the GCS URL gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm.

    • URL The URLs in a URL based manifest can be used to directly access a DICOM instance in Google Cloud Storage. URLs are structured as follows: gs://<GCS bucket>/<GUID>.dcm

      This is a typical URL: gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm

      Though rare, the URL of some object can change over time. In such a case, the corresponding DRSObject will be updated with new URL. However, the original URL will then be "stale".

    • Additional values can optionally be included in the returned manifest. See the manifest API descriptions for more details.

    IDC API UI

    The IDC API UI can be used to see details about the syntax for each call, and also provides an interface to test requests.

    Make a Request

    For a quick demonstration of the syntax of an API call, test the GET/collections request. You can experiment with this endpoint by clicking the ‘Try it out’ button.

    The API will return collection metadata for the current IDC data version. The request can be run by selecting ‘Execute’.

    Request Response

    The Swagger UI submits the request and shows the curl code that was submitted. The ‘Response body’ section will display the response to the request. The expected format of the response to this API request is shown below:

    The actual JSON formatted response can be downloaded by selecting the ‘Download’ button.

    The syntax for all of API data structures is detailed at the bottom of the UI page.

    subjects
    Wikipedia

    Portal release notes

    The version of the portal is shown at the bottom of the portal page. The semantics of the version is the following:

    canceridc.<date of webapp deployment in YYYYMMDDHHMM>.<first 6 characters of the commit hash>,

    where revision hash corresponds to that of the .

    Portal release notes are maintained at

    {
      "filters": {
        "collection_id": [
          "TCGA-LUAD",
          "TCGA-KIRC"
        ],
        "Modality": [
          "CT",
          "MR"
        ],
        "race": [
          "WHITE"
        ],
        "age_at_diagnosis_btw": [
          65, 
          75
        ]
      }
    }
    {
      "collections": [
        {
          "cancer_type": "string",
          "collection_id": "string",
          "date_updated": "string",
          "description": "string",
          "doi": "string",
          "image_types": "string",
          "location": "string",
          "species": "string",
          "subject_count": 0,
          "supporting_data": "string",
        }
      ],
      "code": 200
    }
    from idc_index import IDCClient
    
    # Instantiate the client
    idc_client = IDCClient()
    idc_client.fetch_index('sm_index')
    
    
    # Filter the slides
    query = """
    SELECT index.StudyInstanceUID, sm_index.SeriesInstanceUID
    FROM sm_index
    JOIN index ON sm_index.SeriesInstanceUID = index.SeriesInstanceUID
    WHERE Modality = 'SM' AND primaryAnatomicStructure_CodeMeaning = 'Pancreas'
    """
    
    pancreas_slides = idc_client.sql_query(query)
    sample_study_uid = pancreas_slides['StudyInstanceUID'][0]
    sample_series_uid = pancreas_slides['SeriesInstanceUID'][0]
    sample_study_uid, sample_series_uid
    pip install wsidicom  
    pip install ez-wsi-dicomweb
    from dicomweb_client.api import DICOMwebClient  
    from dicomweb_client.ext.gcp.session_utils import create_session_from_gcp_credentials
    from google.colab import auth  
    auth.authenticate_user()
    
    # Create authorized session  
    session = create_session_from_gcp_credentials()
    
    # Set-up a DICOMwebClient using the dicomweb_client library  
    google_dicom_store_url = 'https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v21/dicomWeb'  
    dw_client = DICOMwebClient(  
        url=dicom_store_url,  
        session=session  
    )
    # Set-up a DICOMwebClient using the dicomweb_client library  
    idc_dicom_store_url = 'https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb'
    
    dw_client = DICOMwebClient(url=idc_dicom_store_url)
    import wsidicom  
    import matplotlib.pyplot as plt
    
    wsidicom_client = wsidicom.WsiDicomWebClient(dw_client)  
    slide = wsidicom.WsiDicom.open_web(wsidicom_client,  
        study_uid=sample_study_uid,  
        series_uids=sample_series_uid  
    )  
    print(slide)
    [0]: Pyramid of levels:
    [0]: Level: 0, size: Size(width=171359, height=74498) px, mpp: SizeMm(width=0.2472, height=0.2472) um/px Instances: [0]: default z: 0.0 default path: 1 ImageData <wsidicom.web.wsidicom_web_image_data.WsiDicomWebImageData object at 0x7d0f16444c50>
    [1]: Level: 2, size: Size(width=42839, height=18624) px, mpp: SizeMm(width=0.988817311328, height=0.988817311328) um/px Instances: [0]: default z: 0.0 default path: 1 ImageData <wsidicom.web.wsidicom_web_image_data.WsiDicomWebImageData object at 0x7d0f16445410>
    [2]: Level: 4, size: Size(width=10709, height=4656) px, mpp: SizeMm(width=3.955546250817, height=3.955546250817) um/px Instances: [0]: default z: 0.0 default path: 1 ImageData <wsidicom.web.wsidicom_web_image_data.WsiDicomWebImageData object at 0x7d0f165271d0>
    [3]: Level: 6, size: Size(width=2677, height=1164) px, mpp: SizeMm(width=15.823662607396, height=15.823662607396) um/px Instances: [0]: default z: 0.0 default path: 1 ImageData <wsidicom.web.wsidicom_web_image_data.WsiDicomWebImageData object at 0x7d0f14192750>
    # Access and visualize 500x500px subregion at level 4, starting from pixel (1000,1000)  
    region = slide.read_region(location=(1000, 1000), level=4, size=(500, 500))  
    plt.imshow(region)  
    plt.show()
    from ez_wsi_dicomweb import dicomweb_credential_factory  
    from ez_wsi_dicomweb import dicom_slide  
    from ez_wsi_dicomweb import local_dicom_slide_cache_types  
    from ez_wsi_dicomweb import dicom_web_interface  
    from ez_wsi_dicomweb import patch_generator  
    from ez_wsi_dicomweb import pixel_spacing  
    from ez_wsi_dicomweb.ml_toolkit import dicom_path
    
    from google.colab import auth  
    auth.authenticate_user()
    
    google_dicom_store_url = 'https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v20/dicomWeb'  
    
    series_path_str = (  
          f'{google_dicom_store_url}'  
          f'/studies/{sample_study_uid}'  
          f'/series/{sample_series_uid}'  
    )  
    series_path = dicom_path.FromString(series_path_str)  
    dcf = dicomweb_credential_factory.CredentialFactory()  
    dwi = dicom_web_interface.DicomWebInterface(dcf)
    ds = dicom_slide.DicomSlide(  
        dwi=dwi,  
        path=series_path,  
        enable_client_slide_frame_decompression = True  
    )
    
    # More information: https://github.com/GoogleCloudPlatform/EZ-WSI-DICOMweb/blob/main/ez_wsi_demo.ipynb
    ds.init_slide_frame_cache(  optimization_hint=local_dicom_slide_cache_types.CacheConfigOptimizationHint.MINIMIZE_LATENCY  
    )
    # Investigate existing levels and their dimensions  
    for level in ds.levels:  
        print(f'Level {level.level_index} has pixel dimensions (row, col): {level.height, level.width}')
    Level 1 has pixel dimensions (row, col): (74498, 171359)
    Level 2 has pixel dimensions (row, col): (18624, 42839)
    Level 3 has pixel dimensions (row, col): (4656, 10709)
    Level 4 has pixel dimensions (row, col): (1164, 2677)
    # Access and visualize 500x500px subregion at level 3, starting from pixel (1000,1000)  
    level = ds.get_level_by_index(3)  
    region = ds.get_patch(level=level, x=1000, y=1000, width=500, height=500).image_bytes()  
    plt.imshow(region)  
    plt.show()
      {
        "idc_data_version": "1.0",
        "filters": {
          "collection_id": [
            "TCGA-LUAD",
            "TCGA-KIRC"
          ],
          "Modality": [
            "CT",
            "MR"
          ],
          "race": [
            "WHITE"
          ],
          "age_at_diagnosis_btw": [
            53, 69
          ]
        }
    {
      "collections": [
        {
          "cancer_type": "string",
          "collection_id": "string",
          "date_updated": "string",
          "description": "string",
          "doi": "string",
          "image_types": "string",
          "location": "string",
          "species": "string",
          "subject_count": 0,
          "supporting_data": "string",
        }
      ],
      "code": 200
    }
    is greater than the first value and less than the second value in the value array
  • ebtw: if an attribute is greater than or equal to the first value and less than the second value in the value array

  • btwe: if an attribute is greater than the first value and less than or equal to the second value in the value array

  • ebtwe: if an attribute is greater than or equal to the first value and less than or equal to the second value in the value array

  • lte: If an attribute is less than or equal to the value in the value array

  • lt: If an attribute is less than the value in the value array

  • btw: if an attribute is greater than the first value and less than the second value in the value array
  • ebtw: if an attribute is greater than or equal to the first value and less than the second value in the value array

  • btwe: if an attribute is greater than the first value and less than or equal to the second value in the value array

  • ebtwe: if an attribute is greater than or equal to the first value and less than or equal to the second value in the value array

  • lte: If an attribute is less than or equal to the value in the value array

  • lt: If an attribute is less than the value in the value array

  • <no suffix>: If an attribute is equal to the value in the value array
  • gt: If an attribute is greater than the value in the value array

  • gte: If an attribute is greater than or equal to the value in the value array

  • btw: if an attribute is gt the first value and lt the second value in the value array

  • ebtw: if an attribute is gte the first value and lt the second value in the value array

  • btwe: if an attribute is gt the first value and lte the second value in the value array

  • ebtwe: if an attribute is gte the first value and lte the second value in the value array

  • lte: If an attribute is less than or equal to the value in the value array

  • lt: If an attribute is less than the value in the value array

  • Google BigQuery
    .

    18.0 April 9, 2024 (canceridc.202404091237.841769c)

    • on the Explore Images page the IDC internal id for each collection can now be copied from the Collections table by clicking the corresponding copy icon

    • on the Explore Images page the IDC case id can now be copied from the Selected Cases table by clicking the corresponding copy icon

    17.0 December 19, 2023 (canceridc.202312191019.53f66ba)

    Main highlights of this release include:

    • add a choice of several viewers (OHIF v2, OHIF v3, VolView, Slim) for viewing image files

    16.0 September 15, 2023 (canceridc.202309151254.a6dfd6a)

    Main highlights of this release include:

    • s5cmd file manifests can now be generated from the Explore images page for individual studies and series

    15.0 July 14, 2023 (canceridc.202307141313.c80a691)

    Main highlights of this release include:

    • The file manifest for a filter can be downloaded without logging into the portal and creating a persistent cohort

    14.0 May 3, 2023 (canceridc.202305031458.443ea83)

    Main highlights of this release include:

    • The Export Cohort Manifest popup now includes options to download manifests that can be used by s5cmd to download image files from IDC's s3 buckets in GCP or AWS. Instructions are provided for using s5cmd with these manifests

    13.0 March 7, 2023 (canceridc.202303071044.57def9a)

    Main highlights of this release include:

    • Three new Original Image attributes Max Total Pixel Matrix Columns, Max Total Pixel Matrix Rows, and Min Pixel Spacing are added.

    • Two new Quantitative Analysis attributes Sphericity (Quant) and Volume of Mesh are added.

    • Default attribute value order is changed from alphanumeric (by value name) to value count.

    12.0 - November 2, 2022 (canceridc.202211092039.87ca478)

    Main highlights of this release include:

    • As limited access collections have been removed from IDC, the portal is now simplified by removing the option of selecting different access levels. All collections in the portal are public.

    • A warning message appears on the cohort browser page when a user views a cohort that used the Access filter attribute. That attribute is no longer applied if the user migrates the cohort to the current version.

    • On the explorer page the reset button has been moved to improve viewability.

    11.0 - September 8, 2022 (canceridc.202209081302.acb8ce3)

    This was primarily a data release. There were no significant changes to the portal.

    10.0 - August 3, 2022 (canceridc.202208040944.6c798a2)

    Main highlights of this release include:

    • User control over how selection of multiple filter modalities defines the cohort. Previously when multiple modalities were selected the cohort would include the cases that had ANY of the selected modalities. Now the user can choose if the cohort includes the cases that contain ANY of the selected modaltiies or just those that have ALL of the selected modalities.

    9.0 - May 19, 2022 (canceridc.202205191051)

    Main highlights of this release include:

    • Ability to select specific Analysis Results collections with segmentation and radiomic features

    • Text boxes added to the slider panels to allow the user to input upper and lower slider bounds

    • Pie chart tooltips updated to improve viewability

    8.0 - April 4, 2022 (canceridc.202204050856.2920c81)

    Main highlights of this release include:

    • Eleven new collections added

    • Number of cases, studies, and series in a cohort are reported in the filter de finition

    • On the Exploration page the Access attribute is placed in the Search Scope

    • On the Exploration page users are warned when they create a cohort that includes Limited Access collections

    • Series Instance UID is reported in the Selected Series table

    7.0 - February 7, 2022 (canceridc.202202071117.164252a)

    Main highlights of this release include:

    • The BigQuery query string corresponding to a cohort can now be displayed in user-readable format by pressing a button on either the cohort or cohort list pages

    • On the exploration page collections can now be sorted alphabetically or by the number of cases. Selected cases are ordered at the top of the collection list

    • Table rows can be selected by clicking anywhere within the row, not just on the checkbox

    • The BigQuery export cohort manifest includes the IDC data version as an optional column

    6.0 - January 10, 2022 (canceridc.202201101504.eb0e309)

    Main highlights of this release include:

    • Collections which have limited access are now denoted as such in the Collection tab on the Exploration page

    • Links to image files belonging to limited collections have been removed from the Studies and Series tables on the Exploration page

    • The quota of image file data that can be served per user per day has been reduced from 137 to 40 GB

    5.0 - December 9, 2021 (canceridc.202112091128.eb0e309)

    Main highlights of this release include:

    • New attributes including Manufacturer, Manufacturer Model Name, and Slice Thickness added

    • Checked attribute values are now shown at the top of the attribute value lists

    • Ability to search by CaseID added to the Selected Cases table

    • Ability to search by StudyID added to the Selected Studies table

    • Study Date added to the Studies Table

    • Changed the persistence of the StudyID tooltip in the tables so that the StudyID can be copied from the tooltip

    • Specific columns can now be selected in the BigQuery cohort export

    2.1.0 - August 2021 (canceridc.202108261153.70f59e0)

    The Imaging Data Commons Explore Image Data portal is a platform that allows users to explore, filter, create cohorts, and view image studies and series using cutting-edge technology viewers.

    Main highlights of this release include:

    • Support for slide microscopy series from the CPTAC-LSCC and CPTAC-LUAD collections is now included.

    • The Slim viewer is now configured to view slide microscopy series

    • Search boxes are included for very attribute to search for specific attribute values by name.

    2.0.0 - June 2021 - (canceridc.202106250849.876f912)

    The Imaging Data Commons Explore Image Data portal is a platform that allows users to explore, filter, create cohorts, and view image studies and series using cutting-edge technology viewers.

    Main highlights of this release include:

    • 112 data collections are now included

    • Cohort data version is reported

    • Cohort statistics - ie the number the cases, studies, and series per cohort are now reported

    • Mechanism included to update a version cohort

    • Species Attribute is included

    • Checkbox and plus/minus icons are now used to select table rows

    1.3.0 - March 2021 (canceridc.202103011131.27ce3b3)

    The Imaging Data Commons Explore Image Data portal is a platform that allows users to explore, filter, create cohorts, and view image studies and series using cutting-edge technology viewers.

    Main highlights of this release include:

    • The user details page will no longer return a 500 error when selected

    • Sorting of studies panel is now active for all fields

    • Re-sending of an unreceived verification email is now more clearly explained.

    • IDC identity login header and column selection is disabled for the exportation of a cohort manifest to BigQuery

    • Detailed information panel added to efficiently describe why some pie charts have multiple facets even when a filter is selected

    • Cohort manifest export popup can be scrolled down

    • Use of Shift or Control (Command for Mac) selection of studies will now behave as expected: Shift-select for a contiguous series of rows, Control/Command-select for individual rows.

    • All filter selections are now sorted by alphabetical characters

    1.2.0 - January 2021 (canceridc.202101111506.0a8af57)

    The Imaging Data Commons Explore Image Data portal is a platform that allows users to explore, filter, create cohorts, and view image studies and series using cutting-edge technology viewers.

    Main highlights of this release include:

    • Consistent number of files will be returned between the portal and BigQuery

    • When the user clicks a non-gov link a popup will appear

    • Cohort manifest export information now has clickable URLs to take you to the BigQuery console

    • Collections list displays by default 100 entries

    • Any empty search criteria is now highlighted in grey and no data will be listed

    • The user will no longer need to scroll to see search criteria in the left search configuration panel

    • Portal footer is now in compliance with NCI requirements

    • Check/uncheck in the collections panel added for collection TCGA

    1.1.0 - December 2020 (canceridc.202012091728.674fff0)

    The Imaging Data Commons Explore Image Data portal is a platform that allows users to explore, filter, create cohorts, and view image studies and series using cutting-edge technology viewers.

    Main highlights of this release include:

    • Case-level table is added to the portal

    • Cohorts can now be exported into BigQuery tables using the Export Cohort Manifest button

    • Cohorts less than 650k rows can now be downloaded as a multipart file. Cohorts larger that 600k rows can only be be exported to BigQuery (for users that are logged in with Google Accounts)

    • Quantitative filter ranges are updated dynamically with the updates to filter selection

    • Pie charts will display "No data available" message when zero cases are returned for the given filter selection

    • RTPLAN and Real World Mapping Attribute values are now disabled at the series level, since they cannot be visualized in the IDC Viewer

    • Various bug fixes in both the IDC Portal and IDC Viewer

    1.0.0 - October 2020 (canceridc.202010190226.4e8597)

    The Imaging Data Commons Explore Image Data portal is a platform that allows users to explore, filter, create cohorts, and view image studies and series using cutting-edge technology viewers.

    Main features in this initial release include:

    • The ability to search for data in BigQuery and Solr

    • The ability to search by multiple attributes:

      • Collection

      • Original attributes e.g., Modality

      • Derived attributes e.g., Segmentations

      • Qualitative analysis e.g., Lobular pattern

      • Quantitative analysis e.g., Volume

      • related attributes e.g., Country

    • Display of collections results in a tabular format with the following information:

      • Collection Name

      • Total Number of Cases

      • Number of Cases(this cohort)

    • Display of the Selected Studies results in tabular format with the following information:

      • Project Name

      • Case ID

      • Study ID

    • Display of the Selected Series results in tabular format with the following information:

      • Study ID

      • Series Number

      • Modality

    • The ability to hide attributes with zero cases present

    • The ability to save cohorts

    • The ability to download the manifest of any cohort created

    • The ability to promote, filter, and load multiple series instances in the OHIF viewer

    IDC WebApp repo
    https://github.com/ImagingDataCommons/IDC-WebApp/releases

    Study Description

    Body Part Examined

  • Series Description

  • BigQuery tables

    BQ tables are organized in BQ datasets. BQ datasets are not unlike folders on your computer, but contain tables related to each other instead of files. BQ datasets, in turn, are organized under Google Cloud projects. GCP projects can be thought of as containers that are managed by a particular organization. To continue with the file system analogy, think about projects as hard drives that contain folders.

    This may be a good time for you to complete Part 1 of the IDC "Getting started" tutorial series, so that you are able to open the tables and datasets we will be discussing in the following paragraphs!

    Let's map the aforementioned project-dataset-table hierarchy to the concrete locations that contain IDC data.

    IDC BigQuery datasets

    All of the IDC tables are stored under the bigquery-public-data project. That project is managed by Google Public Datasets Program, and contains many public BQ datasets, beyond those maintained by IDC.

    All of the IDC tables are organized into datasets by data release version. If you complete the tutorial mentioned above, open the BQ console, and scroll down the list of datasets, you will find those that are named starting with the idc_v prefix - those are IDC datasets.

    Following the prefix, you will find the number that corresponds to the IDC data release version. IDC data releases version numbers start from 1 and are incremented by one for each subsequent release. As of writing this, the most recent version of IDC is 16, and you can find dataset idc_v16 corresponding to this version.

    In addition to idc_v16 you will find a dataset named idc_v16_clinical. That dataset contains clinical data accompanying IDC collections. We started clinical data ingestion in IDC v11. If you want to learn more about the organization and searching of clinical data, take a look at the .

    Finally, you will also see two special datasets: idc_current and idc_current_clinical. Those two datasets are essentially aliases, or links, to the versioned datasets corresponding to the latest release of IDC data.

    If you want to explore the latest content of IDC - use current datasets.

    If you want to make sure your queries and data selection are reproducible - always use the version numbered datasets!

    IDC BigQuery tables

    Before we dive into discussing the individual tables maintained by IDC, there is just one more BigQuery-specific concept you need to learn: the view. BigQuery view is a table that is defined by an SQL query that is run every time you query the view (you can read more about BQ views in ).

    BQ views can be very handy when you want to simplify your queries by factoring out the part of the query that is often reused. But a key disadvantage of BQ views over tables is the reduced performance and increased cost due to re-running the query each time you query the view.

    As we will discuss further, most of the tables maintained by IDC are created by joining and/or post-processing other tables. Because of this we rely heavily on BQ views to improve transparency of the provenance of those "derived" tables. BQ views can be easily distinguished from the tables in a given dataset by a different icon. IDC datasets also follow a convention that all views in the versioned datasets include suffix _view in the name, and are accompanied by the result of running the query used by the view in a table that has the same name sans the _view suffix. See the figure below for an illustration of this convention.

    If you are ever curious (and you should be, at least once in a while!) about the queries behind individual views, you can click on the view in the BQ console, and see the query in the "Details" tab. Try this out yourself to check the query for

    Now that we reviewed the main concepts behind IDC tables organization, it is time to explain the sources of metadata contained in those tables. Leaving _clinical datasets aside, IDC tables are populated from one of the two sources:

    • DICOM metadata extracted from the DICOM files hosted by IDC, and various derivative tables that simplify access to specific DICOM metadata items;

    • collection-level and auxiliary metadata, which is not stored in DICOM tags, but is either received by IDC from other sources, or is populated by IDC as part of data curation (these include Digital Object Identifiers, description of the collections, hashsums, etc).

    The set of BQ tables and views has grown over time. The enumeration below documents the BQ tables and views as of IDC v14. Some of these tables will not be found in earlier IDC BigQuery datasets.

    dicom_metadata

    Table in BigQuery:

    Each row in the dicom_metadata table holds the DICOM metadata of an instance in the corresponding IDC version. There is a single row for each DICOM instance in the corresponding IDC version, and the columns correspond to the DICOM attributes encountered in the data across all of the ingested instances.

    IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions of how DICOM attributes of various types are converted into BQ form are covered in the Google Healthcare API documentation article.

    dicom_metadata table contains DICOM metadata extract from the files included in the given IDC data release. The amount and variety of the DICOM files grows with the new releases, and the schema of this table reflects the organization of the metadata in each IDC release. Non-sequence attributes, such as Modality or SeriesInstanceUID, once encountered in any one file will result in the corresponding column being introduced to the table schema (i.e., if we have column X in IDC release 11, in all likelihood it will also be present in all of the subsequent releases).

    Sequence DICOM attributes, however, may have content that is highly variable across different DICOM instances (especially in Structured Reports). Those attributes will map to , and it is not unusual to see drastic differences in the corresponding columns of the table between different releases.

    dicom_metadata can be used to conduct detailed explorations of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal. Note that the dicom_all table, described below, is probably a better choice for such explorations.

    Due to the existing limitations of Google Healthcare API, not all of the DICOM attributes are extracted and are available in BigQuery tables. Specifically:

    • sequences that have more than 15 levels of nesting are not extracted (see ) - we believe this limitation does not affect the data stored in IDC

    • sequences that contain around 1MiB of data are dropped from BigQuery export and RetrieveMetadata output currently. 1MiB is not an exact limit, but it can be used as a rough estimate of whether or not the API will drop the tag (this limitation was not documented as of writing this) - we know that some of the instances in IDC will be affected by this limitation. The fix for this limitation is targeted for sometime in 2021, according to the communication with Google Healthcare support.

    auxiliary_metadata

    Table in BigQuery:

    This table defines the contents of the corresponding IDC version. There is a row for each instance in the version. We group the attributes for convenience:

    Collection attributes:

    • tcia_api_collection_id: The ID, as accepted by the TCIA API, of the original data collection containing this instance (will be Null for collections not sourced from TCIA)

    • idc_webapp_collection_id: The ID, as accepted by the IDC web app, of the original data collection containing this instance

    • collection_id: The ID, as accepted by the IDC web app. Duplicate of idc_webapp_collection_id

    Patient attributes:

    • submitter_case_id:The Patient ID assigned by the submitter of this data. This is the same as the DICOM PatientID

    • idc_case_id:IDC generated UUID that uniquely identifies the patient containing this instance

      This is needed because DICOM PatientIDs are not required to be globally unique

    • patient_hash: md5 hash of this version of the patient/case containing this instance

    Study attributes:

    • StudyInstanceUID: DICOM UID of the study containing this instance

    • study_uuid:IDC assigned UUID that identifies a version the the study containing this instance.

    • study_instances: The number instances in the study containing this instance

    Series attributes:

    • SeriesInstanceUID: DICOM UID of the series containing this instance

    • series_uuid:IDC assigned UUID that identifies the version of the series containing this instance

    • source_doi:The DOI of an information page corresponding to the original data collection or analysis results that is the source of this instance

    Instance attributes:

    • SOPInstanceUID: DICOM UID of this instance.

    • instance_uuid:IDC assigned UUID that identifies the version of this instance.

    • gcs_url: The GCS URL of a file containing the version of this instance that is identified by this series_uuid/instance_uuid

    mutable_metadata

    Table in BigQuery:

    Some non-DICOM metadata may change over time. This includes the GCS and AWS URLs of instance data, the accessibility of each instance and the URL of an instance's associated description page. BigQuery metadata tables such as the auxiliary_metadata and dicom_all tables are never revised even when such metadata changes. However, tables in the datasets of previous IDC versions can be joined with the mutable_metadata table to obtain the current values of these mutable attributes.

    The table has one row for each version of each instances:

    • crdc_instance_uuid: The uuid of an instance version

    • crdc_series_uuid: The uuid of a series version that contains this instance version

    • crdc_study_uuid: The uuid of a study version that contains the series version

    original_collections_metadata

    Table in BigQuery:

    This table is comprised of IDC data collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in . One row per collection:

    • tcia_api_collection_id: The collection ID as is accepted by the TCIA AP

    • tcia_wiki_collection_id: The collection ID as on the TCIA wiki page

    • idc_webapp_collection_id:The collection ID as accepted by the IDC web app

    analysis_results_metadata

    Table in BigQuery:

    Metadata for the TCIA analysis results hosted by IDC, for the most part corresponding to the content available in . One row per analysis result:

    • ID: Results ID

    • Title: Descriptive title

    • DOI:DOI that can be resolved at doi.org to the TCIA wiki page for this analysis result

    version_metadata

    Table in BigQuery:

    Metadata for each IDC version, one row per version:

    • idc_version: IDC version number

    • version_hash: MD5 hash of hashes of collections in this version

    • version_timestamp: Version creation timestamp

    The following tables and views consist of metadata derived from one or more other IDC tables tables for convenience of the user. For each such table, <table_name>, there is also a corresponding view, <table_name>_view, that, when queried, generates an equivalent table. These views are intended as a reference; each view's SQL is available to be used for further investigation.

    Several of these tables/views are discussed more completely .

    dicom_all, dicom_all_view

    Table in BigQuery:

    Table in BigQuery:

    All columns from dicom_metadata together with selected date from the auxiliary_metadata, original_collections_metadata, and analysis_results_metadata tables.

    segmentations, segmentations_view

    Table in BigQuery:

    Table in BigQuery:

    This table is derived from dicom_all to simplify access to the attributes of DICOM Segmentation objects available in IDC. Each row in this table corresponds to one DICOM Segmentation instance segment.

    measurement_groups, measurement_groups_view

    Table in BigQuery:

    Table in BigQuery:

    This table is derived from dicom_all to simplify access to the measurement groups encoded in DICOM Structured Report TID 1500 objects available in IDC. Specifically, this table contains measurement groups corresponding to the "Measurement group" content item in the DICOM SR objects.

    Each row corresponds to one TID1500 measurement group.

    qualitative_measurements, qualitative_measurements_view

    Table in BigQuery:

    Table in BigQuery:

    This table is derived from dicom_all to simplify access to the qualitative measurements in DICOM SR TID1500 objects. It contains coded evaluation results extracted from the DICOM SR TID1500 objects. Each row in this table corresponds to a single qualitative measurement extracted.

    quantitative_measurements, quantitative_measurements_view

    Table in BigQuery:

    Table in BigQuery:

    This table is derived from dicom_all to simplify access to the quantitative measurements in DICOM SR TID1500 objects. It contains quantitative evaluation results extracted from the DICOM SR TID1500 objects. Each row in this table corresponds to a single quantitative measurement extracted.

    dicom_metadata_curated, dicom_metadata_curated_view

    Table in BigQuery:

    Table in BigQuery:

    Curated values of DICOM metadata extracted from dicom_metadata.

    dicom_metadata_curated_series_level, dicom_metadata_curated_series_level_view

    Table in BigQuery:

    Table in BigQuery:

    Curated columns from dicom_metadata that have been aggregated/cleaned up to describe content at the series level. Each row in this table corresponds to a DICOM instance in IDC. The columns are curated by defining queries that apply transformations to the original values of DICOM attributes.

    idc_pivot_v<idc version>

    A view that is the basis for the queries performed by the IDC web app.

    Collection-specific BigQuery tables

    Most clinical data is found in the . However, a few tables of clinical data are found in the idc_v<idc_version> datasets.

    TCGA

    The following tables contain TCGA-specific metadata:

    • tcga_biospecimen_rel9: biospecimen metadata

    • tcga_clinical_rel9: clinical metadata

    NLST

    IDC hosts a subset of the NLST clinical data, which was cleared for public sharing. If you need the full clinical data, please visit the .

    The following tables contain NLST specific metadata. The detailed schema of those tables is available from the .

    • nlst_canc: "Lung Cancer"

    • nlst_ctab: "SCT Abnormalities"

    • nlst_ctabc: "SCT Comparison Abnormalities"

    Directly loading DICOM objects from Google Cloud or AWS in Python

    DICOM files in the IDC are stored as "blobs" on the cloud, with one copy housed on Google Cloud Storage (GCS) and another on Amazon Web Services (AWS) S3 storage. By using the right tools, these blobs can be wrapped to appear as "file-like" objects to Python DICOM libraries, enabling intelligent loading of DICOM files directly from cloud storage as if they were local files without having to first download them onto a local drive.

    Code snippets included in this article are also replicated in this Google Colab tutorial notebook for you convenience:

    collection_timestamp: Datetime when the IDC data in the collection was last revised

  • collection_hash: md5 hash of the of this version of the collection containing this instance

  • collection_init_idc_version: The IDC version in which the collection containing this instance first appeared

  • collection_revised_idc_version: The IDC version in which this version of the collection containing this instance first appeared

  • patient_init_idc_version: The IDC version in which the patient containing this instance first appeared

  • patient_revised_idc_version: The IDC version in which this version of the patient/case containing this instance first appeared

  • study_hash: md5 hash of the data in this version of the study containing this instance

  • study_init_idc_version: The IDC version in which the study containing this instance first appeared

  • study_revised_idc_version: The IDC version in which this version of the study containing this instance first appeared

  • source_url:The URL of an information page that describes the original collection or analysis result that is the source of this instance

  • series_instances: The number of instances in the series containing this instance

  • series_hash: md5 hash of the data in the this version of the series containing this instance

  • access: Collection access status: 'Public' or 'Limited'. (Currently all data is 'Public')

  • series_init_idc_version: The IDC version in which the series containing this instance first appeared

  • series_revised_idc_version: The IDC version in which this version of the series containing this instance first appeared

  • aws_url: The AWS URL of a file containing the version of this instance that is identified by this series_uuid/instance_uuid

  • instance_hash: the md5 hash of this version of this instance

  • instance_size: the size, in bytes, of this version of this instance

  • instance_init_idc_version: The IDC version in which this instance first appeared

  • instance_revised_idc_version: The IDC version in which this version of this instance first appeared

  • license_url: The URL of a web page that describes the license governing this version of this instance

  • license_long_name: A long form name of the license governing this version of this instance

  • license_short_name: A short form name of the license governing this version of this instance

  • gcs_url: URL to the Google Cloud Storage (GCS) object containing this instance version
  • aws_url: URL to the Amazon Web Services (AWS) object containing this instance version

  • `access: Current access status of this instance (Public or Limited)

  • source_url: The URL of a page that describes the original collection or analysis result that includes this instance

  • source_doi: The DOI of a page that describes the original collection or analysis result that includes this instance

  • Program: The program to which this collection belongs
  • Updated: Most recent update date reported by the collection source

  • Status:Collection status: "Ongoing" or "Complete"

  • Access:Collection access conditions: "Limited" or "Public"

  • ImageType: Enumeration of image types/modalities in the collection

  • Subjects:Number of subjects in the collection

  • DOI:DOI that can be resolved at doi.org to the TCIA wiki page for this collection

  • URL:URL of an information page for this collection

  • CancerType:Collection source(s) assigned cancer type of this collection

  • SupportingData:Type(s) of additional data available

  • Species: Species of collection subjects

  • Location:Body location that was studied

  • Description: Description of the collection (HTML format)

  • license_url: The URL of a web page that describes the license governing this collection

  • license_long_name: A long form name of the license governing this collection

  • license_short_name: A short form name of the license governing this collection

  • CancerType:TCIA assigned cancer type of this analysis result
  • Location:Body location that was studied

  • Subjects:Number of subjects in the analysis result

  • Collections: Original collections studied

  • AnalysisArtifactsonTCIA: Type(s) of analysis artifacts generated

  • Updated: Data when results were last updated

  • license_url: The URL of a web page that describes the license governing this collection

  • license_long_name: A long form name of the license governing this collection

  • license_short_name: A short form name of the license governing this collection

  • description: Description of analysis result

  • nlst_prsn: "Participant"
  • nlst_screen: "SCT Screening"

  • clinical data documentation
    this article
    dicom_all_view
    dicom_metadata
    Understanding the BigQuery DICOM schema
    STRUCT BQ SQL type
    https://cloud.google.com/bigquery/docs/nested-repeated
    auxiliary_metadata
    mutable_metadata
    original_collections_metadata
    this table at TCIA
    analysis_results_metadata
    this table at TCIA
    version_metadata
    here
    dicom_all
    dicom_all_view
    segmentations
    segmentations_view
    measurement_groups
    measurement_groups_view
    TID 1500 Measurement report
    qualitative_measurements
    qualitative_measurements_view
    quantitative_measurements
    quantitative_measurements_view
    dicom_metadata_curated
    dicom_metadata_curated_view
    dicom_metadata_curated_series_level
    dicom_metadata_curated_series_level_view
    idc_v<idc_version>_clinical datasets
    Cancer Data Access System (CDAS) system
    TCIA NLST collection page
    Some of the BigQuery datasets curated by IDC.
    In this example, dicom_all_view is a BQ view, as indicated by the icon to the left from the table name. dicom_all table is the result of running the query that defines the dicom_all_view.
    Yes, you can view the query of a view!
    Reading files with Pydicom

    Pydicom is popular library for working with DICOM files in Python. Its dcmread function is able to accept any "file-like" object, meaning you can read a file straight from a cloud blob if you know its path. See this page for information on finding the paths of the blobs for DICOM objects in IDC. The dcmread function also has some other options that allow you to control what is read. For example you can choose to read only the metadata and not the pixel data, or read only certain attributes. In the following two sections, we demonstrate these abilities using first Google Cloud Storage blobs and then AWS S3 blobs.

    Mapping IDC DICOM series to bucket URLs

    All of the image data available from IDC is replicated between public Google Cloud Storage (GCS) and AWS buckets. pip-installable idc-index package provides convenience functions to get URLs of the files corresponding to a given DICOM series.

    From Google Cloud Storage blobs

    The official Python SDK for Google Cloud Storage (installable from pip and PyPI as google-cloud-storage) provides a "file-like" interface allowing other Python libraries, such as Pydicom, to work with blobs as if they were "normal" files on the local filesystem.

    To read from a GCS blob with Pydicom, first create a storage client and blob object, representing a remote blob object stored on the cloud, then simply use the .open('rb') method to create a readable file-like object that can be passed to the dcmread function.

    Reading only metadata or only specific attributes will reduce the amount of data that needs to be pulled down under some circumstances and therefore make the loading process faster. This depends on the size of the attributes being retrieved, the chunk_size (a parameter of the open() method that controls how much data is pulled in each HTTP request to the server), and the position of the requested element within the file (since it is necessary to seek through the file until the requested attributes are found, but any data after the requested attributes need not be pulled). If you are not retrieving entire images, we strongly recommend specifying a chunk_size (in bytes) because the default value is around 40MB, which is typically far larger than the optimal value for accessing metadata attributes or individual frames (see later).

    This works because running the open method on a Blob object returns a BlobReader object, which has a "file-like" interface (specifically the seek, read, and tell methods).

    From AWS S3 blobs

    The s3fs package provides "file-like" interface for accessing S3 blobs. It can be installed with pip install s3fs. The following example repeats the above example using the counterpart of the same blob on AWS S3.

    Similar to the chunk_size parameter in GCS, the default_block_size is crucially important for determining how efficient this is. Its default value is around 50MB, which will result in orders of magnitude more data than necessary being pulled than is needed to retrieve metadata. In the above example, we set it to 50kB.

    Frame-level access with Highdicom

    Highdicom is a higher-level library providing several features to work with images and image-derived DICOM objects. As of the release 0.25.1, its various reading methods (including imread, segread, annread, and srread) can read any file-like object, including Google Cloud blobs and S3 blobs opened with s3fs.

    A particularly useful feature when working with blobs is "lazy" frame retrieval for images and segmentations. This downloads only the image metadata when the file is initially loaded, uses it to create a frame-level index, and downloads specific frames as and when they are requested by the user. This is especially useful for large multiframe files (such as those found in slide microscopy or multi-segment binary or fractional segmentations) as it can significantly reduce the amount of data that needs to be downloaded to access a subset of the frames.

    In this first example, we use lazy frame retrieval to load only a specific spatial patch from a large whole slide image from the IDC using GCS.

    Running this code should produce an output that looks like this:

    Screenshot of slide region

    The next example repeats this on the same image in AWS S3:

    In both cases, we set the chunk_size/default_block_size to around 500kB, which should be enough to ensure each frame can be retrieved in a single request while minimizing further unnecessary data retrieval.

    As a further example, we use lazy frame retrieval to load only a specific set of segments from a large multi-organ segmentation of a CT image in the IDC stored in binary format (in binary segmentations, each segment is stored using a separate set of frames) using GCS.

    See this page for more information on highdicom's Image class, and this page for the Segmentation class.

    The importance of offset tables for slide microscopy (SM) images

    Achieving good performance for the Slide Microscopy frame-level retrievals requires the presence of either a "Basic Offset Table" or "Extended Offset Table" in the file. These tables specify the starting positions of each frame within the file's byte stream. Without an offset table being present, libraries such as highdicom have to parse through the pixel data to find markers that tell it where frame boundaries are, which involves pulling down significantly more data and is therefore very slow. This mostly eliminates the potential speed benefits of frame-level retrieval. Unfortunately there is no simple way to know whether a file has an offset table without downloading the pixel data and checking it. If you find that an image takes a long time to load initially, it is probably because highdicom is constucting the offset table itself because it wasn't included in the file.

    Most IDC images do include an offset table, but some of the older pathology slide images do not. This page contains some notes about whether individual collections include offset tables.

    You can also check whether an image file (including pixel data) has an offset table using pydicom like this:

    To do this from a remote Google Cloud Storage blob without needing to pull all the pixel data, you can do something like this:

    https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/advanced_topics/gcs_aws_direct_access.ipynb

    Files and metadata

    Limited access content

    As discussed in this community forum post, TCIA made the decision to pull a subset of data from public access collections to limited access. At the moment, we still keep those files that used to be public in IDC before the decision made by TCIA, and the metadata for those files is still accessible in our BigQuery tables, but you cannot download those “Limited” access files referenced by gcs_url from IDC.

    As discussed in this post the issue will manifest itself in an error accessing gcs_url that corresponds to a non-public file:

    has a column named access , which takes values Public or Limited that define if the file corresponding to the instance can be accessed. For all practical purposes, if you interact with the IDC BigQuery tables, you should make sure you exclude “Limited” access items using the following clause in your query:

    In a future release of IDC we will by default exclude limited access items from what you select in the portal, so the portal selection should be more intuitive. But if you access the data via BigQuery queries you will need to know that “Limited” are not accessible and account for this in your query.Storage Buckets

    Storage Buckets are basic containers in Google Cloud that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage documentation

    All IDC DICOM file data for all IDC data versions and all of the are maintained in Google Cloud Storage (GCS). Currently all DICOM files are maintained in GCS buckets that allow for free egress within or out of the cloud, enabled through the partnership of IDC with .

    The object namespace is flat, where every object name is composed of a standard format CRDC UUID and with the ".dcm" file extension, e.g. 905c82fd-b1b7-4610-8808-b0c8466b4dee.dcm. For example, that instance can be accessed using as gs://idc-open/905c82fd-b1b7-4610-8808-b0c8466b4dee.dcm

    You can read about accessing GCP storage buckets from a Compute VM .

    Egress of IDC data out of the cloud is free, since IDC data is participating in !

    Typically, the user would not interact with the storage buckets to select and copy files (unless the intent is to copy the entire content hosted by IDC). Instead, one should use either the IDC Portal or IDC BigQuery tables containing file metadata, to identify items of interest and define a cohort. The cohort manifest generated by the IDC Portal can include both the Google Storage URLs for the corresponding files in the bucket, and the , which can be resolved to the Google Storage URLs to access the files.

    Assuming you have a list of GCS URLs in a file gcs_paths.txt, you can download the corresponding items using the command below, substituting $PROJECT_ID with the valid GCP Project ID (see the complete example in ):

    BigQuery Tables

    Google is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using queries.

    The flat address space of IDC DICOM objects in GCS storage is accompanied by BigQuery tables that allow the researcher to reconstruct the DICOM hierarchy as it exists for any given version. There are also several BQ tables and views in which we keep copies of the metadata exposed via the TCIA interface at the time a version was captured and other pertinent information.

    There is an instance of each of the following tables and views per IDC version. The set of tables and views corresponding to an IDC version are collected in a single BQ dataset per IDC version, bigquery-public-data.idc_<idc_version_number> where bigquery-public-data is the project in which the dataset is hosted. As an example, the BQ tables for IDC version 4 are in the bigquery-public-data.idc_v4dataset.

    In addition to the per-version datasets, the bigquery-public-data.idc-current dataset consists of a set of BQ views. There is a view for each table or view in the BQ data set corresponding to the current IDC release. Each such view in bigquery-public-data.idc-current is named identically to some table or view in the bigquery-public-data.idc_<idc_version_number> dataset of the current IDC release and can be used to access that table or view.

    Several Google BigQuery (BQ) tables support searches against metadata extracted from the data files. Additional BQ tables define the composition of each IDC data version.

    We maintain several additional tables that curate metadata non-DICOM metadata (e.g., attribution of a given item to a specific collection and DOI, collection-level metadata, etc).

    • bigquery-public-data.idc_v<idc_version_number>.auxiliary_metadata (also available via the view.) This table defines the contents of the corresponding IDC version. There is a row for each instance in the version. Collection attributes:

      • tcia_api_collection_id: The ID, as accepted by the TCIA API, of the original data collection containing this instance

      • idc_webapp_collection_id:

    Due to the existing limitations of Google Healthcare API, not all of the DICOM attributes are extracted and are available in BigQuery tables. Specifically:

    • sequences that have more than 15 levels of nesting are not extracted (see ) - we believe this limitation does not affect the data stored in IDC

    • bigquery-public-data.idc_v<idc_version_number>.original_collections_metadata (also available via the view) This table is comprised of IDC data Collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in . One row per collection:

      • tcia_api_collection_id: The collection ID as is accepted by the TCIA AP

    The following (virtual tables defined by queries) extract specific subsets of metadata, or combine attributes across different tables, for convenience of the users

    • bigquery-public-data.idc_v<idc_version_number>.dicom_all (also available via view for the current version of IDC data) DICOM metadata together with selected auxiliary and collection metadata

    • bigquery-public-data.idc_v<idc_version_number>.segmentations (also available via view for the current version of IDC data) Attributes of the segments stored in DICOM Segmentation objects

    • bigquery-public-data.idc_v<idc_version_number>.measurement_groups

    The following tables contain TCGA-specific metadata:

    • tcga_biospecimen_rel9: biospecimen metadata

    • tcga_clinical_rel9: clinical metadata

    Collection-specific BigQuery tables

    Some of the collections are accompanied by BigQuery tables that have not been harmonized to a single data model. Those tables are available within the BigQuery dataset corresponding to a given release, and will have the name prefix corresponding to the short name of the collection. The list below discusses those collection-specific tables.

    NLST

    IDC hosts a subset of the NLST clinical data, which was cleared for public sharing. If you need the full clinical data, please visit the .

    The following tables contain NLST specific metadata. The detailed schema of those tables is available from the .

    • ``: "Lung Cancer"

    • ``: "SCT Abnormalities"

    • ``: "SCT Comparison Abnormalities"

    • ``

    DICOM Stores

    IDC utilizes a single Google Healthcare DICOM store to host all of the instances in the current IDC version. That store, however, is primarily intended to support visualization of the data using OHIF Viewer. At this time, we do not support access of the hosted data via DICOMWeb interface by the IDC users. See more details in the , and please comment about your use case if you have a need to access data via the DICOMweb interface.

    BigQuery tables external to IDC

    In addition to the DICOM data, some of the image-related data hosted by IDC is stored in additional tables. These include the following:

    • BigQuery TCGA clinical data: . Note that this table is hosted under the ISB-CGC Google project, as documented , and its location may change in the future!

    from idc_index import IDCClient
    
    
    # Create IDCClient for looking up bucket URLs
    idc_client = IDCClient()
    
    # Get the list of GCS file URLs in Google bucket from SeriesInstanceUID
    gcs_file_urls = idc_client.get_series_file_URLs(
        seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
        source_bucket_location="gcs",
    )
    
    # Get the list of AWS file URLs in AWS bucket from SeriesInstanceUID
    aws_file_urls = idc_client.get_series_file_URLs(
        seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
        source_bucket_location="aws",
    )
    from pydicom import dcmread
    from pydicom.datadict import keyword_dict
    from google.cloud import storage
    from idc_index import IDCClient
    
    
    # Create IDCClient for looking up bucket URLs
    idc_client = IDCClient()
    
    # Create a client and bucket object representing the IDC public data bucket
    gcs_client = storage.Client.create_anonymous_client()
    
    # This example uses a CT series in the IDC.
    # get the list of file URLs in Google bucket from the SeriesInstanceUID
    file_urls = idc_client.get_series_file_URLs(
        seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
        source_bucket_location="gcs",
    )
    
    # URLs will look like this:
    # s3://idc-open-data/668029cf-41bf-4644-b68a-46b8fa99c3bc/f4fe9671-0a99-4b6d-9641-d441f13620d4.dcm
    (_, _, bucket_name, folder_name, file_name) = file_urls[0].split("/")
    blob_key = f"{folder_name}/{file_name}"
    
    # These objects represent the bucket and a single image blob within the bucket
    bucket = gcs_client.bucket(bucket_name)
    blob = bucket.blob(blob_key)
    
    # Read the whole file directly from the blob
    with blob.open("rb") as reader:
        dcm = dcmread(reader)
    
    # Read metadata only (no pixel data)
    with blob.open("rb", chunk_size=5_000) as reader:
        dcm = dcmread(reader, stop_before_pixels=True)
    
    # Read only specific attributes, identified by their tag
    # (here the Manufacturer and ManufacturerModelName attributes)
    with blob.open("rb", chunk_size=5_000) as reader:
        dcm = dcmread(
            reader,
            specific_tags=[keyword_dict['Manufacturer'], keyword_dict['ManufacturerModelName']],
        )
        print(dcm)
    import s3fs
    from pydicom import dcmread
    from pydicom.datadict import keyword_dict
    
    from idc_index import IDCClient
    
    
    # Create IDCClient for looking up bucket URLs
    idc_client = IDCClient()
    
    # This example uses a CT series in the IDC (same as above).
    # Get the list of file URLs in AWS bucket from SeriesInstanceUID
    file_urls = idc_client.get_series_file_URLs(
        seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
        source_bucket_location="aws",
    )
    
    # Configure a client to avoid the need for AWS credentials
    s3_client = s3fs.S3FileSystem(
      anon=True,  # no credentials needed to access public data
      default_block_size=50_000,  # ~50kB data pulled in each request
      use_ssl=False  # disable encryption for a speed boost
    )
    
    with s3_client.open(file_urls[0], 'rb') as reader:
        dcm = dcmread(reader)
    
    # Read metadata only (no pixel data)
    with s3_client.open(file_urls[0], 'rb') as reader:
        dcm = dcmread(reader, stop_before_pixels=True)
    
    # Read only specific attributes, identified by their tag
    # (here the Manufacturer and ManufacturerModelName attributes)
    with s3_client.open(file_urls[0], 'rb') as reader:
        dcm = dcmread(
            reader,
            specific_tags=[keyword_dict['Manufacturer'], keyword_dict['ManufacturerModelName']],
        )
        print(dcm)
    import numpy as np
    import highdicom as hd
    import matplotlib.pyplot as plt
    from google.cloud import storage
    from pydicom import dcmread
    
    from idc_index import IDCClient
    
    # Create IDCClient for looking up bucket URLs
    idc_client = IDCClient()
    
    # Install additional component of idc-index to resolve SM instances to file URLs
    idc_client.fetch_index("sm_instance_index")
    
    # Given SeriesInstanceUID of an SM series, find the instance that corresponds to the
    # highest resolution base layer of the image pyramid
    query = """
    SELECT SOPInstanceUID, TotalPixelMatrixColumns
    FROM sm_instance_index
    WHERE SeriesInstanceUID = '1.3.6.1.4.1.5962.99.1.1900325859.924065538.1719887277027.4.0'
    ORDER BY TotalPixelMatrixColumns DESC
    LIMIT 1
    """
    result = idc_client.sql_query(query)
    
    # Get URL corresponding to the base layer instance in the Google Storage bucket
    base_layer_file_url = idc_client.get_instance_file_URL(
        sopInstanceUID=result.iloc[0]["SOPInstanceUID"],
        source_bucket_location="gcs"
    )
    
    # Create a storage client and use it to access the IDC's public data bucket
    gcs_client = storage.Client.create_anonymous_client()
    
    (_,_, bucket_name, folder_name, file_name) = base_layer_file_url.split("/")
    blob_key = f"{folder_name}/{file_name}"
    
    bucket = gcs_client.bucket(bucket_name)
    base_layer_blob = bucket.blob(blob_key)
    
    # Read directly from the blob object using lazy frame retrieval
    with base_layer_blob.open(mode="rb", chunk_size=500_000) as reader:
        im = hd.imread(reader, lazy_frame_retrieval=True)
    
        # Grab an arbitrary region of tile full pixel matrix
        region = im.get_total_pixel_matrix(
            row_start=15000,
            row_end=15512,
            column_start=17000,
            column_end=17512,
            dtype=np.uint8
        )
    
    # Show the region
    plt.imshow(region)
    plt.show()
    import numpy as np
    import highdicom as hd
    import matplotlib.pyplot as plt
    from pydicom import dcmread
    import s3fs
    
    from idc_index import IDCClient
    
    # Create IDCClient for looking up bucket URLs
    idc_client = IDCClient()
    
    # Install additional component of idc-index to resolve SM instances to file URLs
    idc_client.fetch_index("sm_instance_index")
    
    # Given SeriesInstanceUID of an SM series, find the instance that corresponds to the
    # highest resolution base layer of the image pyramid
    query = """
    SELECT SOPInstanceUID, TotalPixelMatrixColumns
    FROM sm_instance_index
    WHERE SeriesInstanceUID = '1.3.6.1.4.1.5962.99.1.1900325859.924065538.1719887277027.4.0'
    ORDER BY TotalPixelMatrixColumns DESC
    LIMIT 1
    """
    result = idc_client.sql_query(query)
    
    # Get URL corresponding to the base layer instance in the AWS S3 bucket
    base_layer_file_url = idc_client.get_instance_file_URL(
        sopInstanceUID=result.iloc[0]["SOPInstanceUID"],
        source_bucket_location="aws"
    )
    
    # Create a storage client and use it to access the IDC's public data bucket
    # Configure a client to avoid the need for AWS credentials
    s3_client = s3fs.S3FileSystem(
      anon=True,  # no credentials needed to access pubilc data
      default_block_size=500_000,  # ~500kB data pulled in each request
      use_ssl=False  # disable encryption for a speed boost
    )
    
    # Read directly from the blob object using lazy frame retrieval
    with s3_client.open(base_layer_file_url, 'rb') as reader:
        im = hd.imread(reader, lazy_frame_retrieval=True)
    
        # Grab an arbitrary region of tile full pixel matrix
        region = im.get_total_pixel_matrix(
            row_start=15000,
            row_end=15512,
            column_start=17000,
            column_end=17512,
            dtype=np.uint8
        )
    
    # Show the region
    plt.imshow(region)
    plt.show()
    import highdicom as hd
    from google.cloud import storage
    from idc_index import IDCClient
    
    
    # Create IDCClient for looking up bucket URLs
    idc_client = IDCClient()
    
    # Get the file URL corresponding to the segmentation of a CT series
    # containing a large number of different organs - the same one as used in the
    # IDC Portal front page
    file_urls = idc_client.get_series_file_URLs(
        seriesInstanceUID="1.2.276.0.7230010.3.1.3.313263360.15787.1706310178.804490",
        source_bucket_location="gcs"
    )
    
    (_, _, bucket_name, folder_name, file_name) = file_urls[0].split("/")
    
    # Create a storage client and use it to access the IDC's public data package
    gcs_client = storage.Client.create_anonymous_client()
    bucket = gcs_client.bucket(bucket_name)
    
    blob_name = f"{folder_name}/{file_name}"
    blob = bucket.blob(blob_name)
    
    # Open the blob with "segread" using the "lazy frame retrieval" option
    with blob.open(mode="rb", chunk_size=500_000) as reader:
        seg = hd.seg.segread(reader, lazy_frame_retrieval=True)
    
        # Find the segment number corresponding to the liver segment
        selected_segment_numbers = seg.get_segment_numbers(segment_label="Liver")
    
        # Read in the selected segments lazily
        volume = seg.get_volume(
            segment_numbers=selected_segment_numbers,
            combine_segments=True,
        )
    
    # Print dimensions of the liver segment volume
    print(volume.shape)
    import pydicom
    
    
    dcm = pydicom.dcmread("...")  # Any method to read from file/cloud storage
    
    
    if not dcm.file_meta.TransferSyntaxUID.is_encapsulated:
        print(
            "This image does not use an encapsulated (compressed) transfer "
            "syntax, so offset tables are not required."
        )
    else:
        # Check metadata for the extended offset table
        print("Has Extended Offset Table:", "ExtendedOffsetTable" in dcm)
    
        # The start of the PixelData element will be a 4 byte item tag for the offset table,
        # which should always be present. The following 4 bytes gives the length of the offset
        # table. If it is non-zero, the offset table is present
        has_basic_offset_table = dcm.PixelData[4:8] != b'\x00\x00\x00\x00'
        print("Has Basic Offset Table:", has_basic_offset_table)
    
    import os
    from pydicom import dcmread
    from google.cloud import storage
    
    
    def check_offset_table(blob_key: str):
        """Print information on the offset table in an IDC blob."""
        # Create a storage client and use it to access the IDC's public data package
        gcs_client = storage.Client.create_anonymous_client()
    
        # Blob object for the particular file you want to check
        blob = gcs_client.bucket("idc-open-data").blob(blob_key)
    
        # Open the blob object for remote reading with a ~500kB chunk size
        with blob.open(mode="rb", chunk_size=500_000) as reader:
            # Read the file with stop_before_pixels=True, this moves the cursor
            # position to the start of the pixel data attribute
            dcm = dcmread(reader, stop_before_pixels=True)
    
            if not dcm.file_meta.TransferSyntaxUID.is_encapsulated:
                print(
                    "This image does not use an encapsulated (compressed) transfer "
                    "syntax, so offset tables are not required."
                )
            else:
                # The presence of the extended offset table in the loaded metadata can be
                # checked straightforwardly
                has_extended_offset_table = "ExtendedOffsetTable" in dcm
                print("Has Extended Offset Table:", has_extended_offset_table)
    
                # Read the next tag, should be the pixel data tag
                tag = reader.read(4)
                assert tag == b'\xe0\x7f\x10\x00', "Expected pixel data tag"
    
                # Skip over VR (2 bytes), reserved (2 bytes), and pixel data length (4
                # bytes), giving 8 bytes total. Refer to
                # https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_A.4.html#table_A.4-2
                reader.seek(8, os.SEEK_CUR)
    
                # Read the item tag for the offset table item
                item_tag = reader.read(4)
                assert item_tag == b'\xfe\xff\x00\xe0', "Expected item tag"
    
                # Read the 32bit length of the pixel data's basic offset table
                length = reader.read(4)
    
                # If the length of the offset table is non-zero, the offset table exists
                has_basic_offset_table = (length != b'\x00\x00\x00\x00')
                print("Has Basic Offset Table:", has_basic_offset_table)
    
    
    # Example with no offset table (NLST-LSS collection)
    check_offset_table("4a30ffd2-8489-427b-9a83-03f4cf28534d/ad46e1e3-b37c-434b-a67a-5bacbcc608d9.dcm")
    
    # Example with basic offset table (CCDI-MCI collection)
    check_offset_table("763fe058-7d25-4ba7-9b29-fd3d6c41dc4b/210f0529-c767-4795-9acf-bad2f4877427.dcm")
    
    # Example with extended offset table (CMB-MML collection)
    check_offset_table("79f38b50-4df4-4358-9271-f28aeac573d7/23b9272a-34ef-49ca-833f-84329a18c1e4.dcm")
    AccessDeniedException: 403 <user email> does not have storage.objects.list 
    access to the Google Cloud Storage bucket.
    The ID, as accepted by the IDC web app, of the original data collection containing this instance
  • collection_timestamp: Datetime when the IDC data in the collection was last revised

  • source_doi:A DOI of the TCIA wiki page corresponding to the original data collection or analysis results that is the source of this instance

  • collection_hash: The md5 hash of the sorted patient_hashes of all patients in the collection containing this instance

  • collection_init_idc_version: The IDC version in which the collection containing this instance first appeared

  • collection_revised_idc_version: The IDC version in which the collection containing this instance was most recently revised

  • Patient attributes:

    • submitter_case_id:The submitter’s (of data to TCIA) ID of the patient containing this instance. This is the DICOM PatientID

    • idc_case_id:IDC generated UUID that uniquely identifies the patient containing this instance

      This is needed because DICOM PatientIDs are not required to be globally unique

    • patient_hash: the md5 hash of the sorted study_hashes of all studies in the patient containing this instance

    • patient_init_idc_version: The IDC version in which the patient containing this instance first appeared

    • patient_revised_idc_version: The IDC version in which the patient containing this instance was most recently revised

    Study attributes:

    • StudyInstanceUID: DICOM UID of the study containing this instance

    • study_uuid:IDC assigned UUID that identifies a version of the study containing this instance.

    • study_instances: The number instances in the study containing this instance

    • study_hash: the md5 hash of the sorted series_hashes of all series in study containing this instance

    • study_init_idc_version: The IDC version in which the study containing this instance first appeared

    • study_revised_idc_version: The IDC version in which the study containing this instance was most recently revised

    Series attributes:

    • SeriesInstanceUID: DICOM UID of the series containing this instance

    • series_uuid:IDC assigned UUID that identifies a version of the series containing this instance

    • source_doi:A DOI of the TCIA wiki page corresponding to the original data collection or analysis results that is the source of this instance

    • series_instances: The number of instances in the series containing this instance

    • series_hash: the md5 hash of the sorted instance_hashes of all instance in the series containing this instance

    • series_init_idc_version: The IDC version in which the series containing this instance first appeared

    • series_revised_idc_version: The IDC version in which the series containing this instance was most recently revised

    Instance attributes:

    • SOPInstanceUID: DICOM UID of this instance.

    • instance_uuid:IDC assigned UUID that identifies a version of this instance.

    • gcs_url: The GCS URL of a file containing the version of this instance that is identified by the instance_uuid

    • instance_hash: the md5 hash of the version of this instance that is identified by the instance_uuid

    • instance_size: the size, in bytes, of this version of the instance that is identified by the instance_uuid

    • instance_init_idc_version: The IDC version in which this instance first appeared

    • instance_revised_idc_version: The IDC version in which this instance was most recently revised

    • license_url: The URL of a web page that describes the license governing this instance

    • license_long_name: A long form name of the license governing this instance

    • license_short_name: A short form name of the license governing this instance

  • bigquery-public-data.idc_v<idc_version_number>.dicom_metadata (also available via bigquery-public-data.idc_current.dicom_metadata view for the current version of IDC data) DICOM metadata for each instance in the corresponding IDC version. IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions of how DICOM attributes of various types are converted into BQ form are covered in the Understanding the BigQuery DICOM schema Google Healthcare API documentation article. IDC users can access this table to conduct detailed exploration of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal. The schema is too large to document here. Refer to the BQ table and the above referenced documentation.

  • sequences that contain around 1MiB of data are dropped from BigQuery export and RetrieveMetadata output currently. 1MiB is not an exact limit, but it can be used as a rough estimate of whether or not the API will drop the tag (this limitation was not documented as of writing this) - we know that some of the instances in IDC will be affected by this limitation. The fix for this limitation is targeted for sometime in 2021, according to the communication with Google Healthcare support.
    tcia_wiki_collection_id: The collection ID as on the TCIA wiki page
  • idc_webapp_collection_id:The collection ID as accepted by the IDC web app

  • Program: The program to which this collection belongs

  • Updated: Moser recent update date reported by TCIA

  • Status:Collection status" Ongoing or complete

  • Access:Collection access conditions: Limited or Public

  • ImageType: Enumeration of image types/modalities in the collection

  • Subjects:Number of subjects in the collection

  • DOI:DOI that can be resolved at doi.org to the TCIA wiki page for this collection

  • CancerType:TCIA assigned cancer type of this collection

  • SupportingData:Type(s) of additional data available

  • Species: Species of collection subjects

  • Location:Body location that was studied

  • Description:TCIA description of the collection (HTML format)

  • license_url: The URL of a web page that describes the license governing this collection

  • license_long_name: A long form name of the license governing this collection

  • license_short_name: A short form name of the license governing this collection

  • bigquery-public-data.idc_v<idc_version_number>.analysis_results_metadata (also available via thebigquery-public-data.idc_current.analysis_results_metadata view for the current version of IDC data) Metadata for the TCIA analysis results hosted by IDC, for the most part corresponding to the content available in this table at TCIA. One row per analysis result:

    • ID: Results ID

    • Title: Descriptive title

    • DOI:DOI that can be resolved at doi.org to the TCIA wiki page for this analysis result

    • CancerType:TCIA assigned cancer type of this analysis result

    • Location:Body location that was studied

    • Subjects:Number of subjects in the analysis result

    • Collections: Original collections studied

    • AnalysisArtifactsonTCIA: Type(s) of analysis artifacts generated

    • Updated: Data when results were last updated

    • license_url: The URL of a web page that describes the license governing this collection

    • license_long_name: A long form name of the license governing this collection

    • license_short_name: A short form name of the license governing this collection

  • cancer-idc.idc_v<version_number>.version_metadata (also available via the canceridc-data.idc-current.version_metadata view for the current version of IDC data). Metadata for each IDC version, one row per row:

    • idc_version: IDC version number

    • version_hash: MD5 hash of hashes of collections in this version

    • version_timestamp: Version creation timestamp

  • (also available via
    ``

    view for the current version of IDC data) Measurement group sequences extracted from the DICOM SR TID1500 objects

  • bigquery-public-data.idc_v<idc_version_number>.qualitative_measurements (also available via bigquery-public-data.idc_current.qualitative_measurements view for the current version of IDC data) Coded evaluation results extracted from the DICOM SR TID1500 objects

  • bigquery-public-data.idc_v<idc_version_number>.quantitative_measurements (also available via bigquery-public-data.idc_current.quantitative_measurements view for the current version of IDC data) Quantitative evaluation results extracted from the DICOM SR TID1500 objects

  • : "Participant"
  • ``nlst_screen: "SCT Screening"

  • bigquery-public-data.idc_current.dicom_all table
    here
    collections hosted by IDC
    Google Public Datasets Program
    gsutil
    here
    Google Public Datasets Program
    CRDC UUIDs
    this notebook
    BigQuery (BQ)
    standard SQL
    bigquery-public-data.idc_current.auxiliary_metadata
    https://cloud.google.com/bigquery/docs/nested-repeated
    bigquery-public-data.idc_current.original_collections_metadata
    this table at TCIA
    BigQuery views
    bigquery-public-data.idc_current.dicom_all
    bigquery-public-data.idc_current.segmentations
    Cancer Data Access System (CDAS) system
    TCIA NLST collection page
    nlst_canc
    nlst_ctab
    nlst_ctabc
    nlst_prsn
    discussion here
    isb-cgc:TCGA_bioclin_v0.clinical_v1
    here
    bigquery-public-data.idc_current.measurement_groups
    SELECT
      ...
    FROM
      `bigquery-public-data.idc_current.dicom_all`
    WHERE
      access <> "Limited"
    $ cat gcs_paths.txt | gsutil -m cp -I .

    Viewer release notes

    The version of the viewer is available from the "About" menu for the OHIF (radiology) viewer, and "Get app info" menu for the Slim (pathology) viewers. Both of those menus are in the upper right corner of the window.

    IDC viewers release notes are maintained via GitHub Releases. Once you identified the version of the deployed viewer using the instructions above, you can locate the corresponding release notes in the following:

    • OHIF v3 release notes:

    • Slim release notes:

      • a significant portion of functionality available in Slim is implemented in the dicom-microscopy-viewer package, its releases are available here:


    v2-legacy - 8 Jun 2023 - OHIF

    The final OHIF v2 published version is 4.12.45. Upstream changes based on v2 will be accessible through the v2-legacy branch (will not be published to NPM).

    Main highlights from v2-legacy since 4.12.45:

    • Fix high and critical dependency issues reported by dependabot

    • Update SEG tolerance popup and update SEG thumbnail warning: Jump to first segment item image and show warning message only once on onChange events

    • Update to issues and PR templates

    • Address segmentation visibility toggle applied to all segmentations instead of the active one only

    4.12.45 - Oct 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Handle missing ReferencedInstanceSequence attribute: Update parsing logic to consider attribute as optional.

    4.12.44 - Oct 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Remove unused code from DICOM SR parsing: Remove referecenedImages attribute from SR display sets. Within TID 1500, sub-template TID 1600 (Image Library) is not required while parsing SR for image references for annotations and planar measurements. The same is obtained from sub template TID 1501>TID 300>TID 320.

    4.12.43 - Oct 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Update message for segmentation error loading due to orientation tolerance

    4.12.42 - Oct 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Correct Parsing Logic for Qualitative Instance Level SR

    4.12.41 - Sep 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Fix 2d MPR rendering issue for the sagittal view

    0.13.0 - April 2023 - Slim

    Slim is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    New Features

    • Support configuration of multiple origin servers for different types of DICOM objects (SOP Storage Classes)

    Enhancements

    • Improved error handling

    • Check Pyramid UID (if available) when grouping images into digital slides

    Bug Fixes

    • Use Acquisition UID (if available) to group images into digital slides

    0.11.2 - September 2022 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    New features

    • Add new tool to go to specific slide position;

    • Show mouse position in slide coordinate system.

    Enhancements

    • Improve performance of translations between image and slide coordinates;

    • Automatically adjust size of overview image to size of browser window.

    Bug fixes

    • Fix rendering of label image;

    • Show error message when creation of viewer fails;

    • Fix resolution of overview image;

    • Fix styling of point annotations;

    4.12.40 - September 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Fix parsing of qualitative slice annotation;

    • Disable measurements panel interactions in MPR mode;

    • Fix parsing of segmentation when orientation values are close to zero;

    • Raise error if a frame StudyInstanceUID, SeriesInstanceUID and SOPInstanceUID are not conforming with the UID (DICOM UI VR) character repertoire;

    0.8.1 - June 2022 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    New features

    • Add panel for clinical trial information to case viewer;

    • Sort digital slides by Container Identifier attribute.

    Enhancements

    • Reset style of optical paths to default when deactivating presentation state.

    Bug fixes

    • Fix rendering of ROI annotations by upgrading to React version 1;

    • Correctly update UIDs of visible/active optical paths;

    • Fix type declarations of DICOMweb search resources.

    4.12.30 - June 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Add support for SR qualitative annotation per instance.

    0.7.2 - June 2022 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    New features

    • Support DICOM Advanced Blending Presentation State to parametrize the display of multiplexed IF microscopy images;

    • Add key bindings for annotations tools;

    • Enable configuration of tile preload;

    • Enable configuration of annotation geometry type per finding;

    Enhancements

    • Improve default presentation of multiplexed IF microscopy images in the absence of presentation state instances;

    • Correctly configure DCM4CHEE Archive to use reverse proxy URL prefix for BulkDataURI in served metadata;

    • Enlarge display settings interfaces and add input fields for opacity, VOI limits, and colors;

    • Update dicom-microscopy-viewer version to use web workers for frame decoding/transformation operations;

    Bug fixes

    • Fix parsing of URL path upon redirect after successful authentication/authorization;

    • Fix configuration of optical path display settings when switching between presentation states;

    • Fix caching of presentation states and for selection via drop-down menu.

    Security

    • Update dependencies with critical security issues.

    0.5.1 - April 2022 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    Enhancements

    • Make overview panel collapsible and hide it entirely if lowest-resolution image is too large.

    Bug fixes

    • Fix update of optical path settings when switching between slides.

    4.12.26 - April 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Fix regression in logic for finding segmentations referenced source image;

    • Fix segmentations loading issues;

    • Fix thumbnail series type for unsupported SOPClassUID;

    • Fix toolbar error when getDerivedDatasets finds no referenced series are found.

    0.5.0 - March 2022 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    New features

    • Display of analysis results stored as DICOM Segmentation, Parametric Map, or Microscopy Bulk Simple Annotations instances;

    • Dynamic selection of DICOMweb server by user (can be enabled by setting AppConfig.enableServerSelection to true);

    • Dark app mode for fluorescence microscopy (can be enabled by setting App.mode to "dark");

    Improvements

    • Unify handling of optical paths for color and grayscale images;

    • Add loading indicator;

    • Improve styling of overview map;

    • Render specimen metadata in compacter form;

    Bug fixes

    • Ensure ROI annotations are re-rendered upon modification;

    • Clean up memory and recreate viewers upon page reload;

    • Fix selection of volume images;

    • Fix color space conversion during decoding of JPEG 2000 compressed image frames;

    4.12.22 - March 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Improve logic for finding segmentations referenced source image;

    • Improve debug dialog: fix text overflow and adding active viewports referenced SEGs and RTSTRUCT series.

    4.12.17 - February 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Fix fail to load SEG related to geometry assumptions;

    • Fix fail to load SEG related to tolerance;

    • Add initial support for SR planar annotations.

    0.4.5 - January 2022 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    Bug fixes

    • Fix selection of VOLUME or THUMBNAIL images with different Photometric Interpretation.

    4.12.12 - January 2022 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Fix RTSTRUCT right panel updates;

    • Fix SEG loading regression.

    4.12.7 - December 2021 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Fix handling of datasets with unsupported modalities;

    • Fix backward fetch of images for the current active series.

    • Fix tag browser slider.

    0.4.3 - November 2021 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    Bug fixes

    • Rotate box in overview map outlining the extent of the current view together with the image.

    4.12.5 - November 2021 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Fix segmentation/rtstruct menu badge update when switching current displayed series;

    • Add to series thumbnail link icon if they are connected to any annotation (segmentation, etc...);

    • Fix problems opening series when the study includes many series;

    • Fix segments visibility handler.

    0.4.1 - October 2021 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    Improvements

    • Include images with new flavor THUMBNAIL in image pyramid;

    • Properly fit overview map into HTML element and disable re-centering of overview map when user navigates main map;

    • Allow drawing of ROIs that extent beyond the slide coordinate system (i.e., allow negative ROI coordinates).

    Bug fixes

    • Prevent display of annotation marker when ROI is deactivated

    4.11.2 - October 2021 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Fix issues with segmentation orientations;

    • Fix display of inconsistencies warning for segmentation thumbnails;

    • Fix throttle thumbnail progress updates.

    0.3.1 - September 2021 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    Bug fixes

    • Set PUBLIC_URL in Dockerfile.

    0.3.0 - September 2021 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    Improvements

    • Add button to display information about application and environment;

    • Add ability to include logo;

    • Verify content of SR documents before attempting to load annotations;

    • Improve re-direction after authentication;

    Bug fixes

    • Disable zoom of overview map;

    • Fix pagination of worklist;

    • Prevent delay in tile rendering.

    4.10.1 - September 2021 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Handle uncaught exception for non TID 1500 sr;

    • Added display of badge numbers in the segmentation / rtstruct panel tabs;

    • Study prefetcher with loading bar.

    0.2.0 - August 2021 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Main highlights of this release include:

    New features

    • Support for multiplexed immunofluorescence slide microscopy imaging;

    • Client-side additive blending of multiple channels using WebGL;

    • Client-side decoding of compressed frame items using WebAssembly based on Emscripten ports of libjpeg-turbo, openjpeg, and charls C/C++ libraries.

    Improvements

    • Continuous integration testing pipeline using circle CI;

    • Deploy previews for manual regression testing.

    Major changes

    • Introduce new configuration parameter renderer.

    4.9.20 - June 2021 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Add exponential backoff and retry after 500 error;

    • Update to HTML SR viewport to display missing header tags.

    0.1.0 - May 2021 - Slim

    The Slim Viewer is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication (PACS), or Vendor Neutral Archive (VNA).

    Inital Release.

    4.9.17 - May 2021 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Add disable server cache feature;

    • Additional improvements on series inconsistencies report UI.

    4.9.13 - April 2021 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Add acquisition storage SR sopclass to SR html ext;

    • Fix missing items in the segmentation combobox items at loading;

    • Fix slices are not sorted in geometrical order;

    • Extend series inconsistencies checks to segmentation and improve UI.

    4.9.7 - March 2021 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Add new log service to be used by debugger extension;

    • Add UI to communicate to the users inconsistencies within a single series;

    • Add time in the dates of the items of the segmentation combobox list;

    • Order segmentation combobox list in reverse time order;

    4.8.10 - February 2021 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Replace instance dropdown to slider for dicom tag browser;

    • Add error page and not found pages if failed to retrieve study data.

    4.8.5 - Jannuary 2021 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Add UI error report for MPR buffer limit related errors;

    • Add UI error report for hardware acceleration turned off errors;

    • Add IDC funding acknowledgment;

    • Fix RSTRUCT menu panel undefined variables;

    4.8.0 - December 2020 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Visualize overlapping segments;

    • Use runtime value configuration to get pkg version;

    • Fix navigation issues in the crosshair tool.

    4.5.22 - October 2020 - OHIF

    The OHIF Viewer is a zero-footprint medical image viewer provided by the . It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support .

    Main highlights of this release include:

    • Add MPR crosshair tool.

  • Update dcmjs version so it throws 'Failed to find the reference image in the source data. Cannot load this segmentation' error instead of logging a warning to console

  • Address eye icon for segment not shown when segment name is long

  • Change message for segmentation when it fails to load duo to orientation tolerance

  • Ensure bounding box annotations are axis aligned;

  • Add missing keyboard shortcut for navigation tool.

  • Implements runtime tolerance for SEGs loading retry;

  • Fixed popup notifications behavior;

  • Update cornerstoneWADOImageLoader.

  • Expose equipment metadata in user interface.

  • Add button for user logout;

  • Disable optical path selection when a presentation state has been selected.

  • Support display of segments stored in DICOM Segmentation instances;
  • Support display of parameter mappings stored in DICOM Parametric Map instances;

  • Support display of annotation groups stored in DICOM Microscopy Bulk Simple Annotations instances;

  • Implement color transformations using ICC Profiles to correct color images client side in a browser-independent manner;

  • Implement grayscale transformations using Palette Color Lookup Tables to pseudo-color grayscale images.

  • Improve fetching of WASM library code;

  • Improve styling of slide viewer sidebar;

  • Sort slides by Series Number;

  • Work around common standard compliance issues;

  • Update docker-compose configuration;

  • Upgrade dependencies;

  • Show examples in README;

  • Decode JPEG, JPEG 2000, and JPEG-LS compressed image frames client side in a browser-independent manner;

  • Improve performance of transformation and rendering operations using WebGL for both grayscale as well as color images;

  • Optimize display of overview images and keep overview image fixed when zooming or panning volume images;

  • Optimize HTTP Accept header field for retrieval of frames to work around issues with various server implementations.

  • Fix unit of area measurements for ROI annotations;

  • Publish events when bulkdata loading starts and ends.

  • Add retry logic and error handlers for DICOMweb requests;

  • Improve documentation of application configuration in README;

  • Add unit tests.

  • Fix failure to load a valid SEG object because of incorrect expectations about ReferencedSegmentNumber;

  • Fix RSTRUCT menu visibility when loading a series;

  • Fix image load slowness regression;

  • Fix choppy scrolling in 2D mod;

  • Fix failure to load segmentations when filtering study with '?seriesInstanceUID=' syntax.

  • Fix RSTRUCT menu visibility when loading a series;

  • Fix segments visibility control (SEG menu) bugs .

  • https://github.com/OHIF/Viewers/releases
    https://github.com/ImagingDataCommons/slim/releases
    https://github.com/ImagingDataCommons/dicom-microscopy-viewer/releases
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    dicom-microscopy-viewer
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    Open Health Imaging Foundation (OHIF)
    DICOMweb
    radiology viewer version location
    pathology viewer version location

    Manifests

    Manifests

    A manifest is a table of access methods and other metadata of the objects in some cohort. The POST /cohorts/manifest/preview API endpoint

    The POST /cohorts/manifest/preview API accepts both a fields list, and a cohort definition in the manifestPreviewBody, and returns a manifest. The JSON schema of the manifestBody can be seen on the IDC API v2 UI page. Here is an example:

    As previously mentioned, it behaves as if a cohort is created, a manifest for that cohort is returned and the new cohort is deleted.

    The filters parameter specifies a filter set that defines the cohort.

    The fields parameter of the body indicates the fields whose values are to be included in the returned manifests. The /fields API endpoint returns a list of the fields that can be included in a manifest.

    The counts, group_size, sql and page_size parameters will be described in subsequent sections.

    Every row in the returned manifest will include one value for each of the items in the fields parameter.

    The /cohorts/manifest/preview returns a manifestPreviewResponse JSON object. Here is an example manifestResponse:

    The cohort definition is included so that the manifest is self-documenting. The manifest_data component of the manifest component contains a row for each distinct combination of the requested fields in the cohort.

    Because the /cohorts/manifest/preview API endpoint is always applied against the current IDC version, the idc_data_version in the cohort_def is always that of the current IDC version. This version information can be useful if the cohort_def is saved.

    The totalFound value at the end of the manifest tells us that there are 626 rows in the manifest, meaning the manifest contains 626 different combinations of Modality, SliceThickness, age_at_diagnosis, aws_bucket, and crdc_series uuid.

    The rowsReturned value indicates that all the rows in the manifest were return in the first "page". If not all the rows had been returned, we can ask for additional "pages" as described in the next section.

    The next_page value is described in the next section.

    Groups and group_size

    We use the term group to indicate the set of all instances in the cohort having the values of some row in the manifest. Thus the values of the first row above:

    implicitly define a group of instances in the cohort, each of which has those values.

    When the group_size parameter in the manifestPreviewBody is true, the resulting manifest includes the total size in bytes of the instances in the corresponding group. Following is a fragment of the manifest for the same cohort above, but where the fields list has "group_size": true:

    Here we see that the instances in the group corresponding to the first result row have a total size of 2,690,320B.

    The group_size parameter is optional and defaults to false .

    Manifest granularity

    If the counts parameter is true, the resulting manifest will selectively include counts of the instances, series, studies, patients and collections in each group. Which counts are included in a manifest is determined by the granularity, and which, in turn, is determined by certain of the possible fields in the fields parameter list of the manifestPreviewBody.

    For example, if the fields parameter list includes the SOPInstanceUID field, there will one group per instance in the manifest. Thus the manifest has instance granularity. A manifest has just one of instance, series, study, patient, collection or version granularity.

    For a given manifest granularity, and when counts is True, counts of the "lower level" objects are reported in the manifest. Thus, if a cohort has series granularity, then the count of all instances in each group is reported. If a cohort has study granularity, then the count of all instances in each group and of all series in each group are reported. And so on. This is described in detail in the remainder of this section.

    In the following, manifest examples are based on this filterSet:

    Instance granularity

    A manifest will have instance granularity if the fields parameter list includes one or both of the fields:

    • SOPInstanceUID

    • crdc_instance_uuid

    • gcs_url

    • aws_url

    Both of these fields are unique to each instance. Therefore the resulting manifest will include one row for each instance in the specified cohort. For example, the following fields list will result in a manifest having a row per instance:

    Each row will include the SOPInstanceUID, and the Modality and SliceThickness of the corresponding instance.

    The counts parameter is ignored because there are no 'lower level' objects than instances in the DICOM hierarchy.

    Series granularity

    A manifest will have series granularity if it does not have instance granularity and the fields parameter list includes one or more of thee field:

    • SeriesInstanceUID

    • crdc_series_uuid

    Both of these fields are unique to each series, and therefore the resulting manifest will include at least one row per series in the specified cohort. For example, the following fields list will result in a manifest having one or more rows per series:

    Because the SeriesInstanceUID is unique to each series in a cohort (more accurately, all instances in a series have the same SeriesInstanceUID), there will be at least one row per series in the resulting manifest. However, SliceThickness is not necessarily unique across all instance in a series. Therefore, the resulting manifest may have multiple rows for a given series...rows in which the SeriesInstanceUID is the same but the SliceThickness values differ. DICOM modality should always be the same for all instances in a series; therefore it is not expected to result in multiple rows per series.

    If the counts parameter is true, each row of the manifest will have:

    • an instance_count value that is the count of instances in the group corresponding to the row

    Given the above fields, then this is a fragment of the series granularity manifest of our example cohort:

    This tells us that the group of instances corresponding to the first row of the manifest results has 151 members.

    Study Granularity

    A manifest will have study granularity if it does not have series or instance granularity and the queryFields list includes one or more of the fields:

    • StudyInstanceUID

    • crdc_study_uuid

    Both of these fields are unique to each study, and therefore the resulting manifest will include at least one row per study in the specified cohort. For example, the following fields list will result in a manifest having a one or more rows per study:

    SliceThickness can vary not only among the instances in a series, but among series in a study. Therefore, the resulting manifest may have multiple rows for a study, and which differ from each other in both SliceThickness and Modality.

    If counts is in the fields list, each row of the manifest will have:

    • an instance_count value that is the count of instances in the group corresponding to the row

    • a series_count value that is the count of series in the group corresponding to the row

    If the fields list is as above, then this is a fragment of the study granularity manifest of our example cohort:

    This tells us that the group of instances corresponding to the first row of the manifest results has 212 members, divided among two series. The group of instances corresponding to the third row of the manifest results has two members in a single series.

    Patient Granularity

    A manifest will have patient granularity if it goes not have study, series or instance granularity and the fields list includes the field PatientID. This field is unique to each patient, and therefore the resulting manifest will include at least one row per patient in the specified cohort. For example, the following fields list will result in a manifest having a one or more rows per study:

    Because the PatientID is unique to each patient in a cohort (more accurately, all instances in a study have the same PatientID), there will be at least one row per patient in the resulting manifest. It is common for a patient's series to examine different body parts. Therefore, the resulting manifest may well have more than one row per patient.

    If counts is in the fields list, each row of the manifest will have:

    • an instance_count value that is the count of instances in the group corresponding to the row

    • a series_count value that is the count of series in the group corresponding to the row

    • a study_count value that is the count of studies in the group corresponding to the row

    If the fields list is as above, then this is a fragment of the patient granularity manifest of our example cohort:

    This tells us that the group of instances corresponding to the first row of the manifest results has 212 members divided among two series, and both in a single study.

    Collection Granularity

    A manifest will have collection granularity if it goes not have patient, study, series or instance granularity and the fields parameter list includes the field collection_id. This field is unique to each collection, and therefore the resulting manifest will include at least one row per collection in the specified cohort. For example, the following fields list will result in a manifest having a one or more rows per study:

    Because the collection_id is unique to each collection in a cohort (more accurately, all instances in a collection have the same collection_id), there will be at least one row per collection in the resulting manifest. It is common for a collection to have patients of different ages. Therefore, the resulting manifest may well have more than one row per patient.

    If the fields list is as above, then this is a fragment of the collection granularity manifest of our example cohort:

    Version granularity

    A manifest will have version granularity if it does not have collection, patient, study, series or instance granularity. At this granularity level, the rows in the manifest return the combinations of queried values across all collects, patients, studies, series and instances in the cohort.

    When the fields list is as follows:

    then this is a fragment of the version granularity manifest of our example cohort:

    Row one of the results tells us that the cohort has 212 instances having a Null SliceThickness and modality="CT". Also, there are apparently 87 different combinations of Modality and SliceThickness in the cohort as shown by the totalFound value.

    {
      "cohort_def": {
        "name": "mycohort",
        "description": "Example description",
        "filters": {
          "collection_id": [
            "TCGA_luad",
            "%_kirc"
          ],
          "Modality": [
            "CT",
            "MR"
          ],
          "Race": [
            "WHITE"
          ],
          "age_at_diagnosis_btw": [
            65,
            75
          ]
        }
      },
      "fields": [
        "Age_At_Diagnosis",
        "aws_bucket",
        "crdc_series_uuid",
        "Modality",
        "SliceThickness"
      ],
      "counts": true,
      "group_size": true,
      "sql": true,
      "page_size": 1000
    }
        
    {
      "code": 200,
      "cohort_def": {
        "description": "Example description",
        "user_email": "[email protected]",
        "filterSet": {
          "filters": {
            "Modality": [
              "CT",
              "MR"
            ],
            "age_at_diagnosis_btw": [
              65,
              75
            ],
            "collection_id": [
              "tcga_luad",
              "%_kirc"
            ],
            "race": [
              "WHITE"
            ]
          },
          "idc_data_version": "16.0"
        },
        "name": "mycohort",
      },
      "manifest": {
        "manifest_data": [
          {
            "Modality": "MR",
            "SliceThickness": "10.0",
            "age_at_diagnosis": 66,
            "aws_bucket": "idc-open-data",
            "crdc_series_uuid": "09bc812b-53f7-48fc-8895-72f6b03f642b"
          },
          {
            "Modality": "CT",
            "SliceThickness": "2.5",
            "age_at_diagnosis": 66,
            "aws_bucket": "idc-open-data",
            "crdc_series_uuid": "102d676d-6c6f-4c20-bb36-77ec81b81b13"
          },
          {
            "Modality": "CT",
            "SliceThickness": "8.0",
            "age_at_diagnosis": 66,
            "aws_bucket": "idc-open-data",
            "crdc_series_uuid": "1d365f52-bff4-4348-a508-82d399ca8442"
          },   
          :
          {
            "Modality": "CT",
            "SliceThickness": "1000.090881",
            "age_at_diagnosis": 74,
            "aws_bucket": "idc-open-data",
            "crdc_series_uuid": "faa47e10-45df-44a7-9f8b-2923a41196b4"
          }
        ],
        "rowsReturned": 626,
        "totalFound": 626
      },
      "next_page": ""
    }
    
    "Modality": "MR",
    "SliceThickness": "10.0",
    "age_at_diagnosis": 66,
    "aws_bucket": "idc-open-data",
    "crdc_series_uuid": "09bc812b-53f7-48fc-8895-72f6b03f642b" 
    {
      "code": 200,
      "cohort_def": {
        "description": "Example description",
        "filterSet": {
          "filters": {
            "Modality": [
              "CT",
              "MR"
            ],
            "age_at_diagnosis_btw": [
              65,
              75
            ],
            "collection_id": [
              "tcga_luad",
              "tcga_kirc"
            ],
            "race": [
              "WHITE"
            ]
          },
          "idc_data_version": "16.0"
        },
        "name": "mycohort",
        "sql": ""
      },
      "next_page": "",
      "manifest": {
        "manifest_data": [
          {
            "Modality": "MR",
            "SliceThickness": "10.0",
            "age_at_diagnosis": 66,
            "aws_bucket": "idc-open-data",
            "crdc_series_uuid": "09bc812b-53f7-48fc-8895-72f6b03f642b",
            "group_size": 2690320
          },
          {
            "Modality": "CT",
            "SliceThickness": "2.5",
            "age_at_diagnosis": 66,
            "aws_bucket": "idc-open-data",
            "crdc_series_uuid": "102d676d-6c6f-4c20-bb36-77ec81b81b13",
            "group_size": 42818868
          },
          {
            "Modality": "CT",
            "SliceThickness": "8.0",
            "age_at_diagnosis": 66,
            "aws_bucket": "idc-open-data",
            "crdc_series_uuid": "1d365f52-bff4-4348-a508-82d399ca8442",
            "group_size": 20064536
          },
          :
          :
          {
            "Modality": "CT",
            "SliceThickness": "1000.090881",
            "age_at_diagnosis": 74,
            "aws_bucket": "idc-open-data",
            "crdc_series_uuid": "faa47e10-45df-44a7-9f8b-2923a41196b4",
            "group_size": 6518724
          }
        ],
        "rowsReturned": 626,
        "totalFound": 626
      },
      "next_page": ""
    }
       "filters": {
          "collection_id": [
            "tcga_luad",
            "tcga_kirc"
          ],
          "Modality": [
            "CT",
            "MR"
          ],
          "Race": [
            "WHITE"
          ],
          "age_at_diagnosis_btw": [
            65,
            75
          ]
        }
    
    {
      "fields": [
        "SOPInstanceUID",
        "Modality",
        "SliceThickness"
      ]
    }
    "fields": [
      "Modality",
      "SliceThickness",
      "collection_id",
      "patientID",
      "StudyInstanceUID",
      "SeriesInstanceUID"
    ]
    {
      "code": 200,
      "cohort_def": {
        "description": "Example description",
        "filterSet": {
          "filters": {
            "Modality": [
              "CT",
              "MR"
            ],
            "age_at_diagnosis_btw": [
              65,
              75
            ],
            "collection_id": [
              "tcga_luad",
              "tcga_kirc"
            ],
            "race": [
              "WHITE"
            ]
          },
          "idc_data_version": "16.0"
        },
        "name": "mycohort",
        "sql": ""
      },
      "manifest": {
        "manifest_data": [
          {
            "Modality": "CT",
            "PatientID": "TCGA-50-6592",
            "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.9002.141004994853145237754973938025",
            "SliceThickness": null,
            "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.9002.256822832756566055874151999412",
            "collection_id": "tcga_luad",
            "instance_count": "151"
          },
          {
            "Modality": "CT",
            "PatientID": "TCGA-50-6592",
            "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.9002.212096199865546132848990878032",
            "SliceThickness": null,
            "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.9002.256822832756566055874151999412",
            "collection_id": "tcga_luad",
            "instance_count": "61"
          },
          {
            "Modality": "CT",
            "PatientID": "TCGA-50-6595",
            "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.9002.829269157955398706933292266867",
            "SliceThickness": "0.578125",
            "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.9002.414530650520592976265083061155",
            "collection_id": "tcga_luad",
            "instance_count": "1"
          },
          :
          :
          {
            "Modality": "MR",
            "PatientID": "TCGA-B0-5109",
            "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.4004.370888372270096165934432087127",
            "SliceThickness": "20.0",
            "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.4004.167173047835125001355984228239",
            "collection_id": "tcga_kirc",
            "instance_count": "50"
          }
        ],
        "rowsReturned": 742,
        "totalFound": 742
      }
      "next_page": ""
    }
    "fields": [
        "Modality",
        "SliceThickness",
        "collection_id",
        "patientID",
        "StudyInstanceUID",
        "group_size",
        "counts"
    ]
    {
      "code": 200,
      "cohort_def": {
        "description": "Example description",
        "filterSet": {
          "filters": {
            "Modality": [
              "CT",
              "MR"
            ],
            "age_at_diagnosis_btw": [
              65,
              75
            ],
            "collection_id": [
              "tcga_luad",
              "tcga_kirc"
            ],
            "race": [
              "WHITE"
            ]
          },
          "idc_data_version": "16.0"
        },
        "name": "mycohort",
        "sql": ""
      },
      "manifest": {
        "manifest_data": [
          {
            "Modality": "CT",
            "PatientID": "TCGA-50-6592",
            "SliceThickness": null,
            "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.9002.256822832756566055874151999412",
            "collection_id": "tcga_luad",
            "instance_count": 212,
            "series_count": 2
          },
          {
            "Modality": "CT",
            "PatientID": "TCGA-50-6595",
            "SliceThickness": "0.578125",
            "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.9002.414530650520592976265083061155",
            "collection_id": "tcga_luad",
            "instance_count": 1,
            "series_count": 1
          },
          {
            "Modality": "CT",
            "PatientID": "TCGA-B8-4153",
            "SliceThickness": "0.6",
            "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.8421.4004.499780439902438461273732269226",
            "collection_id": "tcga_kirc",
            "instance_count": 2,
            "series_count": 1
          },
          :
          :
          {
            "Modality": "MR",
            "PatientID": "TCGA-B0-5109",
            "SliceThickness": "20.0",
            "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.4004.167173047835125001355984228239",
            "collection_id": "tcga_kirc",
            "instance_count": 100,
            "series_count": 2
          }
        ],
        "rowsReturned": 324,
        "totalFound": 324
      },
      "next_page": ""
    }
    "fields": [
        "Modality",
        "SliceThickness",
        "collection_id",
        "patientID",
        "group_size",
        "counts"
    ]
    {
      "code": 200,
      "cohort_def": {
        "description": "Example description",
        "filterSet": {
          "filters": {
            "Modality": [
              "CT",
              "MR"
            ],
            "age_at_diagnosis_btw": [
              65,
              75
            ],
            "collection_id": [
              "tcga_luad",
              "tcga_kirc"
            ],
            "race": [
              "WHITE"
            ]
          },
          "idc_data_version": "16.0"
        },
        "name": "mycohort",
        "sql": ""
      },
      "next_page": "",
      "manifest": {
        "manifest_data": [
         {
            "Modality": "CT",
            "PatientID": "TCGA-50-6592",
            "SliceThickness": null,
            "collection_id": "tcga_luad",
            "instance_count": "212",
            "series_count": "2",
            "study_count": "1"
          },
          {
            "Modality": "CT",
            "PatientID": "TCGA-50-6595",
            "SliceThickness": "0.578125",
            "collection_id": "tcga_luad",
            "instance_count": "1",
            "series_count": "1",
            "study_count": "1"
          },
          {
            "Modality": "CT",
            "PatientID": "TCGA-B8-4153",
            "SliceThickness": "0.6",
            "collection_id": "tcga_kirc",
            "instance_count": "6",
            "series_count": "2",
            "study_count": "2"
          },
          :
          :
          {
            "Modality": "MR",
            "PatientID": "TCGA-B0-5109",
            "SliceThickness": "20.0",
            "collection_id": "tcga_kirc",
            "instance_count": "100",
            "series_count": "2",
            "study_count": "1"
          }
        ],
        "rowsReturned": 301,
        "totalFound": 301
      }
    }
    "fields": [
        "Modality",
        "SliceThickness",
        "collection_id",
        "patientID",
        "group_size",
        "counts"
    ]
    {
      "code": 200,
      "cohort_def": {
        "description": "Example description",
        "filterSet": {
          "filters": {
            "Modality": [
              "CT",
              "MR"
            ],
            "age_at_diagnosis_btw": [
              65,
              75
            ],
            "collection_id": [
              "tcga_luad",
              "tcga_kirc"
            ],
            "race": [
              "WHITE"
            ]
          },
          "idc_data_version": "16.0"
        },
        "name": "mycohort",
        "sql": ""
      },
      "manifest": {
        "manifest_data": [
          {
            "Modality": "CT",
            "SliceThickness": null,
            "collection_id": "tcga_luad"
            "instance_count": "212",
            "patient_count": "1",
            "series_count": "2",
            "study_count": "1"
          },
          {
            "Modality": "CT",
            "SliceThickness": "0.578125",
            "collection_id": "tcga_luad",
            "instance_count": "1",
            "patient_count": "1",
            "series_count": "1",
            "study_count": "1"
          },
          {
            "Modality": "CT",
            "SliceThickness": "0.6",
            "collection_id": "tcga_kirc",
            "instance_count": "29",
            "patient_count": "9",
            "series_count": "16",
            "study_count": "14"
          },
          :
          :
          {
            "Modality": "MR",
            "SliceThickness": "20.0",
            "collection_id": "tcga_kirc",
            "instance_count": "100",
            "patient_count": "1",
            "series_count": "2",
            "study_count": "1"
          }
        ],
        "rowsReturned": 88,
        "totalFound": 88
      }
      "next_page": "",
    }
    "fields": [
        "Modality",
        "SliceThickness",
        "patientID",
        "group_size",
        "counts"
    ]
    {
      "code": 200,
      "cohort_def": {
        "description": "Example description",
        "filterSet": {
          "filters": {
            "Modality": [
              "CT",
              "MR"
            ],
            "age_at_diagnosis_btw": [
              65,
              75
            ],
            "collection_id": [
              "tcga_luad",
              "tcga_kirc"
            ],
            "race": [
              "WHITE"
            ]
          },
          "idc_data_version": "16.0"
        },
        "name": "mycohort",
        "sql": ""
      },
      "manifest": {
        "manifest_data": [
          {
            "Modality": "CT",
            "SliceThickness": null,
            "collection_count": "1",
            "instance_count": "212",
            "patient_count": "1",
            "series_count": "2",
            "study_count": "1"
          },
          {
            "Modality": "CT",
            "SliceThickness": "0.578125",
            "collection_count": "1",
            "instance_count": "1",
            "patient_count": "1",
            "series_count": "1",
            "study_count": "1"
          },
          {
            "Modality": "CT",
            "SliceThickness": "0.6",
            "collection_count": "2",
            "instance_count": "34",
            "patient_count": "11",
            "series_count": "19",
            "study_count": "17"
          },
          {
          :
          :
          {
            "Modality": "MR",
            "SliceThickness": "20.0",
            "collection_count": "1",
            "instance_count": "100",
            "patient_count": "1",
            "series_count": "2",
            "study_count": "1"
          }
        ],
        "rowsReturned": 87,
        "totalFound": 87
      }
      "next_page": "",
    }

    Data release notes

    Data hosted by IDC is ingested from several sources, including The Cancer Imaging Archive (TCIA), Genomics Data Commons (GDC), Clinical Proteomic Tumor Analysis Consortium (CPTAC) and Human Tumor Atlas Network (HTAN).

    Please refer to the license and terms of use, which are defined in the license_url and source_doi or source_url of the IDC BigQuery dicom_all table. You can filter the data by license type in the IDC Portal, or programmatically, as demonstrated in this tutorial.

    IDC releases summary view

    V23 - Nov 2025

    Release counts

    Files: 46,870,903 (+175,736)

    Series: 994,073 (+28,666)

    Studies: 160,199 (+606)

    Cases: 79,889 (+355)

    New radiology collections

    New pathology collections

    New analysis results

    1. Collections analyzed:

      1. Lung-PET-CT-Dx

    2. Collections analyzed:

      1. NLST

    Revised radiology collections

    Revised pathology collections

    Revised analysis results

    New clinical metadata tables

    Revised clinical metadata tables

    1. varepop_apollo_clinical

    V22 - Sept 2025

    Release counts

    Files: 46,695,167 (+910,713)

    Series: 965,407 (+14,519)

    Studies: 159,593 (+10,016)

    Cases: 79,214 (+8,132)

    New radiology collections

    New pathology collections

    Revised radiology collections

    Revised pathology collections

    New clinical metadata tables

    1. bonemarrowwsi_pediatricleukemia_clinical

    2. cbis_ddsm_calc_case_description_test_set

    3. cbis_ddsm_calc_case_description_train_set

    4. cbis_ddsm_mass_case_description_test_set

    Revised clinical metadata tables

    1. varepop_apollo_clinical

    V21 - May 2025

    Release counts

    Files: 45,784,454 (+174,244)

    Series: 950,888 (+3,308)

    New radiology collections

    Revised radiology collections

    Revised pathology collections

    Revised analysis results

    New clinical metadata tables

    1. bamf_aimi_annotations_brain_mr_qa_results

    2. bamf_aimi_annotations_breast_fdg_pet_ct_qa_results

    3. bamf_aimi_annotations_breast_mr_qa_results

    4. bamf_aimi_annotations_kidney_ct_qa_results

    Renamed clinical metadata tables

    1. nlst_canc

      1. Previously nlst_clinical

    Retired clinical metadata tables

    1. acrin_nsclc_fdg_pet_bamf_lung_pet_ct_segmentation

      1. Subsumed by bamf_aimi_annotations_lung_fdg_pet_ct_qa_results

    2. anti_pd_1_lung_bamf_lung_ct_segmentation

      1. Subsumed by bamf_aimi_annotations_lung_ct_qa_results

    V20 - November 2024

    New radiology collections

    New pathology collections

    Revised radiology collections

    Revised pathology collections

    Revised analysis results

    1. Collections analyzed:

    2. Collections analyzed:

    New Clinical Metadata Tables

    v19 - September 2024

    New pathology collections

    New analysis results

    1. Collections analyzed:

    2. Collections analyzed:

    Revised radiology collections

    Cancer Moonshot Biobank (CMB) radiology images were updated to fix incorrect values assigned to PatientID (see details on the collection pages linked above). The updated images have different DICOM Study/Series/SOPInstanceUIDs.

    Revised analysis results

    1. Collections analyzed:

    New clinical metadata tables

    v18 - April 2024

    New radiology collections

    New analysis results

    1. * Collections analyzed:

    2. ** Collections analyzed:

    Revised radiology collections

    (starred collections are revised due to new or revised analysis results)

    1. (revisions only to clinical data)

    2. **

    Revised pathology collections

    (starred collections are revised due to new or revised analysis results)

    1. (fix PatientAges > 090Y)

    2. (fix PatientAges > 090Y)

    3. *

      1. Also added missing instance SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.3459553143.523311062.1687086765943.9.0

    New clinical metadata tables

    Notes

    The deprecated columns tcia_api_collection_id and idc_webapp_collection_id have been removed from the auxiliary_metadata table in the idc_v18 BQ dataset. These columns were duplicates of columns collection_name and collection_id respectively.

    v17 - December 2023

    New radiology collections

    New analysis results

    1. Collections analyzed:

    Revised radiology collections

    New clinical metadata tables

    v16 - September 2023

    New radiology collections

    New pathology collections

    Revised radiology collections

    1. (TCIA description: (Repair of DICOM tag(0008,0005) to value "ISO_IR 100" in 79 series)

    2. (Revised because results from CPTAC-CRCC-Tumor-Annotations were added)

    3. (Revised because results from CPTAC-UCEC-Tumor-Annotations were added)

    New analysis results

    New clinical metadata tables

    v15 - July 2023

    New radiology collections

    New pathology collections

    1. (ICDC-Glioma radiology added in a previous version)

    Revised radiology collections

    1. (TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)

    2. (“TCIA description: Radiology modality data cleanup to remove extraneous scans.”)

    3. (TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)

    Revised pathology collections

    1. (11 pathology-only patients removed at request of data owner)

    2. (1 pathology-only patient removed at request of data owner)

    New analysis results

    1. (Analysis of NLST and NSCLC-Radiomics)

    Revised analysis results

    1. (Annotations of NLST and NSCLC-Radiomics radiology)

    New clinical metadata tables

    v14 - May 2023

    This release does not introduce any new data, but changes the bucket organization and introduces replication of IDC files in Amazon AWS storage buckets, as described in .

    v13 - Mar 2023

    New analysis results collection:

    New clinical data collections:

    v12 - Nov 2022

    New collections:

    Updated collections:

    Other:

    Metadata corresponding to "limited" access collections are removed.

    New clinical data collections:

    Other clinical data updates:

    Limited access collections are removed. Clinical metadata for the COVID-19-NY-SUB and ACRIN 6698/I-SPY2 Breast DWI collections now includes information ingested from data dictionaries associated with these collections. In v11 the string value 'NA' was being changed to null during the ETL process for some columns/collections. This is now fixed in v12 and the value 'NA' is preserved.

    v11 - Sept 2022

    This release introduces clinical data ingested for a subset of collections, and now available via a dedicated BigQuery dataset.

    New collections:

    v10 - Aug 2022

    In this release we introduce a new HTAN program including currently three collections release by the .

    New collections:

    Updated collections:

    CPTAC, TCGA and NLST collections have been reconverted due to a technical issue identified with a subset of images included in v9.

    1. *

    Note that the TCGA-KIRP and TCGA-BRCA collections (marked with the asterisk in the list above) are currently missing SM high resolution layer files/instances due to a of Google Healthcare that makes it not possible to ingest datasets that exceed some internal limits. Specifically, the following patient/studies are affected:

    • TCGA-KIRP: PatientID TCGA-5P-A9KA, StudyInstanceUID 2.25.191236165605958868867890945341011875563

    • TCGA-BRCA: PatientID TCGA-OL-A66H, StudyInstanceUID 2.25.82800314486527687800038836287574075736 The affected files will be included in IDC when the infrastructure limitation is addressed.

    Collection access level change:

    1. is now available as public access collection

    v9 - May 2022

    This data release introduces the concept of differential license to IDC: some of the collections maintained by IDC contain items that have different licenses. As an example, radiology component of the TCGA-GBM collection is covered by the TCIA limited access license, and is not available in IDC, while the digital pathology component is covered by CC-BY. With this release, we complete sharing in full of the digital pathology component of the datasets released by the CPTAC and TCGA programs.

    New collections:

    Updated collections:

    v8 - April 2022

    The main highlight of this release is the addition of the NLST and TCGA Slide Microscopy imaging data. New TCGA content includes introduction of new (to IDC) TCGA collections that have only slide microscopy component, and addition of the slide microscopy component to those IDC collections that were available earlier and included only the radiology component.

    New collections

    1. TCGA-DLBC (TCGA-DLBC collection does not have a description page)

    Updated collections

    v7 - February 2022

    The main highlight of this release is the addition of the Slide Microscopy imaging component to the remaining CPTAC collections.

    New collections

    Updated collections

    v6 - January 2022

    The following collections became limited access due to the , which is the original source of those collections.

    Original collections:

    Analysis results collections:

    v5 - December 2021

    New collections:

    New analysis results collections:

    • Outcome Prediction in Patients with Glioblastoma by Using Imaging, Clinical, and Genomic Biomarkers: Focus on the Nonenhancing Component of the Tumor ()

    • DICOM-SEG Conversions for TCGA-LGG and TCGA-GBM Segmentation Datasets ()

    Updated collections:

    v4 - September 2021

    is added. The data included consists of the following components:

    1) CT images available as any other imaging collection (via IDC Portal, BigQuery metadata tables, and storage buckets);

    2) a subset of clinical data available in the BigQuery tables starting with nlst_ under the idc_v4 dataset, as documented in the section.

    3) One instance is missing from patient/study/series: 126153/1.2.840.113654.2.55.319335498043274792486636919135185299851/1.2.840.113654.2.55.262421043240525317038356381369289737801

    4) Three instances are missing from patient/study/series: 215303/1.3.6.1.4.1.14519.5.2.1.7009.9004.337968382369511017896638591276/1.3.6.1.4.1.14519.5.2.1.7009.9004.180224303090109944523368212991

    v3 - August 2021

    The following radiology collections were updated to include DICOM Slide Microscopy (SM) images converted from the original vendor-specific representation into .

    The DICOM Slide Microscopy (SM) images included in the collections above in IDC are not available in TCIA. TCIA only includes images in the vendor-specific SVS format!

    v2 - June 2021

    Listed below are all of the and collections of currently hosted by IDC, with the links to the Digital Object Identifiers (DOIs) of those collections.

    New original collections:

    New analysis results collections:

    v1 - October 2020

    Listed below are all of the and collections of currently hosted by IDC, with the links to the Digital Object Identifiers (DOIs) of those collections.

    Original collections included:

    Analysis collections included:

    1. (only items corresponding to the LIDC-IDRI original collection are included)

    2. (only items corresponding to the ISPY1 original collection are included)

    Collections: 161 (no change)

    Analysis results collections: 23 (+6)

    Disk size: 95.33 TB (+2.22 TB)

    NLSTSeg Collections analyzed

    1. NLST

  • PROSTATEx-Targets Collections analyzed:

    1. ProstateX

  • TCGA-GBM360 Collections analyzed:

    1. TCGA-GBM

  • TCGA-SBU-TIL-Maps Collections analyzed:

    1. TCGA-ACC

    2. TCGA-BLCA

    3. TCGA-BRCA

  • CMB-MML
  • CMB-OV

  • CMB-PCA

  • VAREPOP-APOLLO

  • CPTAC-CCRCC

  • CPTAC-LUAD

  • VAREPOP-APOLLO

  • Collections: 161 (+11)

    Analysis results collections: 17 (no change)

    Disk size: 93.11 TB (+5.62 TB)

    CT4Harmonization-Multicentric
  • QIBA-CT-Liver-Phantom

  • QIN-BREAST-02

  • QIN PET Phantom

  • RIDER PHANTOM MRI

  • RIDER PHANTOM PET-CT

  • CMB-LCA
  • CMB-MEL

  • CMB-MML

  • CMB-OV

  • CMB-PCA

  • CPTAC-CCRCC

  • CPTAC-LUAD

  • VAREPOP-APOLLO

  • CMB-GEC
  • CMB-LCA

  • CMB-MEL

  • CMB-MML

  • CMB-OV

  • CMB-PCA

  • NLST

  • cbis_ddsm_mass_case_description_train_set
  • cc_radiomics_phantom_3_chest_settings

  • cc_radiomics_phantom_3_head_settings

  • cc_radiomics_phantom_3_manufacturer

  • Studies: 149,577 (+2,070)

    Cases: 71,082 (+1,893)

    Collections: 150 (+1)

    Analysis results collections: 17 (no change)

    Disk size: 87.49 TB (+1.94 TB)

    CPTAC-CCRCC
  • CPTAC-AML

  • CPTAC-PDA

  • bamf_aimi_annotations_liver2_ct_qa_results
  • bamf_aimi_annotations_liver_ct_qa_results

  • bamf_aimi_annotations_liver_mr_qa_results

  • bamf_aimi_annotations_lung2_ct_qa_results

  • bamf_aimi_annotations_lung_ct_qa_results

  • bamf_aimi_annotations_lung_fdg_pet_ct_qa_results

  • bamf_aimi_annotations_prostate_mr_qa_results

  • cptac_aml_demographic_classification

  • varepop_apollo_clinical

  • anti_pd_1_lung_bamf_lung_fdg_pet_ct_segmenation

    1. Subsumed by bamf_aimi_annotations_lung_fdg_pet_ct_qa_results

  • lung_pet_ct_dx_bamf_lung_ct_segmentation

    1. Subsumed by bamf_aimi_annotations_lung_ct_qa_results

  • lung_pet_ct_dx_bamf_lung_fdg_pet_ct_segmenation

    1. Subsumed by bamf_aimi_annotations_lung_fdg_pet_ct_qa_results

  • nsclc_radiogenomics_bamf_lung_ct_segmentation

    1. Subsumed by bamf_aimi_annotations_lung_ct_qa_results

  • nsclc_radiogenomics_bamf_lung_fdg_pet_ct_segmenation

    1. Subsumed by bamf_aimi_annotations_lung_fdg_pet_ct_qa_results

  • prostatex_bamf_segmentations

    1. Subsumed by bamf_aimi_annotations_prostate_mr_qa_results

  • qin_breast_bamf_breast_segmentation

    1. Subsumed by bamf_aimi_annotations_breast_fdg_pet_ct_qa_results

  • rider_lung_pet_ct_bamf_lung_ct_segmentation

    1. Subsumed by bamf_aimi_annotations_lung_ct_qa_results

  • rider_lung_pet_ct_bamf_lung_fdg_pet_ct_segmenation

    1. Subsumed by bamf_aimi_annotations_lung_fdg_pet_ct_qa_results

  • tcga_kirc_bamf_kidney_segmentation

    1. Subsumed by bamf_aimi_annotations_kidney_ct_qa_results

  • tcga_lihc_bamf_liver_ct_segmentation

    1. Subsumed by bamf_aimi_annotations_liver_ct_qa_results

  • tcga_lihc_bamf_liver_mr_segmentation

    1. Subsumed by amf_aimi_annotations_liver_mr_qa_results

  • tcga_luad_bamf_lung_ct_segmentation

    1. Subsumed by bamf_aimi_annotations_lung_ct_qa_results

  • tcga_luad_bamf_lung_mr_segmentation

    1. Subsumed by bamf_aimi_annotations_lung_fdg_pet_ct_qa_results

  • tcga_lusc_lung_ct_segmentation

    1. Subsumed by bamf_aimi_annotations_lung_ct_qa_results

  • tcga_lusc_lung_mr_segmentation

    1. Subsumed by bamf_aimi_annotations_lung_fdg_pet_ct_qa_results

  • CMB-MEL
  • CMB-MML

  • CMB-PCA

  • CMB-LCA
  • CMB-MEL

  • CMB-MML

  • CMB-PCA

  • TCGA-BRCA

  • TCGA-CESC

  • TCGA-COAD

  • TCGA-GBM

  • TCGA-LUAD

  • TCGA-LUSC

  • TCGA-PAAD

  • TCGA-PRAD

  • TCGA-READ

  • TCGA-SKCM

  • TCGA-STAD

  • TCGA-UCEC

  • TCGA-UVM

  • The segmentation of an instance in each of the following series was excluded due to having a DICOM PixelData size greater than or equal to 2GB:

    1. 1.2.826.0.1.3680043.10.511.3.10544506665348704312902213950958190

    2. 1.2.826.0.1.3680043.10.511.3.11183783347037364699862133130586654

    3. 1.2.826.0.1.3680043.10.511.3.11834745481756047014039855874680259

    4. 1.2.826.0.1.3680043.10.511.3.11901667084519361717338400810055642

    5. 1.2.826.0.1.3680043.10.511.3.12041600048156613329793822566495651

    6. 1.2.826.0.1.3680043.10.511.3.12718116375608495830041119776887887

    7. 1.2.826.0.1.3680043.10.511.3.13386724401829265460622415500801368

    8. 1.2.826.0.1.3680043.10.511.3.14042734131864468280344737986870899

    9. 1.2.826.0.1.3680043.10.511.3.17374765903080083648409690755539184

    10. 1.2.826.0.1.3680043.10.511.3.17429002643681869326389465422353495

    11. 1.2.826.0.1.3680043.10.511.3.20359930476040698387716730891020638

    12. 1.2.826.0.1.3680043.10.511.3.28397033639127902823368316410884210

    13. 1.2.826.0.1.3680043.10.511.3.28425539132321749931109935391487352

    14. 1.2.826.0.1.3680043.10.511.3.34574227972763695321794092913087775

    15. 1.2.826.0.1.3680043.10.511.3.36216094237641867532902805456135029

    16. 1.2.826.0.1.3680043.10.511.3.39533936694797964318706337783276378

    17. 1.2.826.0.1.3680043.10.511.3.39900930856460689132625586523683939

    18. 1.2.826.0.1.3680043.10.511.3.41633795217567037218184715094985555

    19. 1.2.826.0.1.3680043.10.511.3.42218106649761752724553401155203874

    20. 1.2.826.0.1.3680043.10.511.3.49098870621170235412220976183110770

    21. 1.2.826.0.1.3680043.10.511.3.50064322235999800062455171235601125

    22. 1.2.826.0.1.3680043.10.511.3.50905421517530127976832505410705816

    23. 1.2.826.0.1.3680043.10.511.3.62935684444056080516153739948364303

    24. 1.2.826.0.1.3680043.10.511.3.73572792121235596011940904319511291

    25. 1.2.826.0.1.3680043.10.511.3.74494366757564543824303304482444570

    26. 1.2.826.0.1.3680043.10.511.3.79988146996803179892075404247166692

    27. 1.2.826.0.1.3680043.10.511.3.80004293150506819482091023564947091

    28. 1.2.826.0.1.3680043.10.511.3.82774274518897141254234567300292686

    29. 1.2.826.0.1.3680043.10.511.3.84202416467561501610598853920808906

    30. 1.2.826.0.1.3680043.10.511.3.86214492184712627544696209982376598

    31. 1.2.826.0.1.3680043.10.511.3.90193069664920622990317347485104073

    32. 1.2.826.0.1.3680043.10.511.3.95666157880521064637011880609274546

    33. 1.2.826.0.1.3680043.10.511.3.96676982370873257329281821215166082

    34. 1.2.826.0.1.3680043.10.511.3.98258035017480972315346136181769675

  • RMS-Mutation-Prediction-Expert-Annotations

    WARNING: After the release of v20, it was discovered that a mistake had been made during data conversion that affected the newly-released segmentations accompanying the "RMS-Mutation-Prediction" collection. Segmentations released in v20 for this collection have the segment labels for alveolar rhabdomyosarcoma (ARMS) and embryonal rhabdomyosarcoma (ERMS) switched in the metadata relative to the correct labels. Thus segment 3 in the released files is labelled in the metadata (the SegmentSequence) as ARMS but should correctly be interpreted as ERMS, and conversely segment 4 in the released files is labelled as ERMS but should be correctly interpreted as ARMS. We apologize for the mistake and any confusion that it has caused, and will be releasing a corrected version of the files in the next release as soon as possible. Collections analyzed:

    1. RMS-Mutation-Prediction

  • CMB-LCA
  • CMB-MEL

  • CMB-MML

  • CMB-PCA

  • GTEx

  • TCGA-BRCA

  • TCGA-CESC

  • TCGA-COAD

  • TCGA-GBM

  • TCGA-LUAD

  • TCGA-LUSC

  • TCGA-PAAD

  • TCGA-PRAD

  • TCGA-READ

  • TCGA-SKCM

  • TCGA-STAD

  • TCGA-UCEC

  • TCGA-UVM

  • CMB-LCA
  • CMB-MEL

  • CMB-MML

  • CMB-PCA

  • CPTAC-CCRCC

  • CPTAC-LSCC

  • CPTAC-UCEC

  • NLM-Visible-Human-Project

  • RIDER Lung CT

  • CPTAC-CCRCC

  • Duke-Breast-Cancer-MRI

  • HCC-TACE-Seg

  • Lung-PET-CT-Dx

  • NLST

  • NSCLC Radiogenomics

  • Prostate-MRI-US-Biopsy

  • PROSTATEx

  • QIN-BREAST

  • QIN LUNG CT

  • RIDER Lung PET-CT

  • SPIE-AAPM Lung CT Challenge

  • TCGA-KICH

  • TCGA-KIRC

  • TCGA-KIRP

  • TCGA-LIHC

  • TCGA-LUAD

  • TCGA-LUSC

  • UPENN-GBM

  • acrin_contralateral_breast_mr_IA
  • acrin_contralateral_breast_mr_IM

  • acrin_contralateral_breast_mr_IS

  • acrin_contralateral_breast_mr_KS

  • acrin_contralateral_breast_mr_MS

  • acrin_contralateral_breast_mr_M4

  • acrin_contralateral_breast_mr_P8

  • acrin_contralateral_breast_mr_PA

  • acrin_contralateral_breast_mr_PD

  • acrin_contralateral_breast_mr_PE

  • acrin_contralateral_breast_mr_PR

  • acrin_contralateral_breast_mr_QA

  • advanced_mri_breast_lesions_clinical

  • upenn_gbm

  • Removed corrupted instances

    1. SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.2164023716.1899467316.1685791236516.37.0

    2. SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.2411736851.773458418.1686038949651.37.0

    3. SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.2411736851.773458418.16860389

  • TCGA-BLCA (All TCGA revisions are to correct multiple manufacturer values within same series)

  • TCGA-BRCA

  • TCGA-CHOL

  • TCGA-COAD

  • TCGA-DLBC (No description page)

  • TCGA-ESCA

  • TCGA-HNSC

  • TCGA-KIRC

  • TCGA-KIRP

  • TCGA-LIHC

  • TCGA-LUAD

  • TCGA-LUSC

  • TCGA-PAAD

  • TCGA-PRAD

  • TCGA-READ

  • TCGA-SARC

  • TCGA-SKCM

  • TCGA-STAD

  • TCGA-TGCT

  • TCGA-THCA

  • TCGA-THYM

  • TCGA-UCEC

  • TCGA-UCS

  • lung_pet_ct_dx_bamf_lung_fdg_pet_ct_segmentation
  • nsclc_radiogenomics_bamf_lung_ct_segmentation

  • nsclc_radiogenomics_bamf_lung_fdg_pet_ct_segmentation

  • prostatex_bamf_segmentations

  • qin_breast_bamf_breast_segmentation

  • rider_lung_pet_ct_bamf_lung_ct_segmentation

  • rider_lung_pet_ct_bamf_lung_fdg_pet_ct_segmentation

  • tcga_kirc_bamf_kidney_segmentation

  • tcga_lihc_bamf_liver_ct_segmentation

  • tcga_lihc_bamf_liver_mr_segmentation

  • tcga_luad_bamf_lung_ct_segmentation

  • tcga_luad_bamf_lung_mr_segmentation

  • tcga_lusc_bamf_lung_ct_segmentation

  • tcga_lusc_bamf_lung_mr_segmentation

  • Vestibular-Schwannoma-MC-RC

    LUNG-PET-CT-Dx

  • NSCLC Radiogenomics

  • ProstateX

  • QIN-Breast

  • RIDER Lung PET-CT

  • TCGA-KIRC

  • TCGA-LIHC

  • TCGA-LUAD

  • TCGA-LUSC

  • Prostate-MRI-US-Biopsy-DICOM-Annotations Collections analyzed:

    1. Prostate-MRI-US-Biopsy

  • CMB-MEL

  • CMB-MML

  • CMB-PCA

  • CPTAC-CCRCC

  • CPTAC-PDA

  • ea1141_status_12mo

  • ea1141_status_6mo

  • ea1141_tomosynthesis

  • htan_ohsu_demographics

  • htan_vanderbilt_demographics

  • htan_vanderbilt_diagnosis

  • htan_vanderbilt_exposure

  • htan_vanderbilt_familyhistory

  • htan_vanderbilt_followup

  • htan_vanderbilt_moleculartest

  • htan_vanderbilt_therapy

  • remind_clinical

  • CPTAC-PDA (Revised because results from CPTAC-PDA-Tumor-Annotations were added)

    htan_hms_followup

  • htan_hms_moleculartheraphy

  • htan_ohsu_demographics

  • htan_ohsu_diagnosis

  • htan_ohsu_exposure

  • htan_ohsu_familyhistory

  • htan_ohsu_followup

  • htan_ohsu_moleculartheraphy

  • htan_wustl_demographics

  • htan_wustl_diagnosis

  • htan_wustl_exposure

  • htan_wustl_familyhistory

  • htan_wustl_followup

  • htan_wustl_moleculartheraphy

  • rms_mutation_prediction_demographics

  • rms_mutation_prediction_diagnosis

  • rms_mutation_prediction_sample

  • Prostate-Anatomical-Edge-Cases

  • RIDER Pilot

  • CPTAC-LUAD (TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
  • CPTAC-PDA (TCIA description: TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)

  • CPTAC-SAR (TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)

  • CPTAC-UCEC (TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)

  • CT Lymph Nodes (TCIA description: “Added DICOM version of MED_ABD_LYMPH_MASKS.zip segmentations that were previously available”)

  • RIDER Lung CT (Revised because QIBA-VolCT-1B analysis results were added)

  • NLST (Revised because analysis results from nnU-Net-BPR-Annotations were revised)

  • NSCLC-Radiomics (Revised because analysis results from nnU-Net-BPR-Annotations were revised)

  • nlst_clinical

  • nlst_ctab

  • nlst_ctabc

  • nlst_prsn

  • nlst_screen

  • Pancreatic-CT-CBCT-SEG

    CMB-MML

  • CMB-PCA

  • GBM-DSC-MRI-DRO

  • HCC-TACE-Seg

  • PDMR-521955-158-R4

  • ISPY2

  • UPENN-GBM

  • CPTAC-COAD

  • CPTAC-GBM

  • CPTAC-HNSCC

  • CPTAC-LSCC

  • CPTAC-LUAD

  • CPTAC-OV

  • CPTAC-PDA

  • CPTAC-SAR

  • CPTAC-UCEC

  • Duke-Breast-Cancer-MRI

  • NLST

  • TCGA-ACC

  • TCGA-BLCA

  • TCGA-BRCA

  • TCGA-BRCA

  • TCGA-CESC

  • TCGA-CHOL

  • TCGA-COAD

  • TCGA-DLBC

  • TCGA-ESCA

  • TCGA-GBM

  • TCGA-GBM

  • TCGA-HNSC

  • TCGA-KICH

  • TCGA-KIRC

  • TCGA-KIRP *

  • TCGA-LGG

  • TCGA-LGG

  • TCGA-LIHC

  • TCGA-LUAD

  • TCGA-LUSC

  • TCGA-MESO

  • TCGA-OV

  • TCGA-PAAD

  • TCGA-PCPG

  • TCGA-PRAD

  • TCGA-READ

  • TCGA-SARC

  • TCGA-SKCM

  • TCGA-STAD

  • TCGA-TGCT

  • TCGA-THCA

  • TCGA-THYM

  • TCGA-UCEC

  • TCGA-UCS

  • TCGA-UVM

  • TCGA-GBM

  • TCGA-HNSC

  • TCGA-LGG

  • TCGA-PAAD

  • TCGA-PCPG

  • TCGA-SKCM

  • TCGA-TGCT

  • TCGA-THYM

  • TCGA-UCS

  • TCGA-UVM

  • TCGA-CESC

  • TCGA-COAD

  • TCGA-ESCA

  • TCGA-KICH

  • TCGA-KIRC

  • TCGA-KIRP

  • TCGA-LIHC

  • TCGA-LUAD

  • TCGA-LUSC

  • TCGA-OV

  • TCGA-PRAD

  • TCGA-READ

  • TCGA-SARC

  • TCGA-STAD

  • TCGA-THCA

  • TCGA-UCEC

  • CPTAC-AML

  • CPTAC-BRCA

  • CPTAC-COAD

  • CPTAC-OV

  • Pancreatic-CT-CBCT-SEG

  • Pediatric-CT-SEG

  • CPTAC-PDA

  • CPTAC-SAR

  • CPTAC-UCEC

  • Anti-PD-1_MELANOMA

  • Brain-Tumor-Progression

  • CPTAC-GBM

  • CPTAC-HNSCC

  • HEAD-NECK-RADIOMICS-HN1

  • HNSCC

  • HNSCC-3DCT-RT

  • Head-Neck Cetuximab

  • Head-Neck-PET-CT

  • IvyGAP

  • LGG-1p19qDeletion

  • MRI-DIR

  • OPC-Radiomics

  • QIN GBM Treatment Response

  • QIN-BRAIN-DSC-MRI

  • QIN-HEADNECK

  • REMBRANDT

  • RIDER NEURO MRI

  • TCGA-GBM

  • TCGA-HNSC

  • TCGA-LGG

  • Vestibular-Schwannoma-SEG

  • ACRIN-HNSCC-FDG-PET-CT

  • Duke-Breast-Cancer-MRI

  • NSCLC Radiogenomics

  • QIN-HEADNECK

  • Pseudo-PHI-DICOM-Data

  • Prostate Fused-MRI-Pathology

  • APOLLO

  • LGG-1p19qDeletion

  • Soft-tissue-Sarcoma

  • NSCLC-Radiomics-Genomics

  • Brain-Tumor-Progression

  • Head-Neck Cetuximab

  • CPTAC-GBM

  • CPTAC-SAR

  • CPTAC-LUAD

  • CPTAC-LSCC

  • Head-Neck-PET-CT

  • C4KC-KiTS

  • Breast-MRI-NACT-Pilot

  • 4D-Lung

  • Mouse-Mammary

  • CT Lymph Nodes

  • HNSCC

  • Breast-Cancer-Screening-DBT

  • MRI-DIR

  • Lung-PET-CT-Dx

  • NSCLC-RADIOMICS-INTEROBSERVER1

  • PDMR-BL0293-F563

  • CT COLONOGRAPHY

  • Phantom FDA

  • QIN-PROSTATE-Repeatability

  • PROSTATEx

  • AAPM-RT-MAC

  • ICDC-Glioma

  • RIDER Breast MRI

  • Anti-PD-1_MELANOMA

  • COVID-19-AR

  • PROSTATE-MRI

  • NaF PROSTATE

  • Mouse-Astrocytoma

  • ACRIN-DSC-MR-Brain

  • ACRIN-NSCLC-FDG-PET

  • QIN Breast DCE-MRI

  • RIDER NEURO MRI

  • MIDRC-RICORD-1A

  • MIDRC-RICORD-1C

  • REMBRANDT

  • NSCLC Radiogenomics

  • HNSCC-3DCT-RT

  • VICTRE

  • CPTAC-CM

  • CPTAC-PDA

  • CPTAC-UCEC

  • CPTAC-CCRCC

  • CPTAC-HNSCC

  • OPC-Radiomics

  • Vestibular-Schwannoma-SEG

  • SPIE-AAPM Lung CT Challenge

  • Lung Phantom

  • Pseudo-PHI-DICOM-Data

  • Pancreas-CT

  • QIN GBM Treatment Response

  • Pelvic-Reference-Data

  • Lung-Fused-CT-Pathology

  • Anti-PD-1_Lung

  • BREAST-DIAGNOSIS

  • RIDER Lung PET-CT

  • RIDER Lung CT

  • PDMR-292921-168-R

  • PDMR-833975-119-R

  • PDMR-997537-175-T

  • LCTSC

  • Prostate-3T

  • ACRIN-FLT-Breast

  • ACRIN-FMISO-Brain

  • PDMR-425362-245-T

  • Prostate-MRI-US-Biopsy

  • MIDRC-RICORD-1B

  • DRO-Toolkit

  • TCGA-LUSC

  • TCGA-KIRP

  • TCGA-THCA

  • TCGA-SARC

  • TCGA-ESCA

  • TCGA-CESC

  • TCGA-STAD

  • TCGA-COAD

  • TCGA-KICH

  • TCGA-READ

  • TCGA-LUAD

  • TCGA-LIHC

  • TCGA-BRCA

  • TCGA-OV

  • TCGA-KIRC

  • TCGA-LGG

  • TCGA-GBM

  • ISPY1 (ACRIN 6657)

  • QIN-HeadNeck

  • LIDC-IDRI

  • NSCLC-Radiomics

  • Lung-PET-CT-Dx-Annotations
    NLST-Sybil
    CMB-BRCA
    CMB-CRC
    CMB-LCA
    CMB-MEL
    BoneMarrowWSI-PediatricLeukemia
    CCDI-MCI
    TCGA-BRCA
    CBIS-DDSM
    CC-Radiomics-Phantom
    CC-Radiomics-Phantom-2
    CC-Radiomics-Phantom-3
    BoneMarrowWSI-PediatricLeukemia
    CMB-AML
    CMB-BRCA
    CMB-CRC
    CMB-GEC
    CCDI-MCI
    CMB-AML
    CMB-BRCA
    CMB-CRC
    CMB-BRCA
    VAREPOP-APOLLO
    CMB-LCA
    CMB-MEL
    CMB-MML
    CMB-PCA
    CCDI-MCI
    RMS-Mutation-Prediction-Expert-Annotations
    Mediastinal-Lymph-Node-SEG
    Spine-Mets-CT-SEG
    CMB-BRCA
    CMB-OV
    CMB-AML
    CMB-CRC
    CMB-GEC
    CMB-LCA
    CCDI-MCI
    CMB-AML
    CMB-CRC
    CMB-GEC
    BAMF-AIMI-Annotations
    UPENN-GBM
    Pan-Cancer-Nuclei-Seg-DICOM
    TCGA-BLCA
    mediastinal_lymph_node_seg_clinical
    spine_mets_ct_seg_clinical
    CCDI-MCI
    CMB-AML
    CMB-CRC
    CMB-GEC
    Pancreas-CT-SEG
    Pancreas-CT
    Pan-Cancer-Nuclei-Seg-DICOM
    TCGA-BLCA
    Advanced-MRI-Breast-Lesions
    CMB-AML
    CMB-CRC
    CMB-GEC
    BAMF-AIMI-Annotations
    ACRIN-NSCLC-FDG-PET
    Anti-PD-1_Lung
    Colorectal-Liver-Metastases
    acrin_contralateral_breast_mr_A0
    acrin_contralateral_breast_mr_AB
    acrin_contralateral_breast_mr_F1
    acrin_contralateral_breast_mr_I1
    Advanced-MRI-Breast-Lesions
    RMS-Mutation-Prediction-Expert-Annotations
    RMS-Mutation-Prediction
    TotalSegmentator-CT-Segmentations
    NLST
    Breast-Cancer-Screening-DBT
    NLST
    CPTAC-BRCA
    CPTAC-COAD
    RMS-Mutation-Prediction
    acrin_nsclc_fdg_pet_bamf_lung_pet_ct_segmentation
    anti_pd_1_lung_bamf_lung_ct_segmentation
    anti_pd_1_lung_bamf_lung_fdg_pet_ct_segmentation
    lung_pet_ct_dx_bamf_lung_ct_segmentation
    CMB-AML
    CT-Phantom4Radiomics
    EA1141
    ReMIND
    BAMF-AIMI-Annotations
    ACRIN-NSCLC-FDG-PET
    Anti-PD-1-Lung
    Prostate-MRI-US-Biopsy
    CMB-CRC
    CMB-GEC
    CMB-LCA
    ea1141_demographics
    ea1141_mri
    ea1141_risk_model
    ea1141_screening
    PDMR-Texture-Analysis
    RMS-Mutation-Prediction
    Breast-MRI-NACT-Pilot
    CPTAC-CRCC
    CPTAC-UCEC
    CPTAC-CRCC-Tumor-Annotations
    CPTAC-UCEC-Tumor-Annotations
    CPTAC-PDA-Tumor-Annotations
    htan_hms_demographics
    htan_hms_diagnosis
    htan_hms_exposure
    htan_hms_familyhistory
    Adrenal-ACC-Ki67-Seg
    CC-Tumor-Heterogeneity
    Colorectal-Liver-Metastases
    NLM-Visible-Human-Project
    HTAN-VANDERBILT
    ICDC-Glioma
    CPTAC-CCRCC
    CPTAC-CM
    CPTAC-LSCC
    CPTAC-GBM
    CPTAC-SAR
    QIBA-VolCT-1B
    nnU-Net-BPR-Annotations
    adrenal_acc_ki67_seg_clinical
    cc_tumor_heterogeneity_clinical
    colorectal_liver_metastases_clinical
    duke_breast_cancer_mri_clinical
    this section
    nnU-Net-BPR-annotations
    PROSTATEx
    CT-vs-PET-Ventilation-Imaging
    CTpred-Sunitinib-panNET
    CMB-CRC
    CMB-LCA
    CMB-MEL
    CMB-PCA
    CTpred-Sunitinib-panNET
    CMB-CRC
    CMB-GEC
    CMB-LCA
    CMB-MEL
    Human Tumor Atlas Network
    ACRIN-6698
    HTAN-HMS
    HTAN-OHSU
    HTAN-WUSTL
    CPTAC-AML
    CPTAC-BRCA
    CPTAC-CCRCC
    CPTAC-CM
    known limitation
    Vestibular-Schwannoma-SEG
    ACRIN-Contralateral-Breast-MR
    StageII-Colorectal-CT
    B-mode-and-CEUS-Liver
    CPTAC-GBM
    CPTAC-HNSCC
    Pediatric-CT-SEG
    TCGA-ACC
    TCGA-CHOL
    TCGA-MESO
    NLST
    TCGA-BLCA
    TCGA-BRCA
    TCGA-BRCA
    APOLLO-5-ESCA
    APOLLO-5-LUAD
    APOLLO-5-PAAD
    APOLLO-5-THYM
    CPTAC-CCRCC
    CPTAC-CM
    CPTAC-LSCC
    CPTAC-LUAD
    change in policy by TCIA
    AAPM-RT-MAC
    ACRIN-DSC-MR-Brain
    ACRIN-FMISO-Brain
    ACRIN-HNSCC-FDG-PET-CT
    DICOM-SEG Conversions for TCGA-LGG and TCGA-GBM Segmentation Datasets
    Outcome Prediction in Patients with Glioblastoma by Using Imaging, Clinical, and Genomic Biomarkers: Focus on the Nonenhancing Component of the Tumor
    COVID-19-NY-SBU
    B-mode-and-CEUS-Liver
    APOLLO-5-LSCC
    CMMD
    GBM-MR-NER-Outcomes
    DICOM-Glioma-SEG
    TCGA-GBM
    TCGA-LGG
    QIN-HEADNECK
    Breast-Cancer-Screening-DBT
    National Lung Screening Trial (NLST) collection
    Collection-specific BigQuery Tables
    dual personality DICOM-TIFF format
    CPTAC-LUAD
    CPTAC-LSCC
    original
    analysis results
    The Cancer Imaging Archive
    IvyGAP
    QIN LUNG CT
    LungCT-Diagnosis
    HEAD-NECK-RADIOMICS-HN1
    PROSTATEx Zone Segmentations
    High Resolution Prostate Segmentations for the ProstateX-Challenge
    RIDER Lung CT Segmentation Labels from: Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach
    original
    analysis results
    The Cancer Imaging Archive
    TCGA-PRAD
    TCGA-BLCA
    TCGA-UCEC
    TCGA-HNSC
    Standardized representation of the TCIA LIDC-IDRI annotations using DICOM
    QIN multi-site collection of Lung CT data with Nodule Segmentations
    DICOM SR of clinical data and measurement for breast cancer collections to TCIA
    TCGA-CESC
    TCGA-COAD
    TCGA-ESCA
    TCGA-HNSC
    TCGA-KIRC
    TCGA-LUAD
    TCGA-LUSC
    TCGA-MESO
    TCGA-OV
    TCGA-PAAD
    TCGA-PRAD
    TCGA-READ
    TCGA-SARC
    TCGA-SKCM
    TCGA-STAD
    TCGA-TGCT
    TCGA-THYM
    TCGA-UCEC
    TCGA-UVM