>85 TB of data: IDC contains radiology, brightfield (H&E) and fluorescence slide microscopy images, along with image-derived data (annotations, segmentations, quantitative measurements) and accompanying clinical data
free: all of the data in IDC is publicly available: no registration, no access requests
commercial-friendly: >95% of the data in IDC is covered by the permissive CC-BY license, which allows commercial reuse (a small subset of the data is covered by the CC-NC license); each file in IDC is tagged with its license to make it easier for you to understand and follow the rules
cloud-based: all of the data in IDC is available from both Google and AWS public buckets: fast and free to download, no out-of-cloud egress fees
harmonized: all of the images and image-derived data in IDC is harmonized into standard DICOM representation
IDC is as much about data as it is about what you can do with the data! We maintain and actively develop a variety of tools that are designed to help you efficiently navigate, access and analyze IDC data:
visualization: examine images and image-derived annotations and analysis results from the convenience of your browser using integrated OHIF, VolView and Slim open source viewers
cohort building: use rich and extensive metadata to build subsets of data programmatically using idc-index or BigQuery SQL
download: use your favorite S3 API client or idc-index to efficiently fetch any of the IDC files from our public buckets
If you need support with IDC or have any questions, please open a new topic on the IDC Discourse forum (preferred) or send email to support@canceridc.dev.
Would you rather discuss your questions in a meeting with an expert from the IDC team? Book a 1-on-1 support session here:
Imaging Data Commons (IDC) is a cloud-based environment containing publicly available cancer imaging data co-located with analysis and exploration tools. IDC is a node within the broader NCI Cancer Research Data Commons (CRDC) infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data.
exploration: start with the IDC Portal to get an idea of the data available
programmatic access: use idc-index to perform search, download and other operations programmatically
analysis: conveniently access IDC files and metadata from tools that are cloud-native, such as Google Colab or Looker; fetch IDC data directly into 3D Slicer using SlicerIDCBrowser
We ingest and distribute datasets from a variety of sources and contributors, primarily focusing on large data collection initiatives sponsored by the US National Cancer Institute.
On ingestion, whenever data is represented in a non-DICOM format, we harmonize images and image-derived data into DICOM for interoperability.
Upon conversion, the data undergoes Extract-Transform-Load (ETL), which extracts DICOM metadata to make the data searchable, ingests the DICOM files into public S3 storage buckets and a DICOMweb store. Once the data is released, we provide various interfaces to access data and metadata.
We are actively developing a variety of capabilities to make it easier for users to work with the data in IDC; examples of those tools are described below.
We welcome you to apply to contribute analysis results and annotations of the images available in IDC! These can be expert manual annotations, analysis results generated using AI tools, segmentations, contours, metadata attributes describing the data (e.g., annotation of the scan type), or expert evaluation of the quality of existing AI-generated annotations in IDC.
If your contribution is accepted by the IDC stakeholders:
we will work with you to choose the appropriate DICOM object type for your data and convert it into DICOM representation
once published in IDC, your data will become searchable and viewable in the IDC Portal, making it easier for the users of your data to discover and work with it
your files can be downloaded efficiently using the S3 interface and idc-index
At this time, we do not have resources to prioritize receipt of imaging data from individual PIs (but we encourage submissions of annotations/analysis results for existing IDC data!). Nevertheless, if you feel you might have a compelling dataset, please email us at support@canceridc.dev.
IDC Portal provides an interactive browser-based interface for exploration of IDC data
we are the maintainers of Slim - an open-source viewer of DICOM digital pathology images; Slim is integrated with the IDC Portal for visualizing pathology images and image-derived data available in IDC
we are actively contributing to the OHIF Viewer, and rely on it for visualizing radiology images and image-derived data
idc-index is a python package that provides convenience functions for accessing IDC data, including efficient download from IDC public S3 buckets
3D Slicer extensions, such as SlicerIDCBrowser, can be used for interactive download of IDC data
we are contributing to a variety of tools that aim to simplify the use of DICOM in cancer imaging research; these include libraries for conversion between the DICOM Whole Slide Imaging (WSI) format and other slide microscopy formats, and the highdicom library for converting image analysis results to and from DICOM representation
If you would like your annotations/analysis results to be considered, you must establish the value of your contribution (e.g., describe the qualifications of the experts performing manual annotations, or demonstrate the robustness of the AI tool you are applying to images with a peer-reviewed publication or other type of evidence), and be willing to share your contribution under a permissive Creative Commons Attribution (CC BY) license.
See more details on our curation policy, and reach out by sending email to support@canceridc.dev with any questions or inquiries. Every application will be reviewed by IDC stakeholders.
upon conversion, we will create a Zenodo entry under the IDC Zenodo community for your contribution, so that you get a Digital Object Identifier (DOI), citation and recognition of your contribution
IDC is a component of the broader NCI Cancer Research Data Commons (CRDC), giving you access to the following:
CRDC search tools can be used to find data related to the images in IDC in other CRDC data repositories
Broad Terra and the Seven Bridges Cancer Genomics Cloud (SB-CGC) can be used to apply analysis tools to the data in IDC (you can read more about how this can be done in publications from the IDC team)
the MHub.ai platform curates a growing number of cancer imaging AI models that can be applied directly to the DICOM data available in IDC
If you are an NIH-funded investigator, you can join the NIH STRIDES Initiative, which offers significant discounts on the use of cloud resources, and free training courses and materials on the use of the cloud.
Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023).
Weiss, J., Bernatz, S., Johnson, J., Thiriveedhi, V., Mak, R. H., Fedorov, A., Lu, M. T. & Aerts, H. J. W. Opportunistic assessment of steatotic liver disease in lung cancer screening eligible individuals. J. Intern. Med. (2025).
Thiriveedhi, V. K., Krishnaswamy, D., Clunie, D., Pieper, S., Kikinis, R. & Fedorov, A. Cloud-based large-scale curation of medical imaging data using AI segmentation. Research Square (2024).
Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S., Aerts, H. J. W. L., Homeyer, A., Lewis, R., Akbarzadeh, A., Bontempi, D., Clifford, W., Herrmann, M. D., Höfener, H., Octaviano, I., Osborne, C., Paquette, S., Petts, J., Punzo, D., Reyes, M., Schacherer, D. P., Tian, M., White, G., Ziegler, E., Shmulevich, I., Pihl, T., Wagner, U., Farahani, K. & Kikinis, R. NCI Imaging Data Commons. Cancer Res. 81, 4188–4193 (2021).
Gorman, C., Punzo, D., Octaviano, I., Pieper, S., Longabaugh, W. J. R., Clunie, D. A., Kikinis, R., Fedorov, A. Y. & Herrmann, M. D. Interoperable slide microscopy viewer and annotation tool for imaging data science and computational pathology. Nat. Commun. 14, 1–15 (2023).
Bridge, C. P., Gorman, C., Pieper, S., Doyle, S. W., Lennerz, J. K., Kalpathy-Cramer, J., Clunie, D. A., Fedorov, A. Y. & Herrmann, M. D. Highdicom: a Python Library for Standardized Encoding of Image Annotations and Machine Learning Model Outputs in Pathology and Radiology. J. Digit. Imaging 35, 1719–1737 (2022).
Schacherer, D. P., Herrmann, M. D., Clunie, D. A., Höfener, H., Clifford, W., Longabaugh, W. J. R., Pieper, S., Kikinis, R., Fedorov, A. & Homeyer, A. The NCI Imaging Data Commons as a platform for reproducible research in computational pathology. Comput. Methods Programs Biomed. 107839 (2023).
Krishnaswamy, D., Bontempi, D., Thiriveedhi, V., Punzo, D., Clunie, D., Bridge, C. P., Aerts, H. J., Kikinis, R. & Fedorov, A. Enrichment of the NLST and NSCLC-Radiomics computed tomography collections with AI-derived annotations. arXiv [cs.CV] (2023).
Bontempi, D., Nuernberg, L., Pai, S., Krishnaswamy, D., Thiriveedhi, V., Hosny, A., Mak, R. H., Farahani, K., Kikinis, R., Fedorov, A. & Aerts, H. J. W. L. End-to-end reproducible AI pipelines in radiology using the cloud. Nat. Commun. 15, 6931 (2024).
Krishnaswamy, D., Bontempi, D., Thiriveedhi, V. K., Punzo, D., Clunie, D., Bridge, C. P., Aerts, H. J. W. L., Kikinis, R. & Fedorov, A. Enrichment of lung cancer computed tomography collections with AI-derived annotations. Sci. Data 11, 1–15 (2024).
Murugesan, G. K., McCrumb, D., Aboian, M., Verma, T., Soni, R., Memon, F., Farahani, K., Pei, L., Wagner, U., Fedorov, A. Y., Clunie, D., Moore, S. & Van Oss, J. The AIMI Initiative: AI-Generated Annotations for Imaging Data Commons Collections. arXiv [eess.IV] (2023).
See the full list, as curated by Google Scholar.
Pai, S., Bontempi, D., Hadzic, I., Prudente, V., Sokač, M., Chaunzwa, T. L., Bernatz, S., Hosny, A., Mak, R. H., Birkbak, N. J. & Aerts, H. J. W. L. Foundation model for cancer imaging biomarkers. Nature Machine Intelligence 6, 354–367 (2024).
Murugesan, G. K., McCrumb, D., Aboian, M., Verma, T., Soni, R., Memon, F. & Van Oss, J. The AIMI initiative: AI-generated annotations for imaging data commons collections. arXiv [eess.IV] (2023).
Kulkarni, P., Kanhere, A., Yi, P. H. & Parekh, V. S. Text2Cohort: Democratizing the NCI Imaging Data Commons with natural language cohort discovery. arXiv [cs.LG] (2023).
Jiang, P., Sinha, S., Aldape, K., Hannenhalli, S., Sahinalp, C. & Ruppin, E. Big data in basic and translational cancer research. Nat. Rev. Cancer 22, 625–639 (2022).
Schapiro, D., Yapp, C., Sokolov, A., Reynolds, S. M., Chen, Y.-A., Sudar, D., Xie, Y., Muhlich, J., Arias-Camison, R., Arena, S., Taylor, A. J., Nikolov, M., Tyler, M., Lin, J.-R., Burlingame, E. A., Human Tumor Atlas Network, Chang, Y. H., Farhi, S. L., Thorsson, V., Venkatamohan, N., Drewes, J. L., Pe’er, D., Gutman, D. A., Herrmann, M. D., Gehlenborg, N., Bankhead, P., Roland, J. T., Herndon, J. M., Snyder, M. P., Angelo, M., Nolan, G., Swedlow, J. R., Schultz, N., Merrick, D. T., Mazzili, S. A., Cerami, E., Rodig, S. J., Santagata, S. & Sorger, P. K. MITI minimum information guidelines for highly multiplexed tissue images. Nat. Methods 19, 262–267 (2022).
Wahid, K. A., Glerean, E., Sahlsten, J., Jaskari, J., Kaski, K., Naser, M. A., He, R., Mohamed, A. S. R. & Fuller, C. D. Artificial intelligence for radiation oncology applications using public datasets. Semin. Radiat. Oncol. 32, 400–414 (2022).
Hartley, M., Kleywegt, G. J., Patwardhan, A., Sarkans, U., Swedlow, J. R. & Brazma, A. The BioImage Archive - Building a Home for Life-Sciences Microscopy Data. J. Mol. Biol. 167505 (2022). doi:10.1016/j.jmb.2022.167505
Diaz-Pinto, A., Alle, S., Nath, V., Tang, Y., Ihsani, A., Asad, M., Pérez-García, F., Mehta, P., Li, W., Flores, M., Roth, H. R., Vercauteren, T., Xu, D., Dogra, P., Ourselin, S., Feng, A. & Cardoso, M. J. MONAI Label: A framework for AI-assisted interactive labeling of 3D medical images. arXiv [cs.HC] (2022).
We want Imaging Data Commons to be your companion in your cancer imaging research activities - from discovering relevant data to sharing your analysis results and showcasing the tools you developed!
IDC Portal is integrated with powerful visualization tools: using just your web browser, you can view IDC images and annotations in the OHIF Viewer, Slim viewer and VolView!
We have many tools to help you search data in IDC, so that you download only what you need!
once you have the idc-index python package installed, download from the command line is as easy as running idc download <manifest_file> or idc download <collection_id>.
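If you prefer Python over the command line, the same operations are available through the idc-index API. Below is a minimal sketch (method and argument names follow the idc-index documentation at the time of writing and may evolve between releases; the collection ID is just an example):

```python
from idc_index import index

# Instantiate the client; this loads the IDC metadata index that
# ships with the idc-index package.
client = index.IDCClient()

# Report which IDC data release the installed index corresponds to.
print("IDC version:", client.get_idc_version())

# Download an entire collection (the Python equivalent of
# `idc download <collection_id>`) into a local directory.
client.download_from_selection(
    collection_id="rider_pilot",  # example collection ID
    downloadDir="./idc_data",
)
```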
We want to make it easier to understand the performance of the latest advances in AI on real-world cancer imaging data!
With the cloud, you can do things that are simply impossible to do with your local resources.
If you have an algorithm that you evaluated/published that can enrich data in IDC with analysis results and you want to contribute those, or if you are a domain expert and would like to publish the results of manual annotations you prepared - we want to hear from you!
through a dedicated Zenodo record you will have a citation and DOI to get credit for your work; your data is ingested from Zenodo into IDC, and a citation will be generated for the users of your data in IDC
Check out the documentation on how to access and use the IDC Portal - the web application that will help you search, subset and visualize data available in IDC.
you can do basic filtering/subsetting of the data using the IDC Portal, but if you are a developer, you will want to learn how to use idc-index for programmatic access. A dedicated tutorial will introduce you to the basics of idc-index for interaction with IDC content.
search clinical data: many of the IDC collections are accompanied by clinical data, which we parsed for you into a searchable tabular representation - no need to download or parse CSV/Excel/PDF files! Dive into searching clinical data using the dedicated tutorial.
if advanced content does not scare you, check out the BigQuery tutorial to learn how to search all of the metadata accompanying IDC using SQL and Google BigQuery.
We provide various tools for downloading data from IDC, as discussed in the documentation. Access to all data in IDC is free! No registration. No access request forms. No logins.
looking for an interactive "point-and-click" application? 3D Slicer with the SlicerIDCBrowser extension is for you (note that you will only be able to visualize radiology - not microscopy - images in 3D Slicer)
if you have a Google account, you have free access to Google Colab, which allows you to run python notebooks on cloud VMs equipped with GPU - for free! Combined with idc-index for data access, this makes it rather easy to experiment with the latest AI tools! As an example, take a look at the notebook that allows you to apply the MedSAM model to IDC data. You will find a growing number of notebooks to help you use IDC in our tutorials.
use IDC to develop HuggingFace spaces that demonstrate the power of your models on real data: see the space we developed for SegVol
a growing number of AI medical imaging models is being curated on the MHub.ai platform; see its documentation to learn how to apply those models to data from IDC
How about accompanying your next publication with a working demonstration notebook on relevant samples from IDC? You can see an example of how we did this.
read our paper to learn how we applied TotalSegmentator+pyradiomics to >126,000 CT scans of the NLST collection using the Terra platform, completing the analysis in ~8 hours at a total cost of ~$1000
the accompanying repository contains the code we used in the above (this is really advanced content!)
IDC maintains a Zenodo community where we curate contributions of analysis results and other datasets produced by IDC (see the community page for examples of such contributions)
once your data is in IDC, it should be easier to discover it, combine it with other datasets, visualize it and use it from analysis workflows (as an example, see the materials accompanying the RMS annotations)
email us at support@canceridc.dev to inquire about contributing your annotations/analysis results to IDC!
Imaging Data Commons is being developed by a team of engineers and imaging scientists with decades of experience in cancer imaging informatics, cloud computing, imaging standards, security, open source tool development and data sharing.
Our team includes the following sites and project leads:
Brigham and Women's Hospital, Boston, MA, USA (BWH)
Andrey Fedorov, PhD, and Ron Kikinis, MD - Co-PIs of the project
Hugo Aerts, PhD
Cosmin Ciausu, MS
Deepa Krishnaswamy, PhD
Katie Mastrogiacomo
Maria Loy
Institute for Systems Biology, Seattle, WA, USA (ISB)
David Gibbs, PhD - site PI
William Longabaugh, MS
William Clifford, MS
Suzanne Paquette, MS
George White
Ilya Shmulevich, PhD
General Dynamics Information Technology, Bethesda, MD, USA (GDIT)
David Pot, PhD - site PI
Poojitha Gundluru
Fabian Seidl
Prema Venkatesun
Anthony Le
Fraunhofer MEVIS, Bremen, Germany (Fraunhofer MEVIS)
André Homeyer, PhD - site PI
Daniela Schacherer, MS
Henning Höfener, PhD
Massachusetts General Hospital, Boston, MA, USA (MGH)
Chris Bridge, DPhil - site PI
Chris Gorman, PhD
Radical Imaging LLC, Boston, MA, USA (Radical Imaging)
Rob Lewis, PhD - site PI
Igor Octaviano
Pedro Kohler
PixelMed Publishing, Bangor, PA, USA (PixelMed)
David Clunie, MB, BS - site PI
Isomics Inc, Cambridge, MA, USA (Isomics)
Steve Pieper, PhD - site PI
Oversight:
Leidos Biomedical Research
Ulrike Wagner - project manager
Todd Pihl - project manager
National Cancer Institute
Erika Kim - federal lead
Granger Sutton - federal lead
We are grateful to the following individuals who contributed to IDC in the past, but are no longer directly involved in the development of IDC.
Keyvan Farahani (NCI)
Markus Herrmann (MGH)
Davide Punzo (Radical Imaging)
James Petts (Radical Imaging)
Erik Ziegler (Radical Imaging)
Gitanjali Chhetri (Radical Imaging)
Rodrigo Basilio (Radical Imaging)
Jose Ulloa (Radical Imaging)
Madelyn Reyes (GDIT)
Derrick Moore (GDIT)
Mark Backus (GDIT)
Rachana Manandhar (BWH)
Rasmus Kiehl (Fraunhofer MEVIS)
Chad Osborne (GDIT)
Afshin Akbarzadeh (BWH)
Dennis Bontempi (BWH)
Vamsi Thiriveedhi (BWH)
Jessica Cienda (GDIT)
Bernard Larbi (GDIT)
Mi Tian (ISB)
The Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
Check out the Downloading data documentation page!
Note that currently IDC prioritizes submissions from NCI-funded driving projects and data from specially selected projects.
If you would like to submit images, it will be your responsibility to de-identify them first, documenting the de-identification process and submitting that documentation for review by IDC stakeholders.
The IDC pilot release took place in Fall 2020, followed by the production release in September 2021.
Please cite the latest paper from the IDC team. Please also make sure you acknowledge the specific data collections you used in your analysis.
IDC and TCIA are partners in providing FAIR data for cancer imaging researchers. While some of the functions between the two resources are similar, there are also key differences. The table below provides a summary of similarities and differences.
| Function | IDC | TCIA |
| --- | --- | --- |
| De-identification | no, IDC can only host data already de-identified | yes |
| Cloud-based data co-located with compute resources | yes | no |
| Conversion of pathology images and image-derived data into DICOM format | yes | no |
| Private data collections | no | yes |
| Public data collections | yes | yes |
| Version control of the data | partial | |
IDC Portal gives you access to just a small subset of the metadata accompanying IDC images. If you want to learn more about what is available, you have several options:
Let's start with the overall principles of how we organize data in IDC.
IDC brings you (as of v18) over 60 TB of publicly available DICOM images and image-derived content. We share those with you as DICOM files, and those DICOM files are available in cloud-based storage buckets - both in Google and AWS.
Sharing just the files, however, is not particularly helpful. With that much data, it is no longer practical to just download all of those files to later sort through them to select those you need.
Think of IDC as a library, where each file is a book. With that many books, it is not feasible to read them all, or even open each one to understand what is inside. Libraries are of little use without a catalog!
To provide you with a catalog of our data, along with the files, we maintain metadata that makes it possible to understand what is contained within files, and select the files that are of interest for your project, so that you can download just the files you need. We make that metadata available in BigQuery tables searchable using standard SQL.
IDC utilizes BigQuery tables to organize metadata accompanying the files we host. If you have never worked with BigQuery before, you need to understand the basics of data organization in BQ.
BQ tables are organized in BQ datasets. BQ datasets are not unlike folders on your computer, but contain tables related to each other instead of files. BQ datasets, in turn, are organized under Google Cloud projects. GCP projects can be thought of as containers that are managed by a particular organization. To continue with the file system analogy, think about projects as hard drives that contain folders.
Let's map the aforementioned project-dataset-table hierarchy to the concrete locations that contain IDC data.
All of the IDC tables are stored under the bigquery-public-data project. That project is managed by the Google Public Datasets Program, and contains many public BQ datasets beyond those maintained by IDC.
All of the IDC tables are organized into datasets by data release version. If you complete the tutorial mentioned above, open the BQ console, and scroll down the list of datasets, you will find those that are named starting with the idc_v prefix - those are IDC datasets.
Following the prefix, you will find the number that corresponds to the IDC data release version. IDC data release version numbers start from 1 and are incremented by one for each subsequent release. As of writing this, the most recent version of IDC is 16, and you can find the dataset idc_v16 corresponding to this version.
Finally, you will also see two special datasets: idc_current and idc_current_clinical. Those two datasets are essentially aliases, or links, to the versioned datasets corresponding to the latest release of IDC data.
If you want to explore the latest content of IDC - use the current datasets.
If you want to make sure your queries and data selection are reproducible - always use the version numbered datasets!
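For example, here is a minimal sketch of querying a version-pinned dataset with the BigQuery Python client (it assumes the google-cloud-bigquery package is installed and that you have a Google Cloud project to bill queries to; my-gcp-project is a placeholder):

```python
from google.cloud import bigquery

# Queries are billed to your own project; the tables themselves live
# in the public bigquery-public-data project.
client = bigquery.Client(project="my-gcp-project")  # replace with your project ID

# Pinning the query to idc_v16 (rather than idc_current) keeps the
# result reproducible as new IDC versions are released.
query = """
SELECT collection_id, COUNT(DISTINCT SeriesInstanceUID) AS series_count
FROM `bigquery-public-data.idc_v16.dicom_all`
GROUP BY collection_id
ORDER BY series_count DESC
LIMIT 10
"""
for row in client.query(query).result():
    print(row.collection_id, row.series_count)
```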
BQ views can be very handy when you want to simplify your queries by factoring out the part of the query that is often reused. But a key disadvantage of BQ views over tables is reduced performance and increased cost, because the underlying query is re-run each time you query the view.
As we will discuss further, most of the tables maintained by IDC are created by joining and/or post-processing other tables. Because of this we rely heavily on BQ views to improve transparency of the provenance of those "derived" tables. BQ views can be easily distinguished from the tables in a given dataset by a different icon. IDC datasets also follow a convention that all views in the versioned datasets include the suffix _view in the name, and are accompanied by the result of running the query used by the view in a table that has the same name sans the _view suffix. See the figure below for an illustration of this convention.
Now that we have reviewed the main concepts behind IDC table organization, it is time to explain the sources of metadata contained in those tables. Leaving the _clinical datasets aside, IDC tables are populated from one of two sources:
DICOM metadata extracted from the DICOM files hosted by IDC, and various derivative tables that simplify access to specific DICOM metadata items;
collection-level and auxiliary metadata, which is not stored in DICOM tags, but is either received by IDC from other sources, or is populated by IDC as part of data curation (these include Digital Object Identifiers, description of the collections, hashsums, etc).
The set of BQ tables and views has grown over time. The enumeration below documents the BQ tables and views as of IDC v14. Some of these tables will not be found in earlier IDC BigQuery datasets.
dicom_metadata
Each row in the dicom_metadata table holds the DICOM metadata of an instance in the corresponding IDC version. There is a single row for each DICOM instance in the corresponding IDC version, and the columns correspond to the DICOM attributes encountered in the data across all of the ingested instances.
The dicom_metadata table contains the DICOM metadata extracted from the files included in the given IDC data release. The amount and variety of the DICOM files grows with the new releases, and the schema of this table reflects the organization of the metadata in each IDC release. Non-sequence attributes, such as Modality or SeriesInstanceUID, once encountered in any one file will result in the corresponding column being introduced to the table schema (i.e., if we have column X in IDC release 11, in all likelihood it will also be present in all of the subsequent releases).
dicom_metadata can be used to conduct detailed explorations of the metadata content, and to build cohorts using fine-grained controls not accessible from the IDC Portal. Note that the dicom_all table, described below, is probably a better choice for such explorations.
Due to existing limitations of the Google Healthcare API, not all DICOM attributes are extracted and available in the BigQuery tables. Specifically:
sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and RetrieveMetadata output. 1 MiB is not an exact limit, but it can be used as a rough estimate of whether or not the API will drop the tag (this limitation was not documented as of writing this) - we know that some of the instances in IDC are affected by this limitation. The fix for this limitation is targeted for sometime in 2021, according to communication with Google Healthcare support.
auxiliary_metadata
This table defines the contents of the corresponding IDC version. There is a row for each instance in the version. We group the attributes for convenience:
Collection attributes:
tcia_api_collection_id: The ID, as accepted by the TCIA API, of the original data collection containing this instance (will be Null for collections not sourced from TCIA)
idc_webapp_collection_id: The ID, as accepted by the IDC web app, of the original data collection containing this instance
collection_id: The ID, as accepted by the IDC web app. Duplicate of idc_webapp_collection_id
collection_timestamp: Datetime when the IDC data in the collection was last revised
collection_hash: md5 hash of this version of the collection containing this instance
collection_init_idc_version: The IDC version in which the collection containing this instance first appeared
collection_revised_idc_version: The IDC version in which this version of the collection containing this instance first appeared
Patient attributes:
submitter_case_id: The Patient ID assigned by the submitter of this data. This is the same as the DICOM PatientID
idc_case_id: IDC-generated UUID that uniquely identifies the patient containing this instance. This is needed because DICOM PatientIDs are not required to be globally unique
patient_hash: md5 hash of this version of the patient/case containing this instance
patient_init_idc_version: The IDC version in which the patient containing this instance first appeared
patient_revised_idc_version: The IDC version in which this version of the patient/case containing this instance first appeared
Study attributes:
StudyInstanceUID: DICOM UID of the study containing this instance
study_uuid: IDC-assigned UUID that identifies a version of the study containing this instance
study_instances: The number of instances in the study containing this instance
study_hash: md5 hash of the data in this version of the study containing this instance
study_init_idc_version: The IDC version in which the study containing this instance first appeared
study_revised_idc_version: The IDC version in which this version of the study containing this instance first appeared
Series attributes:
SeriesInstanceUID: DICOM UID of the series containing this instance
series_uuid: IDC-assigned UUID that identifies the version of the series containing this instance
source_doi: The DOI of an information page corresponding to the original data collection or analysis result that is the source of this instance
source_url: The URL of an information page that describes the original collection or analysis result that is the source of this instance
series_instances: The number of instances in the series containing this instance
series_hash: md5 hash of the data in this version of the series containing this instance
access: Collection access status: 'Public' or 'Limited' (currently all data is 'Public')
series_init_idc_version: The IDC version in which the series containing this instance first appeared
series_revised_idc_version: The IDC version in which this version of the series containing this instance first appeared
Instance attributes:
SOPInstanceUID: DICOM UID of this instance
instance_uuid: IDC-assigned UUID that identifies the version of this instance
gcs_url: The GCS URL of the file containing the version of this instance that is identified by this series_uuid/instance_uuid
aws_url: The AWS URL of the file containing the version of this instance that is identified by this series_uuid/instance_uuid
instance_hash: md5 hash of this version of this instance
instance_size: The size, in bytes, of this version of this instance
instance_init_idc_version: The IDC version in which this instance first appeared
instance_revised_idc_version: The IDC version in which this version of this instance first appeared
license_url: The URL of a web page that describes the license governing this version of this instance
license_long_name: A long form name of the license governing this version of this instance
license_short_name: A short form name of the license governing this version of this instance
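As a concrete illustration of working with these attributes, the following sketch (reusing the BigQuery client from the earlier example) computes the total size and license mix of each collection from auxiliary_metadata:

```python
# Reusing the BigQuery client set up in the earlier example:
# aggregate per-instance sizes and licenses per collection.
query = """
SELECT
  collection_id,
  ROUND(SUM(instance_size) / POW(1024, 3), 2) AS size_GiB,
  STRING_AGG(DISTINCT license_short_name) AS licenses
FROM `bigquery-public-data.idc_v16.auxiliary_metadata`
GROUP BY collection_id
ORDER BY size_GiB DESC
LIMIT 10
"""
for row in client.query(query).result():
    print(row.collection_id, row.size_GiB, row.licenses)
```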
mutable_metadata
Some non-DICOM metadata may change over time. This includes the GCS and AWS URLs of instance data, the accessibility of each instance and the URL of an instance's associated description page. BigQuery metadata tables such as the auxiliary_metadata and dicom_all tables are never revised even when such metadata changes. However, tables in the datasets of previous IDC versions can be joined with the mutable_metadata table to obtain the current values of these mutable attributes.
The table has one row for each version of each instance:
crdc_instance_uuid: The UUID of an instance version
crdc_series_uuid: The UUID of the series version that contains this instance version
crdc_study_uuid: The UUID of the study version that contains the series version
gcs_url: URL of the Google Cloud Storage (GCS) object containing this instance version
aws_url: URL of the Amazon Web Services (AWS) object containing this instance version
access: Current access status of this instance ('Public' or 'Limited')
source_url: The URL of a page that describes the original collection or analysis result that includes this instance
source_doi: The DOI of a page that describes the original collection or analysis result that includes this instance
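For example, a query along the following lines (a sketch based on the columns described above; it assumes mutable_metadata is queried from the idc_current dataset) refreshes the storage URLs of instances selected from an older release:

```python
# Join a frozen per-version table with mutable_metadata to obtain
# up-to-date bucket URLs for previously selected instances.
query = """
SELECT
  aux.SOPInstanceUID,
  mm.gcs_url,
  mm.aws_url
FROM `bigquery-public-data.idc_v13.auxiliary_metadata` AS aux
JOIN `bigquery-public-data.idc_current.mutable_metadata` AS mm
  ON aux.instance_uuid = mm.crdc_instance_uuid
LIMIT 10
"""
for row in client.query(query).result():
    print(row.SOPInstanceUID, row.aws_url)
```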
original_collections_metadata
This table is comprised of collection-level metadata for the original data collections hosted by IDC, for the most part corresponding to the content available on the TCIA collection pages. One row per collection:
tcia_api_collection_id: The collection ID as accepted by the TCIA API
tcia_wiki_collection_id: The collection ID as on the TCIA wiki page
idc_webapp_collection_id: The collection ID as accepted by the IDC web app
Program: The program to which this collection belongs
Updated: Most recent update date reported by the collection source
Status: Collection status: "Ongoing" or "Complete"
Access: Collection access conditions: "Limited" or "Public"
ImageType: Enumeration of image types/modalities in the collection
Subjects: Number of subjects in the collection
DOI: DOI that can be resolved at doi.org to the TCIA wiki page for this collection
URL: URL of an information page for this collection
CancerType: Cancer type of this collection, as assigned by the collection source(s)
SupportingData: Type(s) of additional data available
Species: Species of collection subjects
Location: Body location that was studied
Description: Description of the collection (HTML format)
license_url: The URL of a web page that describes the license governing this collection
license_long_name: A long form name of the license governing this collection
license_short_name: A short form name of the license governing this collection
analysis_results_metadata
Metadata for the TCIA analysis results hosted by IDC, for the most part corresponding to the content available on the TCIA analysis results pages. One row per analysis result:
ID: Results ID
Title: Descriptive title
DOI: DOI that can be resolved at doi.org to the TCIA wiki page for this analysis result
CancerType: TCIA-assigned cancer type of this analysis result
Location: Body location that was studied
Subjects: Number of subjects in the analysis result
Collections: Original collections studied
AnalysisArtifactsonTCIA: Type(s) of analysis artifacts generated
Updated: Date when results were last updated
license_url: The URL of a web page that describes the license governing this analysis result
license_long_name: A long form name of the license governing this analysis result
license_short_name: A short form name of the license governing this analysis result
description: Description of the analysis result
version_metadata
Metadata for each IDC version, one row per version:
idc_version: IDC version number
version_hash: MD5 hash of the hashes of the collections in this version
version_timestamp: Version creation timestamp
The following tables and views consist of metadata derived from one or more other IDC tables for the convenience of the user. For each such table, <table_name>, there is also a corresponding view, <table_name>_view, that, when queried, generates an equivalent table. These views are intended as a reference; each view's SQL is available to be used for further investigation. Several of these tables/views are discussed in more detail elsewhere in this documentation.
dicom_all, dicom_all_view
All columns from dicom_metadata together with selected data from the auxiliary_metadata, original_collections_metadata, and analysis_results_metadata tables.
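Because dicom_all combines DICOM attributes with collection-level metadata, a single query can filter on both at once. A sketch (the collection_id value is just an illustration):

```python
# Find DICOM Segmentation series within one collection; Modality and
# SeriesDescription come from the DICOM metadata, collection_id from
# the collection-level metadata joined into dicom_all.
query = """
SELECT SeriesInstanceUID, collection_id, Modality, SeriesDescription
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality = 'SEG' AND collection_id = 'nsclc_radiomics'
LIMIT 10
"""
for row in client.query(query).result():
    print(row.SeriesInstanceUID, row.SeriesDescription)
```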
segmentations, segmentations_view
This table is derived from dicom_all to simplify access to the attributes of DICOM Segmentation objects available in IDC. Each row in this table corresponds to one segment of a DICOM Segmentation instance.
measurement_groups, measurement_groups_view
This table is derived from dicom_all to simplify access to the measurement groups encoded in DICOM Structured Report TID 1500 objects available in IDC. Specifically, it contains measurement groups corresponding to the "Measurement group" content item in the DICOM SR objects; each row corresponds to one TID 1500 measurement group.
qualitative_measurements, qualitative_measurements_view
This table is derived from dicom_all to simplify access to the qualitative measurements in DICOM SR TID 1500 objects. It contains coded evaluation results extracted from those objects; each row corresponds to a single qualitative measurement.
quantitative_measurements, quantitative_measurements_view
This table is derived from dicom_all to simplify access to the quantitative measurements in DICOM SR TID 1500 objects. It contains quantitative evaluation results extracted from those objects; each row corresponds to a single quantitative measurement.
dicom_metadata_curated, dicom_metadata_curated_view
Curated values of DICOM metadata extracted from dicom_metadata.
dicom_metadata_curated_series_level, dicom_metadata_curated_series_level_view
Curated columns from dicom_metadata that have been aggregated/cleaned up to describe content at the series level. Each row in this table corresponds to a DICOM series in IDC. The columns are curated by defining queries that apply transformations to the original values of DICOM attributes.
idc_pivot_v<idc version>
A view that is the basis for the queries performed by the IDC web app.
The following tables contain TCGA-specific metadata:
tcga_biospecimen_rel9: biospecimen metadata
tcga_clinical_rel9: clinical metadata
nlst_canc: "Lung Cancer"
nlst_ctab: "SCT Abnormalities"
nlst_ctabc: "SCT Comparison Abnormalities"
nlst_prsn: "Participant"
nlst_screen: "SCT Screening"
The object namespace is hierarchical: for each version of a DICOM instance having instance UUID <instance_uuid> in a version of a series having UUID <series_uuid>, the file name is:
<series_uuid>/<instance_uuid>.dcm
Corresponding files have the same object name in GCS and S3, though the name of the containing buckets will be different.
Consider an instance in the CPTAC-CM collection that has this SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.171941254.777277241.1640849481094.35.0
It is in a series having this SeriesInstanceUID: 1.3.6.1.4.1.5962.99.1.171941254.777277241.1640849481094.2.0
The instance and series were added to IDC in version 7. At that point, the instance was assigned this UUID:
5dce0cf0-4694-4dff-8f9e-2785bf179267
and the series was assigned this UUID:
e127d258-37c2-47bb-a7d1-1faa7f47f47a
In IDC version 10, a revision of this instance was added (keeping its original SOPInstanceUID
), and assigned this UUID:
21e5e9ce-01f5-4b9b-9899-a2cbb979b542
Because this instance was revised, the series containing it was implicitly revised. The revised series was thus issued a new UUID:
ee34c840-b0ca-4400-a6c8-c605cef17630
Thus, the initial version of this instance has this file name:
e127d258-37c2-47bb-a7d1-1faa7f47f47a/5dce0cf0-4694-4dff-8f9e-2785bf179267.dcm
and the revised version of the instance has this file name:
ee34c840-b0ca-4400-a6c8-c605cef17630/21e5e9ce-01f5-4b9b-9899-a2cbb979b542.dcm
Both versions of the instance are in both AWS and GCS buckets.
Note that GCS and AWS bucket names are different. In fact, DICOM instance data is distributed across multiple buckets in both GCS and AWS. We will discuss obtaining GCS and AWS URLs in more detail a little later.
Utilities like gsutil, s3 and s5cmd "understand" the implied hierarchy in these file names. Thus the series UUID acts like the name of a directory that contains all the instance versions in the series version, both in GCS and in AWS buckets, making it easy to transfer all instances in a series from the cloud.
Because file names are more or less opaque, you will not typically select files by listing the contents of a bucket. Instead, use either the IDC Portal or the IDC BigQuery tables to identify items of interest and then generate a manifest of objects that can be passed to a utility like s5cmd.
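A sketch of that workflow: select the series of interest in BigQuery, derive the series folder URL from the per-instance aws_url values described above, and write s5cmd copy commands into a manifest file (the SeriesInstanceUID below is the one from the worked example above):

```python
# Build an s5cmd manifest for one series by stripping the instance
# file name from aws_url, leaving the bucket/series-folder prefix.
query = r"""
SELECT DISTINCT
  CONCAT('cp s3://',
         REGEXP_EXTRACT(aws_url, r's3://(.*)/[^/]+\.dcm'),
         '/* ./downloads/') AS s5cmd_command
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE SeriesInstanceUID = '1.3.6.1.4.1.5962.99.1.171941254.777277241.1640849481094.2.0'
"""
with open("manifest.s5cmd", "w") as f:
    for row in client.query(query).result():
        f.write(row.s5cmd_command + "\n")

# The manifest can then be executed with:
#   s5cmd --no-sign-request run manifest.s5cmd
```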
Most of the data in IDC is received from data collection initiatives/projects supported by the US National Cancer Institute. Whenever source images or image-derived data are not in DICOM format, they are harmonized into DICOM as part of ingestion.
As of data release v21, IDC sources of data include:
all DICOM files from the TCIA public collections are mirrored in IDC
a subset of digital pathology collections and analysis results harmonized from vendor-specific representation (as available from TCIA) into DICOM Slide Microscopy (SM) format
digital pathology slides harmonized into DICOM SM
The Cancer Genome Atlas (TCGA) slides harmonized into DICOM SM
release 1 of the HTAN data harmonized into DICOM SM
v1 of the Visible Human images harmonized into DICOM MR/CT/XC
digital pathology slides harmonized into DICOM SM
Whenever IDC replicates data from a publicly available source, we include the reference to the origin:
from the IDC Portal Explore page, click on the "i" icon next to the collection in the collections list
A simplified workflow for IDC data ingestion is summarized in the following diagram.
This section describes the current organization of IDC data. The organization of data was static from IDC Version 2 through IDC Version 13, except that clinical data was added in Version 11. Development of the clinical data resource is an ongoing project. From IDC v14, our data is available from the Amazon AWS Open Data Registry, and the files in storage buckets were organized into series-level folders.
Portal:
Discourse (community forum):
Documentation:
GitHub organization:
Tutorials:
: while most of the public DICOM collections from TCIA are available in IDC, we do not replicate limited access TCIA collections
: list curated by Stephen Aylward
: list curated by University College London
: list curated by New York University Health Sciences Library
We gratefully acknowledge the Google Public Datasets Program and the AWS Open Data Sponsorship Program that support public hosting of IDC-curated content and cover out-of-cloud egress fees!
Several members of the IDC team utilize compute resources supported via the Allocations program, which is in turn funded by the US National Science Foundation. Instructions on how to get your own allocation are available on the program's website.
We welcome submissions of image-derived data (expert annotations, AI-generated segmentations) for the images already in IDC; see the IDC Zenodo community to learn about the requirements for such submissions!
IDC works closely with and mirrors TCIA public collections. If you submit your DICOM data to TCIA and your data is released as a public collection, it will automatically become available in IDC in a subsequent release.
If you are interested in making your data available within IDC, please contact us by sending email to support@canceridc.dev.
IDC data is stored in cloud buckets, and you can search and download it for free and without login.
If you would like to use the cloud for analysis of the data, we recommend you start with the free tier of Google Colab to get free access to a cloud-hosted VM with GPU to experiment with analysis workflows for IDC data. If you are an NIH-funded researcher, you may be eligible for a free allocation via NIH STRIDES. US-based researchers can also access free cloud-based computing resources via the NSF-funded Allocations program.
We host most of the public collections from TCIA. We also host HTAN and other pathology images not hosted by TCIA. You can review the complete, up-to-date list of collections in the IDC Portal.
Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics 43 (2023).
The main website for the Cancer Research Data Commons (CRDC) is
Clinical data shared by the submitters is available for a number of imaging collections in IDC. Please see the clinical data documentation on how to search that data and how to link clinical data with imaging metadata!
Many of the imaging collections are also accompanied by genomics or proteomics data. CRDC provides an API to locate such related datasets.
A dedicated part of our Getting Started tutorial series explains how to use idc-index - a python package that aims to simplify access to IDC data
Another tutorial will help you get started with searching IDC metadata in BigQuery, which gives you access to all of the DICOM metadata extracted from IDC-hosted files
if you are not comfortable writing queries or coding in python, you can still search using some of the attributes that are not available through the portal, and you can reach out to us to include additional attributes.
In the following we describe the organization of both the metadata catalog and the buckets containing the files. As you go over this documentation, please consider completing our tutorial - it will give you the opportunity to apply the knowledge you gain by reading this article while interacting with the data, and should help you better understand this content.
Google BigQuery is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using SQL queries.
This may be a good time for you to complete the BigQuery onboarding tutorial, so that you are able to open the tables and datasets we will be discussing in the following paragraphs!
In addition to idc_v16, you will find a dataset named idc_v16_clinical. That dataset contains clinical data accompanying IDC collections. We started clinical data ingestion in IDC v11. If you want to learn more about the organization and searching of clinical data, take a look at the clinical data documentation.
Before we dive into discussing the individual tables maintained by IDC, there is just one more BigQuery-specific concept you need to learn: the view. A BigQuery view is a virtual table defined by an SQL query that is run every time you query the view (you can read more about BQ views in the BigQuery documentation).
If you are ever curious (and you should be, at least once in a while!) about the queries behind individual views, you can click on the view in the BQ console and see the query in the "Details" tab. Try this out yourself to check the query behind one of the views.
IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions for how DICOM attributes of various types are converted into BQ form are covered in the Google Healthcare API documentation.
Sequence DICOM attributes, however, may have content that is highly variable across different DICOM instances (especially in Structured Reports). Those attributes map to nested RECORD columns, and it is not unusual to see drastic differences in the corresponding columns of the table between different releases.
sequences that have more than 15 levels of nesting are not extracted (see the Google Healthcare API documentation) - we believe this limitation does not affect the data stored in IDC
Most clinical data is found in the idc_v<idc_version>_clinical datasets. However, a few tables of clinical data are found in the idc_v<idc_version> datasets.
IDC hosts a subset of the NLST clinical data that was cleared for public sharing. If you need the full clinical data, please visit the source of the NLST data.
The following tables contain NLST-specific metadata. The detailed schema of those tables is available from the BQ console.
Storage buckets are basic containers in Google Cloud Storage and AWS S3 that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage and S3 documentation).
All IDC DICOM file data for all IDC data versions, across all of the collections, are maintained in Google Cloud Storage (GCS) and AWS S3 buckets. Currently all DICOM files are maintained in buckets that allow for free egress within or out of the cloud. This is enabled through the partnership of IDC with the Google Public Datasets Program and the AWS Open Data Sponsorship Program.
Note that only (versions of) DICOM instances have associated files (as discussed above). There are no per-series or per-study files.
IDC utilizes a single Google Healthcare DICOM store to host all of the instances in the current IDC version. That store, however, is primarily intended to support visualization of the data using the OHIF and Slim viewers. At this time, we do not support access to the hosted data via the DICOMweb interface by IDC users. Please comment about your use case if you have a need to access data via the DICOMweb interface.
The list of all of the IDC collections is available in the IDC Portal.
The source_doi metadata column contains a Digital Object Identifier (DOI) at the granularity of individual files and is available both via idc-index and BigQuery interfaces.
Whenever source data is harmonized into DICOM, the DOI will correspond to a Zenodo entry for the result of harmonization, which in turn will reference the location where data can be accessed in its native format (if available). As an example, the IDC NLM-Visible-Human-Project collection refers to a DOI describing the dataset that resulted from harmonizing the original data into DICOM, which in turn references the page containing information on accessing the original files collected by the project.
Check out the release notes for information about the collections added in the individual IDC data releases.
IDC V14 introduced important enhancements to IDC data organization. The discussion of the organization of data in earlier versions is preserved here.
By clinical data we refer to the broad spectrum of image-related data that may accompany images. Such data may include demographics of the patients, observations related to their clinical history (therapies, diagnoses, findings), lab tests, and surgeries.
Not only are the terms used in the clinical data accompanying individual collections not harmonized, the format of the spreadsheets is also collection-specific. In order to search and navigate clinical data, one has to parse those collection-specific tables, and there is no interface to support searching across collections.
collection_id (STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id field in the dicom_all view in the idc_current dataset.
table_name (STRING, NULLABLE) - name of the table
table_description (STRING, NULLABLE) - description of the type of data found in the table. Usually this is set to 'clinical data', unless a description is provided in the source files
idc_version_table_added (STRING, NULLABLE) - the IDC data version for which this table was first added
idc_table_added_datetime (STRING, NULLABLE) - the date/time this particular table was first generated
post_process_src (STRING, NULLABLE) - except for the CPTAC and TCGA collections, the tables are curated from ZIP, Excel, and CSV files downloaded from the TCIA wiki. These files do not have a consistent structure and were not meant to be machine readable or to translate directly into BigQuery. A semi-manual curation process results in either a CSV or JSON file that can be directly written into a BigQuery table. post_process_src is the name of the JSON or CSV file that results from this process and is used to create the BigQuery table. This field is not used for the CPTAC- and TCGA-related tables
post_process_src_add_md5 (STRING, NULLABLE) - the md5 hash of post_process_src when the table was first added
idc_version_table_prior (STRING, NULLABLE) - the IDC version the second most recent time the table was updated
post_process_src_prior_md5 (STRING, NULLABLE) - the md5 hash of post_process_src the second most recent time the table was updated
idc_version_table_updated (STRING, NULLABLE) - the IDC version when the table was last updated
table_update_datetime (STRING, NULLABLE) - date and time an update of the table was last recorded
post_process_src_updated_md5 (STRING, NULLABLE) - the md5 hash of post_process_src when the table was last updated
number_batches (INTEGER, NULLABLE) - records the number of batches. Within the source data, patients are sometimes grouped into different 'batches' (i.e. training vs test, responder vs non-responder, etc.) and the batches are placed in different locations (i.e. different files or different sheets in the same Excel file)
source_info (RECORD, REPEATED) - an array of records with information about the table sources. These sources are either files downloaded from the TCIA wiki or another BigQuery table (as is the case for CPTAC and TCGA collections). There is a source_info record for each source 'batch' described above
source_info.srcs (STRING, REPEATED) - a source file downloaded from the TCIA wiki may be a ZIP file, a CSV file, or an Excel file. Sometimes the ZIP files contain other ZIP files that must be opened to extract the clinical data. In the source_info.srcs array, the first string is the file that is downloaded from TCIA for this particular source batch. The final string is the CSV or Excel file that contains the clinical data. Any intermediate strings are the names of ZIP files 'in between' the downloaded file and the clinical file. For CPTAC and TCGA collections this field contains the source BigQuery table
source_info.md5 (STRING, NULLABLE) - md5 hash of the file downloaded from TCIA the most recent time the table was updated
source_info.table_last_modified (STRING, NULLABLE) - CPTAC and TCGA collections only. The date and time the source BigQuery table was most recently modified, as recorded when last copied
source_info.table_size (STRING, NULLABLE) - CPTAC and TCGA collections only. The size of the source BigQuery table as recorded when last copied
collection_id (STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id field in the dicom_all view in the idc_current dataset.
case_col (BOOLEAN, NULLABLE) - true if the BigQuery column contains the patient or case id, i.e. if this column is used to determine the value of the dicom_patient_id column
table_name (STRING, NULLABLE) - table name
column (STRING, NULLABLE) - the actual column name in the table. For ACRIN collections the column name is the variable_name from the provided data dictionary. For other collections it is a name constructed by 'normalizing' the column_label (see next) into a format that can be used as a BigQuery field name
column_label (STRING, NULLABLE) - a 'free form' label for the column that does not need to conform to the BigQuery column format requirements. For ACRIN collections this is the variable_label given by a data dictionary that accompanies the collection. For other collections it is the name or label of the clinical attribute as inferred from the source document during the curation process
data_type (STRING, NULLABLE) - the type of data in this column. Again, for ACRIN collections this is provided in the data dictionary. For other collections it is inferred by analyzing the data during curation
original_column_headers (STRING, REPEATED) - the name(s) or label(s) in the source document that were used to construct the column_label field. In most cases there is one column label in the source document that prescribes the column_label. In some cases, multiple columns are concatenated and reformatted to form the column_label
values (RECORD, REPEATED) - a structure that is borrowed from the ACRIN data model. This is an array that contains observed attribute values for the given column. For ACRIN collections these values are reported in the data dictionary. For most other collections these values are determined by analyzing the source data. For simplicity this field is left blank when the number of unique values is greater than 20
values.option_code (STRING, NULLABLE) - a unique attribute value found in this column
values.option_description (STRING, NULLABLE) - a description of the option_code as provided by a data dictionary. For collections that do not have a data dictionary this is null
values_source (STRING, NULLABLE) - indicates the source of the values records. The text 'provided dictionary' indicates that the records were obtained from a provided data dictionary. The text 'derived from inspection of values' indicates that the records were determined by automated analysis of the source materials during the ETL process that generated the BigQuery tables
files (STRING, REPEATED) - names of the files that contain the source data for each batch. These are the Excel or CSV files directly downloaded from TCIA, or the files extracted from downloaded ZIP files
sheet_names (STRING, REPEATED) - for Excel-sourced files, the sheet names containing this column's values for each batch
batch (INTEGER, REPEATED) - source batches that contain this particular column. Some columns or attributes may be missing from some batches
column_numbers (STRING, REPEATED) - for each source batch, the column in the original source corresponding to this column in the BigQuery table
IDC content is organized in Collections: groups of DICOM files that were collected through a specific research activity.
Individual DICOM files included in the collection contain attributes that organize content according to the DICOM data model.
Each collection contains data for one or more cases (patients). Data for an individual patient is organized into DICOM studies, which group images corresponding to a single imaging exam/encounter collected in a given session. Studies are composed of DICOM series, which in turn consist of DICOM instances. Each DICOM instance corresponds to a single file on disk. As an example, in radiology imaging individual instances correspond to image slices in multi-slice acquisitions, while in digital pathology there is a separate file/instance for each resolution layer of the image pyramid. When using the IDC Portal you will never encounter individual instances - you will only see them if you download data to your computer.
Analysis results collections are an important concept in IDC. These collections contain analysis results that were not contributed as part of any specific collection. Such analysis results might be contributed by investigators unrelated to those who submitted the analyzed images, and may span images across multiple collections.
Whenever you work with IDC data, you should be aware of the data release version. If you build cohorts using filters or queries, the results of those queries will change as IDC content evolves. Building queries that refer to a specific data release version ensures that the result remains the same.
Here is how you can learn what version of IDC data you are interacting with, depending on what interface to the data you are using:
IDC Portal: data version and release date are displayed in the summary strip
idc-index: use the get_idc_version()
function (see the example after this list)
3D Slicer / SlicerIDCBrowser: version information is provided in the SlicerIDCBrowser module top panel, and in the pop-up window title.
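For example, with idc-index installed, you can print the data release version from the command line (a minimal sketch; the import path reflects recent idc-index releases, so treat it as an assumption):
python -c "from idc_index import index; print(index.IDCClient().get_idc_version())"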
The IDC obtains curated DICOM radiology, pathology and microscopy image and analysis data from The Cancer Imaging Archive (TCIA) and additional sources. Data from all these sources evolves over time as new data is added (common), existing files are corrected (rare), or data is removed (extremely rare).
Users interact with IDC using one of the following interfaces to define cohorts, and then perform analyses on these cohorts:
The goal of IDC versioning is to create a series of "snapshots" over time of the entirety of the evolving IDC imaging dataset, such that searching an IDC version according to some criteria (creating a cohort) will always identify exactly the same set of objects. Here "identify" particularly means providing URLs or other access methods to the corresponding physical data objects.
In order to reproduce the result of such analysis, it must be possible to precisely recreate a cohort. For this purpose an IDC cohort as defined in the Portal is specified and saved as a filter applied against a specified IDC data version. Alternatively, the cohort can be defined as an SQL query, or as a list of unique identifiers selecting specific files within a defined data release version.
Because an IDC version exactly defines the set of data against which the filter/query is applied, and because all versions of all data, except data removed due to PHI/PII concerns, should continue to be available, a cohort is therefore persistent over the course of the evolution of IDC data.
There are various reasons that can cause modification of existing collections in IDC:
images for new patients can be added to an existing collection;
additional DICOM series are sometimes added to a DICOM study over time (e.g., series that contain new annotations or analysis results);
a series may be added to or removed from an existing study;
metadata of an existing instance might be corrected (which may or may not lead to an update of the DICOM SOPInstanceUID
corresponding to the instance).
These and other possible changes mean that DICOM instances, series and studies can change from one IDC data version to the next, while their DICOM UIDs remain unchanged. This motivates the need for maintaining versioning of the DICOM entities.
The data in each IDC version, then, can be thought of as some set of versioned DICOM instances, series and studies. This set is defined in terms of the corresponding set of instance UUIDs, series UUIDs and study UUIDs. This means that if, e.g., some version of an instance having UUID UUIDx that was in IDC version Vm is changed, a new UUID, UUIDy, will be assigned to the new instance version. Subsequent IDC versions, Vm+1, Vm+2, ... will include that new instance version identified by UUIDy unless and until that instance is again changed. Similarly if the composition of some series changes, either because an instance in the series is changed, or an instance is added or removed from that series, a new UUID is assigned to the new version of that series and identifies that version of the series in subsequent IDC versions. Similarly, a study is assigned a new UUID when its composition changes.
A corollary is that only a single version of an instance, series or study is in an IDC version.
Note that instances, series and studies do not have an explicit version number in their metadata. Versioning of an object is implicit in the associated UUIDs.
This is a typical IDC UUID of a (version of a) DICOM instance:
641121f1-5ca0-42cc-9156-fb5538c14355
and this is the corresponding DRS ID:
dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355
A DRS ID can be resolved by appending it to the following URL, which is the resolution service within CRDC: https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/. For example, the following curl
command:
>> curl https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355
returns this DrsObject:
As can be seen, the access_methods
component in the returned DrsObject includes a URL for each of the corresponding files in Google GCS and AWS S3.
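For example, the download URLs alone can be extracted from the returned DrsObject with a one-liner like the following (a sketch that assumes jq is installed; the field names follow the GA4GH DRS v1 schema):
curl -s https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355 | jq -r '.access_methods[].access_url.url'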
v1 of IDC followed a different data layout than subsequent versions. Since the corresponding items are still available, we document it here for reference.
Storage buckets are named using the format idc-tcia-<TCIA_COLLECTION_NAME>
, where TCIA_COLLECTION_NAME
corresponds to the collection name in the collections table here.
Within the bucket, DICOM files are organized using the following directory naming conventions:
dicom/<StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm
where the *InstanceUID
components correspond to the respective values of the DICOM attributes in the stored DICOM files.
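For example, an instance from the TCGA-LUAD collection would have been addressed as follows (the UIDs are shown as placeholders):
gs://idc-tcia-tcga-luad/dicom/<StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm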
Egress of IDC data out of the cloud is free, since IDC data participates in the Google Public Datasets Program!
Due to existing limitations of the Google Healthcare API, not all of the DICOM attributes are extracted and made available in the BigQuery tables. Specifically:
sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and from the RetrieveMetadata output. 1 MiB is not an exact limit, but can be used as a rough estimate of whether the API will drop the tag (this limitation was not documented as of this writing). We know that some of the instances in IDC are affected by this limitation. According to communication with Google Healthcare support, a fix is targeted for sometime in 2021.
IDC users can access this table to conduct detailed exploration of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal.
In addition to the DICOM metadata tables, we maintain several additional tables that curate non-DICOM metadata (e.g., attribution of a given item to a specific collection and DOI, collection-level metadata, etc.).
In addition to the DICOM data, some of the image-related data hosted by IDC is stored in additional tables. These include the following:
Depending on whether you would like to download data interactively or programmatically, we provide two recommended tools to help you.
With the idc-index
package you get command line scripts that aim to make download simple.
Have a .s5cmd manifest file you downloaded from the IDC Portal or from the records in the IDC Zenodo community? Get the corresponding files as follows (you will also get a download progress bar, and the downloaded files will be organized in the collection/patient/study/series folder hierarchy!):
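A sketch of such a command (the idc command line tool ships with idc-index; the exact subcommand form is an assumption, so consult idc --help):
idc download manifest.s5cmd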
You can use the same command to download files corresponding to any collection, patient, study or series - simply pass the identifiers that you can copy from the portal!
Once installed, you can use SlicerIDCBrowser in one of two modes:
As an interface to explore IDC data: you can select individual collections, cases and DICOM studies, and download items of interest directly into 3D Slicer for subsequent visualization and analysis.
As a download tool: download IDC content based on the manifest you created using the IDC Portal, or on identifiers of individual cases, DICOM studies or series.
In a future release of IDC we will by default exclude limited access items from what you select in the portal, so the portal selection should be more intuitive. But if you access the data via BigQuery queries, you will need to know that "Limited" access items are not accessible, and account for this in your query.
Storage Buckets
The flat address space of IDC DICOM objects in GCS storage is accompanied by BigQuery tables that allow the researcher to reconstruct the DICOM hierarchy as it exists for any given version. There are also several BQ tables and views in which we keep copies of the metadata exposed via the TCIA interface at the time a version was captured and other pertinent information.
There is an instance of each of the following tables and views per IDC version. The set of tables and views corresponding to an IDC version are collected in a single BQ dataset per IDC version, bigquery-public-data.idc_<idc_version_number>
where bigquery-public-data
is the project in which the dataset is hosted. As an example, the BQ tables for IDC version 4 are in the bigquery-public-data.idc_v4
dataset.
In addition to the per-version datasets, the bigquery-public-data.idc_current
dataset consists of a set of BQ views. There is a view for each table or view in the BQ dataset corresponding to the current IDC release. Each such view in bigquery-public-data.idc_current
is named identically to the corresponding table or view in the bigquery-public-data.idc_<idc_version_number> dataset of the current IDC release and can be used to access that table or view.
Several Google BigQuery (BQ) tables support searches against metadata extracted from the data files. Additional BQ tables define the composition of each IDC data version.
We maintain several additional tables that curate non-DICOM metadata (e.g., attribution of a given item to a specific collection and DOI, collection-level metadata, etc.).
tcia_api_collection_id:
The ID, as accepted by the TCIA API, of the original data collection containing this instance
idc_webapp_collection_id:
The ID, as accepted by the IDC web app, of the original data collection containing this instance
collection_timestamp:
Datetime when the IDC data in the collection was last revised
source_doi:
A DOI of the TCIA wiki page corresponding to the original data collection or analysis results that is the source of this instance
collection_hash
: The md5 hash of the sorted patient_hashes
of all patients in the collection containing this instance
collection_init_idc_version:
The IDC version in which the collection containing this instance first appeared
collection_revised_idc_version:
The IDC version in which the collection containing this instance was most recently revised
Patient attributes:
submitter_case_id:
The ID, assigned by the submitter of the data to TCIA, of the patient containing this instance. This is the DICOM PatientID
idc_case_id:
IDC generated UUID that uniquely identifies the patient containing this instance
This is needed because DICOM PatientIDs are not required to be globally unique
patient_hash
: the md5 hash of the sorted study_hashes
of all studies in the patient containing this instance
patient_init_idc_version:
The IDC version in which the patient containing this instance first appeared
patient_revised_idc_version:
The IDC version in which the patient containing this instance was most recently revised
Study attributes:
StudyInstanceUID:
DICOM UID of the study containing this instance
study_uuid:
IDC assigned UUID that identifies a version of the study containing this instance.
study_instances:
The number of instances in the study containing this instance
study_hash
: the md5 hash of the sorted series_hashes
of all series in the study containing this instance
study_init_idc_version:
The IDC version in which the study containing this instance first appeared
study_revised_idc_version:
The IDC version in which the study containing this instance was most recently revised
Series attributes:
SeriesInstanceUID:
DICOM UID of the series containing this instance
series_uuid:
IDC assigned UUID that identifies a version of the series containing this instance
source_doi:
A DOI of the TCIA wiki page corresponding to the original data collection or analysis results that is the source of this instance
series_instances:
The number of instances in the series containing this instance
series_hash
: the md5 hash of the sorted instance_hashes
of all instances in the series containing this instance
series_init_idc_version:
The IDC version in which the series containing this instance first appeared
series_revised_idc_version:
The IDC version in which the series containing this instance was most recently revised
Instance attributes:
SOPInstanceUID:
DICOM UID of this instance.
instance_uuid:
IDC assigned UUID that identifies a version of this instance.
gcs_url:
The GCS URL of a file containing the version of this instance that is identified by the instance_uuid
instance_hash
: the md5 hash of the version of this instance that is identified by the instance_uuid
instance_size:
the size, in bytes, of this version of the instance that is identified by the instance_uuid
instance_init_idc_version:
The IDC version in which this instance first appeared
instance_revised_idc_version:
The IDC version in which this instance was most recently revised
license_url:
The URL of a web page that describes the license governing this instance
license_long_name:
A long form name of the license governing this instance
license_short_name:
A short form name of the license governing this instance
Due to existing limitations of the Google Healthcare API, not all of the DICOM attributes are extracted and made available in the BigQuery tables. Specifically:
sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and from the RetrieveMetadata output. 1 MiB is not an exact limit, but can be used as a rough estimate of whether the API will drop the tag (this limitation was not documented as of this writing). We know that some of the instances in IDC are affected by this limitation. According to communication with Google Healthcare support, a fix is targeted for sometime in 2021.
tcia_api_collection_id:
The collection ID as accepted by the TCIA API
tcia_wiki_collection_id:
The collection ID as on the TCIA wiki page
idc_webapp_collection_id:
The collection ID as accepted by the IDC web app
Program:
The program to which this collection belongs
Updated:
Most recent update date reported by TCIA
Status:
Collection status" Ongoing or complete
Access:
Collection access conditions: Limited or Public
ImageType:
Enumeration of image types/modalities in the collection
Subjects:
Number of subjects in the collection
DOI:
DOI that can be resolved at doi.org to the TCIA wiki page for this collection
CancerType:
TCIA assigned cancer type of this collection
SupportingData:
Type(s) of additional data available
Species:
Species of collection subjects
Location:
Body location that was studied
Description:
TCIA description of the collection (HTML format)
license_url:
The URL of a web page that describes the license governing this collection
license_long_name:
A long form name of the license governing this collection
license_short_name:
A short form name of the license governing this collection
ID:
Results ID
Title:
Descriptive title
DOI:
DOI that can be resolved at doi.org to the TCIA wiki page for this analysis result
CancerType:
TCIA assigned cancer type of this analysis result
Location:
Body location that was studied
Subjects:
Number of subjects in the analysis result
Collections:
Original collections studied
AnalysisArtifactsonTCIA:
Type(s) of analysis artifacts generated
Updated:
Date when the results were last updated
license_url:
The URL of a web page that describes the license governing this collection
license_long_name:
A long form name of the license governing this collection
license_short_name:
A short form name of the license governing this collection
cancer-idc.idc_v<version_number>.version_metadata
(also available via the canceridc-data.idc-current.version_metadata view for the current version of IDC data). Metadata for each IDC version, one row per version:
idc_version: IDC version number
version_hash: MD5 hash of hashes of collections in this version
version_timestamp: Version creation timestamp
bigquery-public-data.idc_v<idc_version_number>.measurement_groups
(also available via view for the current version of IDC data) Measurement group sequences extracted from the DICOM SR TID1500 objects
The following tables contain TCGA-specific metadata:
tcga_biospecimen_rel9:
biospecimen metadata
tcga_clinical_rel9:
clinical metadata
Some of the collections are accompanied by BigQuery tables that have not been harmonized to a single data model. Those tables are available within the BigQuery dataset corresponding to a given release, and will have the name prefix corresponding to the short name of the collection. The list below discusses those collection-specific tables.
In addition to the DICOM data, some of the image-related data hosted by IDC is stored in additional tables. These include the following:
By clinical data we refer to the broad spectrum of image-related data that may accompany images. Such data may include demographics of the patients, observations related to their clinical history (therapies, diagnoses, findings), lab tests, and surgeries.
Not only are the terms used in the clinical data accompanying individual collections not harmonized, but the format of the spreadsheets is also collection-specific. In order to search and navigate clinical data, one has to parse those collection-specific tables, and there is no interface to support searching across collections.
collection_id
(STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id
field in the dicom_all
view in the idc_current
dataset.
table_name
(STRING, NULLABLE) - name of the table
table_description
(STRING, NULLABLE) - description of the type of data found in the table. Usually this is set to 'clinical data', unless a description is provided in the source files
idc_version_table_added
(STRING, NULLABLE) - the IDC data version for which this table was first added
idc_table_added_datetime
(STRING, NULLABLE) - the date/time this particular table was first generated
post_process_src
(STRING, NULLABLE) - except for the CPTAC and TCGA collections, the tables are curated from ZIP, Excel, and CSV files downloaded from the TCIA wiki. These files do not have a consistent structure and were not meant to be machine readable or to translate directly into BigQuery. A semi-manual curation process results in either a CSV or JSON file that can be directly written into a BigQuery table. post_process_src is the name of the JSON or CSV file that results from this process and is used to create the BigQuery table. This field is not used for the CPTAC- and TCGA-related tables
post_process_src_add_md5
(STRING, NULLABLE) - the md5 hash of post_process_src when the table was first added
idc_version_table_prior
(STRING, NULLABLE) - the IDC version at the second most recent time the table was updated
post_process_src_prior_md5
(STRING, NULLABLE) - the md5 hash of post_process_src the second most recent time the table was updated
idc_version_table_updated
(STRING, NULLABLE) - the IDC version when the table was last updated
table_update_datetime
(STRING, NULLABLE) - date and time an update of the table was last recorded
post_process_src_updated_md5
(STRING, NULLABLE) - the md5 hash of post_process_src when the table was last updated
number_batches
(INTEGER, NULLABLE) - records the number of batches. Within the source data, patients are sometimes grouped into different 'batches' (e.g., training vs. test, or responder vs. non-responder), and the batches are placed in different locations (e.g., different files, or different sheets in the same Excel file)
source_info
(RECORD, REPEATED) - an array of records with information about the table sources. These sources are either files downloaded from the TCIA wiki or another BigQuery table (as is the case for CPTAC and TCGA collections). There is a source_info record for each source 'batch' described above
source_info.srcs
(STRING, REPEATED) - a source file downloaded from the TCIA wiki may be a ZIP file, a CSV file, or an Excel file. Sometimes the ZIP files contain other ZIP files that must be opened to extract the clinical data. In the source_info.srcs
array, the first string is the file that is downloaded from TCIA for this particular source batch. The final string is the CSV or Excel file that contains the clinical data. Any intermediate strings are the names of ZIP files 'in between' the downloaded file and the clinical file. For CPTAC and TCGA collections this field contains the source BigQuery table
source_info.md5
(STRING, NULLABLE) - md5 hash of the file downloaded from TCIA at the most recent time the table was updated
source_info.table_last_modified
(STRING, NULLABLE) - CPTAC and TCGA collections only. The date and time the source BigQuery table was most recently modified, as recorded when last copied
source_info.table_size
(STRING, NULLABLE) - CPTAC and TCGA collections only. The size of the source BigQuery table as recorded when last copied
collection_id
(STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to the idc_webapp_id
field in the dicom_all
view in the idc_current
dataset.
case_col
(BOOLEAN, NULLABLE) - true if the BigQuery column contains the patient or case id, i.e. if this column is used to determine the value of the dicom_patient_id
column
table_name
(STRING, NULLABLE) - table name
column
(STRING, NULLABLE) - the actual column name in the table. For ACRIN collections the column name
is the variable_name
from the provided data dictionary. For other collections it is a name constructed by 'normalizing' the column_label
(see next) into a format that can be used as a BigQuery field name
column_label
(STRING, NULLABLE) - a 'free form' label for the column that does not need to conform to the BigQuery column format requirements. For ACRIN collections this is the variable_label
given by a data dictionary that accompanies the collection. For other collections it is the name or label of the clinical attribute as inferred from the source document during the curation process
data_type
(STRING, NULLABLE) - the type of data in this column. Again for ACRIN collections this is provided in the data dictionary. For other collections it is inferred by analyzing the data during curation
original_column_headers
(STRING, REPEATED) - the name(s) or label(s) in the source document that were used to construct the column_label
field. In most cases there is one column label in the source document that prescribes the column_label
. In some cases, multiple columns are concatenated and reformatted to form the column_label
values
(RECORD, REPEATED) - a structure that is borrowed from the ACRIN data model. This is an array that contains observed attribute values for this column. For ACRIN collections these values are reported in the data dictionary. For most other collections these values are determined by analyzing the source data. For simplicity this field is left blank when the number of unique values is greater than 20
values.option_code
(STRING, NULLABLE) - a unique attribute value found in this column
values.option_description
(STRING, NULLABLE) - a description of the option_code
as provided by a data dictionary. For collections that do not have a data dictionary this is null.
values_source
(STRING, NULLABLE) - indicates the source of the values
records. The text 'provided dictionary' indicates that the records were obtained from a provided data dictionary. The text 'derived from inspection of values' indicates that the records were determined by automated analysis of the source materials during the ETL process that generated the BigQuery tables.
files
(STRING, REPEATED) - names of the files that contain the source data for each batch. These are the Excel or CSV files directly downloaded from TCIA, or the files extracted from downloaded ZIP files
sheet_names
(STRING, REPEATED) - for Excel-sourced files, the sheet names containing this column's values for each batch
batch
(INTEGER, REPEATED) - source batches that contain this particular column. Some columns or attributes may be missing from some batches
column_numbers
(STRING, REPEATED) - for each source batch, the column in the original source corresponding to this column in the BigQuery table
Check out our tutorial for a brief hands-on introduction to IDC clinical data!
Clinical data is often critical in understanding imaging data, and is essential for the development and validation of imaging biomarkers. However, such data is most often stored in spreadsheets that follow conventions specific to the site that collected the data, may not be accompanied by a dictionary defining the terms used in describing clinical data, and is rarely harmonized. This can be observed in various collections ingested into IDC from The Cancer Imaging Archive (TCIA), such as the .
With release v11 of IDC, we attempt to lower the barriers to accessing clinical data accompanying IDC imaging collections. We parse the collection-specific tables and organize the underlying data into BigQuery tables that can be accessed using standard SQL queries. You can also see the summary of clinical data available for IDC collections in .
As of Version 11, IDC provides a public dataset with clinical data associated with several of its imaging collections. The clinical data tables associated with a particular version are in the dataset bigquery-public-data.idc_<idc_version_number>_clinical
. In addition the dataset bigquery-public-data.idc_current_clinical
has an identically named view for each table in the BQ clinical dataset corresponding to the current IDC release.
There are currently 130 tables with clinical data representing 70 different collections. Most of this data was curated from Excel and CSV files downloaded from . For most collections, data is placed in a single table named <collection_id>_clinical
, where <collection_id>
is the name of the collection in a standardized format (i.e. the idc_webapp_collection_id
column in the dicom_all
view in the ).
Collections from the ACRIN project have different types of clinical data spread across CSV files, and so this data is represented by several BigQuery tables. The clinical data for collections in the program is not curated from TCIA but instead is copied from a in the ISB-CGC project, which in turn was sourced from the . Similarly clinical data for collections in the is copied from the table tcga_clinical_rel9
in the idc_current
dataset, which was also created using the . Every clinical data table contains two fields we have introduced, dicom_patient_id
and source_batch
. dicom_patient_id
is identical to the PatientID
field in the DICOM files that correspond to the given patient. The dicom_patient_id
value is determined by inspecting the patient column in the clinical data file. In some of the collections' clinical data, the patients are separated into different 'batches' i.e. different source files, or different sheets in the same Excel file. The source_batch
field is an integer indicating the 'batch' for the given patient. For most collections, in which all patients' data is found in the same location, the source_batch
value is zero.
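For example, a hypothetical query against one of the clinical tables (the collection id is a placeholder; dicom_patient_id and source_batch are the fields introduced above):
bq query --use_legacy_sql=false 'SELECT dicom_patient_id, source_batch FROM `bigquery-public-data.idc_current_clinical.<collection_id>_clinical` LIMIT 5'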
Most of the clinical tables can be interpreted on their own. Tables from the ACRIN collection are an exception, as the column names and some of the column values are coded. To provide clarity and ease of use of all clinical data, we have created two metadata tables, and , that provide information about the structure and provenance of all data in this dataset. table_metadata
has table-level metadata about each clinical collection, while column_metadata
has column-level metadata.
Structure of the table:
Structure of table:
IDC relies on the DICOM data model for organizing images and image-derived data. At the same time, IDC includes certain attributes and data types that are outside of the DICOM data model. The Entity-Relationship (E-R) diagram and examples below summarize a simplified view of the IDC data model (you will find the explanation of how to interpret the notation used in this E-R diagram in the Mermaid documentation).
Collections are organized into Programs, which group related collections, or those collections that were contributed under the same funding initiative or a consortium. Example: TCGA program contains TCGA-GBM, TCGA-BRCA and other collections. You will see Collections nested under Programs in the upper left section of the . You will also see the list of collections that meet the filter criteria in the top table on the right-hand side of the portal interface.
IDC updates its data offering at intervals of 2-4 months, with the timing of data releases driven by the availability of new data, updates of existing data, introduction of new capabilities, and various priority considerations. You can see the historical summary of IDC releases in .
BigQuery: within the bigquery-public-data
project, the idc_current
dataset contains table "views" that effectively provide an alias for the latest IDC data release. To find the actual IDC data release number, expand the list of datasets under the bigquery-public-data
project, and search for the ones that follow the pattern `idc_v<number>`. The one with the largest number corresponds to the latest released version, and will match the content in idc_current
(related Google bug ).
directly or using : while this approach is most convenient, it allows searching using a small subset of attributes, defines cohorts only in terms of cases that meet the defined criteria, and has very limited options for combining multiple search criteria
tables via : this approach is most powerful, as it allows the use of to define the cohort, while leveraging the expressiveness of SQL in defining the selection logic, and allows defining a cohort at any level of the data model hierarchy (i.e., instances, series, studies or cases)
Because DICOM SOPInstanceUIDs
, SeriesInstanceUIDs
or StudyInstanceUIDs
can remain invariant even when the composition of an instance, series or study changes, IDC assigns each version of each instance, series or study a UUID to uniquely identify it and differentiate it from other versions of the same DICOM object.
As we will see in , the UUID of a (version of an) instance, and the UUID of the (version of a) series to which it belongs, are used in forming the object (file) name of the corresponding GCS and AWS objects. In addition, each instance version has a corresponding GA4GH DRS object, identified by a GUID based on the instance version's UUID. Refer to the section for details.
As described in the section, a UUID identifies a particular version of an IDC data object. Thus, there is a UUID for every version of every DICOM instance in IDC hosted data. An IDC BigQuery manifest optionally includes the UUID (called a crdc_instance_uuid) of each instance (version) in the cohort.
From the specification:
Each such UUID can be used to form a GUID that has been indexed by the Data Commons Framework (DCF), and can be used to access the data that defines that object. In particular this data includes the GCS and AWS URLs of the DICOM instance file. Though the GCS or AWS URL of an instance might change over time, the UUID of an instance can always be resolved to obtain its current URLs. Thus, for long term curation of data, it is recommended to record instance UUIDs.
The data object returned by the server is a GA4GH DRS DrsObject:
The IDC approach to storage and management of DICOM data relies on the Google Cloud Platform . We maintain three representations of the data, which are fully synchronized and correspond to the same dataset, but are intended to serve different use cases.
In order to access the resources listed below, it is assumed you have completed the to access Google Cloud console!
All of the resources listed below are accessible under the .
Storage Buckets are basic containers in Google Cloud that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage documentation ).
You can read about accessing GCP storage buckets from a Compute VM .
Assuming you have a list of GCS URLs in gcs_paths.txt
, you can download the corresponding items using the command below, substituting $PROJECT_ID
with the valid GCP Project ID (see the complete example in ):
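A sketch of such a command using gsutil (assumes one gs:// URL per line in gcs_paths.txt; the -u flag attributes requester-pays charges, if any, to $PROJECT_ID):
cat gcs_paths.txt | gsutil -u $PROJECT_ID -m cp -I .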
Google BigQuery is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using SQL queries.
IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions of how DICOM attributes of various types are converted into BQ form are covered in the Healthcare API documentation article.
sequences that have more than 15 levels of nesting are not extracted (see ) - we believe this limitation does not affect the data stored in IDC
: DICOM metadata for all of the data hosted by IDC
: collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in
: collection-level metadata for the TCIA analysis collections hosted by IDC, for the most part corresponding to the content available in
In addition to the tables above, we provide the following views (virtual tables defined by queries) that extract specific subsets of metadata, or combine attributes across different tables, for the convenience of the users
: DICOM metadata together with the collection-level metadata
: attributes of the segments stored in DICOM Segmentation objects
: measurement group sequences extracted from the DICOM SR TID1500 objects
: coded evaluation results extracted from the DICOM SR TID1500 objects
: quantitative evaluation results extracted from the DICOM SR TID1500 objects
The IDC MVP utilizes a single Google Healthcare DICOM store to host all of the collections. That store, however, is primarily intended to support visualization of the data using the OHIF Viewer. At this time, we do not support user access to the hosted data via the DICOMweb interface. See more details in the , and please comment about your use case if you have a need to access data via the DICOMweb interface.
BigQuery TCGA clinical data: . Note that this table is hosted under the ISB-CGC Google project, as documented , and its location may change in the future!
If you have questions or feedback about the download tools provided by IDC, please reach out via our - we are very interested in hearing your feedback and suggestions!
is a Python package designed to simplify access to IDC data. Assuming you have Python installed on your computer (if for some reason you do not have Python, you can check out the legacy download instructions ), you can get this package with pip
like this:
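pip install --upgrade idc-index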
Once installed, you can use it to explore, search, select and download the corresponding files, as shown in the examples below. You can also take a look at a short tutorial on using idc-index.
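A minimal sketch of both the Python API and the bundled command line tool (the method and subcommand names reflect recent idc-index releases; treat them as assumptions and consult the package documentation):
# Python API: download all files of a series into the current directory
python -c "from idc_index import index; index.IDCClient().download_dicom_series(seriesInstanceUID='<SeriesInstanceUID>', downloadDir='.')"
# command line equivalent
idc download <SeriesInstanceUID>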
idc-index
includes a variety of other helper functions, such as downloading from a manifest created using the IDC Portal, automatic generation of viewer URLs, information about the disk space needed for a given collection, and more. We are very interested in your feedback to define additional functionality to add to this package! Please reach out via if you have any suggestions.
is a free open source, cross-platform, extensible desktop application developed to support a variety of medical imaging research use cases.
IDC maintains , an extension of 3D Slicer, developed to support direct access to IDC data from your desktop. You will need to install a recent 3D Slicer 5.7.0 preview application (installers are available for Windows, Mac and Linux), and then use the 3D Slicer ExtensionManager to install the SlicerIDCBrowser extension. Take a look at the quick demo video in if you have never used the 3D Slicer ExtensionManager before.
As discussed in this community forum post, some data was moved from public access collections to limited access. At the moment, we still keep those files that used to be public in IDC before the decision made by TCIA, and the metadata for those files is still accessible in our BigQuery tables, but you cannot download those "Limited" access files referenced by gcs_url
from IDC.
As discussed in the , the issue will manifest itself as an error when accessing a gcs_url
that corresponds to a non-public file:
has a column named access
, which takes values Public
or Limited
that define if the file corresponding to the instance can be accessed. For all practical purposes, if you interact with the IDC BigQuery tables, you should make sure you exclude “Limited” access items using the following clause in your query:
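A minimal form of such a clause, given the access column described above, is:
WHERE access = "Public"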
Storage Buckets are basic containers in Google Cloud that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage documentation ).
All IDC DICOM file data for all IDC data versions and all of the are maintained in Google Cloud Storage (GCS). Currently all DICOM files are maintained in GCS buckets that allow for free egress within or out of the cloud, enabled through the partnership of IDC with .
The object namespace is flat: every object name is composed of a standard format CRDC UUID with the ".dcm" file extension. For example, the instance 905c82fd-b1b7-4610-8808-b0c8466b4dee.dcm can be accessed as gs://idc-open/905c82fd-b1b7-4610-8808-b0c8466b4dee.dcm
You can read about accessing GCP storage buckets from a Compute VM .
Egress of IDC data out of the cloud is free, since IDC data is participating in !
Typically, the user would not interact with the storage buckets to select and copy files (unless the intent is to copy the entire content hosted by IDC). Instead, one should use either the IDC Portal or IDC BigQuery tables containing file metadata, to identify items of interest and define a cohort. The cohort manifest generated by the IDC Portal can include both the Google Storage URLs for the corresponding files in the bucket, and the , which can be resolved to the Google Storage URLs to access the files.
Assuming you have a list of GCS URLs in a file gcs_paths.txt
, you can download the corresponding items using the command below, substituting $PROJECT_ID
with the valid GCP Project ID (see the complete example in ):
Google BigQuery is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using SQL queries.
bigquery-public-data.idc_v<idc_version_number>.auxiliary_metadata
(also available via the view.) This table defines the contents of the corresponding IDC version. There is a row for each instance in the version.
Collection attributes:
bigquery-public-data.idc_v<idc_version_number>.dicom_metadata
(also available via view for the current version of IDC data) DICOM metadata for each instance in the corresponding IDC version. IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions of how DICOM attributes of various types are converted into BQ form are covered in the Google Healthcare API documentation article. IDC users can access this table to conduct detailed exploration of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal. The schema is too large to document here. Refer to the BQ table and the above referenced documentation.
sequences that have more than 15 levels of nesting are not extracted (see ) - we believe this limitation does not affect the data stored in IDC
bigquery-public-data.idc_v<idc_version_number>.original_collections_metadata
(also available via the view) This table contains collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in . One row per collection:
bigquery-public-data.idc_v<idc_version_number>.analysis_results_metadata
(also available via the view for the current version of IDC data) Metadata for the TCIA analysis results hosted by IDC, for the most part corresponding to the content available in . One row per analysis result:
The following views (virtual tables defined by queries) extract specific subsets of metadata, or combine attributes across different tables, for the convenience of the users
bigquery-public-data.idc_v<idc_version_number>.dicom_all
(also available via view for the current version of IDC data) DICOM metadata together with selected auxiliary and collection metadata
bigquery-public-data.idc_v<idc_version_number>.segmentations
(also available via view for the current version of IDC data) Attributes of the segments stored in DICOM Segmentation objects
bigquery-public-data.idc_v<idc_version_number>.measurement_groups
(also available via view for the current version of IDC data) Measurement group sequences extracted from the DICOM SR TID1500 objects
bigquery-public-data.idc_v<idc_version_number>.qualitative_measurements
(also available via view for the current version of IDC data) Coded evaluation results extracted from the DICOM SR TID1500 objects
bigquery-public-data.idc_v<idc_version_number>.quantitative_measurements
(also available via view for the current version of IDC data) Quantitative evaluation results extracted from the DICOM SR TID1500 objects
IDC hosts a subset of the NLST clinical data, which was cleared for public sharing. If you need the full clinical data, please visit the .
The following tables contain NLST specific metadata. The detailed schema of those tables is available from the .
``: "Lung Cancer"
``: "SCT Abnormalities"
``: "SCT Comparison Abnormalities"
``: "Participant"
``: "SCT Screening"
IDC utilizes a single Google Healthcare DICOM store to host all of the instances in the current IDC version. That store, however, is primarily intended to support visualization of the data using the OHIF Viewer. At this time, we do not support user access to the hosted data via the DICOMweb interface. See more details in the , and please comment about your use case if you have a need to access data via the DICOMweb interface.
BigQuery TCGA clinical data: . Note that this table is hosted under the ISB-CGC Google project, as documented , and its location may change in the future!
Check out our tutorial for a brief hands-on introduction to IDC clinical data! You can also see the high-level summary of the clinical data attributes accompanying IDC data in .
Clinical data is often critical in understanding imaging data, and is essential for the development and validation of imaging biomarkers. However, such data is most often stored in spreadsheets that follow conventions specific to the site that collected the data, may not be accompanied by a dictionary defining the terms used in describing clinical data, and is rarely harmonized. This can be observed in various collections ingested into IDC from The Cancer Imaging Archive (TCIA), such as the .
With release v11 of IDC, we attempt to lower the barriers to accessing clinical data accompanying IDC imaging collections. We parse the collection-specific tables and organize the underlying data into BigQuery tables that can be accessed using standard SQL queries. You can also see the summary of clinical data available for IDC collections in .
As of Version 11, IDC provides a public dataset with clinical data associated with several of its imaging collections. The clinical data tables associated with a particular version are in the dataset bigquery-public-data.idc_<idc_version_number>_clinical
. In addition the dataset bigquery-public-data.idc_current_clinical
has an identically named view for each table in the BQ clinical dataset corresponding to the current IDC release.
There are currently 130 tables with clinical data representing 70 different collections. Most of this data was curated from Excel and CSV files downloaded from . For most collections, data is placed in a single table named <collection_id>_clinical
, where <collection_id>
is the name of the collection in a standardized format (i.e. the idc_webapp_collection_id
column in the dicom_all
view in the ).
Collections from the ACRIN project have different types of clinical data spread across CSV files, and so this data is represented by several BigQuery tables. The clinical data for collections in the program is not curated from TCIA but instead is copied from a in the ISB-CGC project, which in turn was sourced from the . Similarly clinical data for collections in the is copied from the table tcga_clinical_rel9
in the idc_current
dataset, which was also created using the . Every clinical data table contains two fields we have introduced, dicom_patient_id
and source_batch
. dicom_patient_id
is identical to the PatientID
field in the DICOM files that correspond to the given patient. The dicom_patient_id
value is determined by inspecting the patient column in the clinical data file. In some of the collections' clinical data, the patients are separated into different 'batches' i.e. different source files, or different sheets in the same Excel file. The source_batch
field is an integer indicating the 'batch' for the given patient. For most collections, in which all patients' data is found in the same location, the source_batch
value is zero.
Most of the clinical tables can be interpreted on their own. Tables from the ACRIN collection are an exception, as the column names and some of the column values are coded. To provide clarity and ease of use of all clinical data, we have created two metadata tables, and , that provide information about the structure and provenance of all data in this dataset. table_metadata
has table-level metadata about each clinical collection, while column_metadata
has column-level metadata.
Structure of the table:
Structure of table:
With this approach, you will follow a 2-step process covered on this page:
Step 2: given the manifest, download files to your computer or to a cloud VM using the s5cmd
command line tool.
Start with the query templates provided below, modify them based on your needs, and save the result in a file query.txt
. The specific values for PatientID
, SeriesInstanceUID
, StudyInstanceUID
are chosen to serve as examples.
Queries below demonstrate how to get the storage URLs used to download the cohort files.
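A sketch of one such template (a hypothetical query shape: it assumes the dicom_all view exposes the series_aws_url column named below, and the UID value is a placeholder). Emitting ready-made cp commands makes the resulting manifest directly consumable by s5cmd run in Step 2:
cat > query.txt <<'EOF'
SELECT DISTINCT CONCAT('cp ', series_aws_url, ' .') AS s5cmd_command
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE SeriesInstanceUID = '<SeriesInstanceUID>'
-- the WHERE clause can instead filter by PatientID or StudyInstanceUID
EOF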
If you want to download the files corresponding to the cohort from GCP instead of AWS, substitute series_gcp_url
for series_aws_url
in the SELECT
statement of the query, such as in the following SELECT clause:
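For example (mirroring the hypothetical template above):
SELECT DISTINCT CONCAT('cp ', series_gcp_url, ' .') AS s5cmd_command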
Next, use a Google Cloud SDK bq query
command (from the command line) to run the query and save the result into a manifest file - the list of storage URLs (or, as in the sketch above, ready-made copy commands) that can be used to download the data.
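A sketch of such an invocation (standard bq flags; tail strips the CSV header row):
bq query --use_legacy_sql=false --format=csv --max_rows=1000000 < query.txt | tail -n +2 > manifest.s5cmd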
Make sure you adjust the --max_rows
parameter in the queries above to equal or exceed the number of rows in the result of the query, otherwise your list will be truncated!
You can also get the total disk space that will be needed for the files that you will be downloading:
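One hypothetical way to do this, assuming the dicom_all view exposes the instance_size column (size in bytes) described in the BigQuery metadata section:
bq query --use_legacy_sql=false 'SELECT ROUND(SUM(instance_size)/POW(2,30), 2) AS disk_space_GiB FROM `bigquery-public-data.idc_current.dicom_all` WHERE SeriesInstanceUID = "<SeriesInstanceUID>"'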
You can verify that your setup was successful by running the following command; it should download one file from IDC.
Once s5cmd
is installed, you can use the s5cmd run
command to download the files corresponding to the manifest.
If you defined a manifest that references AWS buckets:
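# a sketch; manifest.s5cmd is the manifest produced in Step 1
s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com run manifest.s5cmd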
If you defined a manifest that references GCP buckets, you will need to specify the GCS endpoint:
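# a sketch; note the GCS endpoint
s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run manifest.s5cmd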
The slides below give a quick guided overview of how you can use IDC Portal.
No login is required to use the portal, to visualize images, or to download data from IDC!
Due to existing limitations of the Google Healthcare API, not all of the DICOM attributes are extracted and made available in the BigQuery tables. Specifically:
sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and from the RetrieveMetadata output. 1 MiB is not an exact limit, but can be used as a rough estimate of whether the API will drop the tag (this limitation was not documented as of this writing). We know that some of the instances in IDC are affected by this limitation. According to communication with Google Healthcare support, a fix is targeted for sometime in 2021.
In the following subsections you will find notebooks that don't require Python programming, or that have dependencies that make them unsuitable for the Python notebook format.
An IDC manifest may include study and/or series GUIDs that can be resolved to the underlying DICOM instance files in GCS. Such use of GUIDs in a manifest enables a much shorter manifest compared to a list of per-instance GCS URLs. Also, as explained below, a GUID is expected to be resolvable even when the data which it represents has been moved.
In IDC, we use the term GUID to mean a persistent identifier that can be resolved to a GA4GH DrsObject. GUID persistence ensures that the data which the GUID represents can continue to be located and accessed even if it has been moved to a different hosting site.
This is a typical UUID of a (version of a) DICOM instance:
641121f1-5ca0-42cc-9156-fb5538c14355
and this is the corresponding CRDC GUID:
dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355
>> curl https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/641121f1-5ca0-42cc-9156-fb5538c14355
returns:
which is a DrsObject. Because we resolved the GUID of an instance, the access_methods in the returned DrsObject includes a URL at which the corresponding DICOM entity can be accessed.
When the GUID of a series is resolved, the DrsObject that is returned does not include access methods because there are no series file objects. Instead, the contents
component of the returned DrsObject contains the URLs that can be accessed to obtain the DrsObjects of the instances in the series.
Thus, we see that when we resolve dg.4DFC/cc9c8541-949d-48d9-beaf-7028aa4906dc
, the GUID of the series containing the instance above:
curl -o foo https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/cc9c8541-949d-48d9-beaf-7028aa4906dc
we see that the contents
component includes the GUID of that instance as well as the GUID of another instance:
Similarly, the GUID of a DICOM study resolves to a DrsObject whose contents
component consists of the GUIDs of the series in that study.
At this time, most GUIDs have not been registered with the CRDC. If such a GUID is presented to the CRDC for resolution, an HTTP 404 error is returned.
As discussed in the Organization of data section of this document, the DICOM instance file naming convention changed with IDC version 2. At this time, when an instance GUID is resolved, the returned DrsObject may include a URI pointing to the V1 GCS bucket location. Those GUIDs will be re-indexed such that in the future they point to the new GCS bucket location.
Next, open QuPath and select "File > Open".
Choose just one of the .dcm
files that belong to the desired dataset, then click Open. The remaining files will be automatically detected and should not be selected.
Zooming and panning in real time:
The Image
tab on the left side of the screen shows dimension information, and lists any associated images. In this case, a thumbnail image is present under Associated Images
at the bottom of the Image
tab. Double-clicking on Series 1 (THUMBNAIL)
will open the thumbnail image in a separate window:
For this part, we will use a slide from the HTAN-OHSU collection, identified by SeriesInstanceUID 1.3.6.1.4.1.5962.99.1.1999932010.1115442694.1655562373738.4.0. As before, you can download it as follows:
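A sketch using the idc CLI installed with idc-index (the subcommand form is an assumption):
idc download 1.3.6.1.4.1.5962.99.1.1999932010.1115442694.1655562373738.4.0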
As in the brightfield case, open QuPath and select File > Open.
Choose just one of the .dcm
files in the dataset, as the other files will be automatically detected. It does not matter which file is selected. When prompted, set the image type to Fluorescence
, or as appropriate for the dataset:
The Image
tab indicates the number of channels (12 in this case). By default, all channels will be displayed at once. This can be changed by selecting View > Brightness/Contrast
or the "half-circles" icon in the toolbar:
Unchecking the Show
box will hide the channel's data, and update the image.
Make sure you first review the section to learn about the simpler interfaces that provide access to IDC data.
SlicerIDCBrowser and idc-index
discussed in the previous section aim to provide simple interfaces for data access. In some situations, however, you may want to build cohorts using metadata attributes that are not exposed in those tools. In such cases you will need to use the BigQuery interface to form your cohort and build a file manifest that you can then use with to download the files.
Step 1: create a manifest - a list of the storage bucket URLs of the files to be downloaded. If you want to download the content of the cohort defined in the IDC Portal, , and proceed to Step 2. Alternatively, you can use BigQuery SQL as discussed below to generate the manifest;
To learn more about using Google BigQuery SQL with IDC, check out part 3 of our , which demonstrates how to query and download IDC data!
A download manifest can be created using either the IDC Portal, or by executing a BQ query. If you have generated a manifest using the IDC Portal, as discussed , proceed to Step 2! In the remainder of this section we describe creating a manifest from a BigQuery query.
The BigQuery table discussed in can be used to subset the files you need based on the DICOM metadata attributes as needed, utilizing the SQL query interface. The gcs_url
and aws_url
columns contain Google Cloud Storage and AWS S3 URLs, respectively, that can be used to retrieve the files.
You can use IDC Portal to identify items of interest, or you can use SQL queries to subset your data using any of the DICOM attributes. You are encouraged to use the to test your queries and explore the data first!
For any of the queries, you can get the count of rows to confirm that the --max_rows
parameter is sufficiently large (use the to run these queries):
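For example, to count the rows a series-level manifest query would return (a hypothetical query; the UID is a placeholder):
bq query --use_legacy_sql=false 'SELECT COUNT(DISTINCT aws_url) AS row_count FROM `bigquery-public-data.idc_current.dicom_all` WHERE SeriesInstanceUID = "<SeriesInstanceUID>"'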
is a very fast S3 and local filesystem execution tool that can be used for accessing IDC buckets and downloading files both from GCS and AWS.
Install s5cmd
following the instructions in , or if you have Python pip on your system you can just do pip install s5cmd --upgrade.
The IDC Portal provides a web-based interactive interface to browse the data hosted by IDC, visualize images, build manifests describing selected cohorts, and download images defined by the manifests.
Indexing of the collection by the Data Commons Framework is pending.
: only items corresponding to the LIDC-IDRI original collection are included
: only items corresponding to the ISPY1 original collection are included
: Some of the segmentations in this collection are empty (as an example, SeriesNumber 42100 with SeriesDescription "VOI PE Segmentation thresh=70" is empty).
sequences that have more than 15 levels of nesting are not extracted (see ) - we believe this limitation does not affect the data stored in IDC
This section of the documentation complements the tutorials available in our notebooks repository:
: all of the pathology images in IDC are in DICOM Slide Microscopy format; this notebook will help you get started with using this representation and also searching IDC pathology images.
: introduction to the key metadata accompanying IDC slide microscopy images that can be used for subsetting data and building cohorts.
From the specification:
As described in the section, a UUID identifies a particular version of an IDC data object. There is a UUID for every version of every DICOM instance, series, and study in IDC hosted data. Each such UUID can be used to form a GUID that is registered by the NCI Cancer Research Data Commons (CRDC), and can be used to access the data that defines that object.
A GUID can be resolved by appending it to this URL, which is the GUID resolution service within CRDC: . For example, the following curl
command:
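A hedged Python sketch of the same resolution step (the resolver base URL and the GUID below are placeholders/assumptions used for illustration; substitute the CRDC endpoint and GUID from the text above):

```python
# Resolve a CRDC GUID to a DrsObject and list its access URLs (illustrative sketch).
import requests

RESOLVER = "https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/"  # assumed CRDC DRS endpoint
guid = "dg.4DFC/00000000-0000-0000-0000-000000000000"  # hypothetical GUID

drs_object = requests.get(RESOLVER + guid, timeout=30).json()
for method in drs_object.get("access_methods", []):
    print(method["type"], method["access_url"]["url"])
```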
is a popular open-source desktop application for visualizing and annotating slide microscopy images. It is integrated with both OpenSlide and BioFormats libraries, and as of the current QuPath 0.5.1 version supports direct loading of DICOM Slide Microscopy images. In this tutorial you will learn how to use DICOM SM images from IDC with QuPath.
First you will need to download a sample SM image from IDC to your desktop. To identify a sample image, you can navigate to the IDC Portal and copy the SeriesInstanceUID
value for a sample SM series you want to download. Given that UID, you can download the corresponding files using idc-index
python package (see details in the documentation section describing data download).
In this tutorial, we will use a series identified by SeriesInstanceUID 1.3.6.1.4.1.5962.99.1.3140643155.174517037.1639523215699.2.0
, which you can download as follows:
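A hedged sketch of this download with the idc-index Python package (method names per recent idc-index releases; the destination directory is an example):

```python
# Fetch the example slide microscopy series to a local folder (illustrative sketch).
from idc_index import index

client = index.IDCClient()
client.download_from_selection(
    seriesInstanceUID="1.3.6.1.4.1.5962.99.1.3140643155.174517037.1639523215699.2.0",
    downloadDir="./qupath_sample",  # hypothetical local directory
)
```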
When prompted for an image type, select Brightfield H&E
(or whatever is appropriate for the dataset being opened), then click Apply
. This is a QuPath feature intended to aid in analysis, and is further described in the .
The image should now display, and can be navigated by zooming/panning as described in the .
The image should then display, and can be navigated by zooming/panning as described in the .
DICOM SR uses data elements to encode a higher level abstraction that is a tree of content, where nodes of the tree and their relationships are formalized. SR-TID1500 is one of many standard templates that define constraints on the structure of the tree, and is intended for generic tasks involving image-based measurements. DICOM SR uses standard terminologies and codes to deliver structured content. These codes are used for defining both the concept names and values assigned to those concepts (name-value pairs). Measurements include coded concepts corresponding to the quantity being measured, and a numeric value accompanied by coded units. Coded categorical or qualitative values may also be present. In SR-TID1500, measurements are accompanied by additional context that helps interpret and reuse that measurement, such as finding type, location, method and derivation. Measurements computed from segmentations can reference the segmentation defining the region and the image segmented, using unique identifiers of the respective objects.
At this time, only the measurements that accompany regions of interest defined by segmentations are exposed in the IDC Portal, and in the measurements views maintained by IDC!
Tools referenced above can be used to 1) extract qualitative evaluations and quantitative measurements from the SR-TID1500 document; 2) generate standard-compliant SR-TID1500 objects.
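To make the tree structure concrete, here is a hedged pydicom sketch that walks the content tree of an SR document and prints its numeric measurements (attribute keywords follow the DICOM SR content item definitions; the file name is a placeholder):

```python
# Walk a DICOM SR content tree and print numeric measurements (illustrative sketch).
import pydicom

def walk(items, depth=0):
    for item in items:
        name = ""
        if "ConceptNameCodeSequence" in item:
            name = item.ConceptNameCodeSequence[0].CodeMeaning
        if item.ValueType == "NUM" and "MeasuredValueSequence" in item:
            mv = item.MeasuredValueSequence[0]
            units = mv.MeasurementUnitsCodeSequence[0].CodeMeaning
            print("  " * depth + f"{name}: {mv.NumericValue} {units}")
        if "ContentSequence" in item:
            walk(item.ContentSequence, depth + 1)

ds = pydicom.dcmread("measurements_sr.dcm")  # hypothetical SR-TID1500 file
walk(ds.ContentSequence)
```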
The segmentation instance in each of the following series was excluded because its DICOM PixelData size is greater than or equal to 2 GB:
1.2.826.0.1.3680043.10.511.3.10544506665348704312902213950958190
1.2.826.0.1.3680043.10.511.3.11183783347037364699862133130586654
1.2.826.0.1.3680043.10.511.3.11834745481756047014039855874680259
1.2.826.0.1.3680043.10.511.3.11901667084519361717338400810055642
1.2.826.0.1.3680043.10.511.3.12041600048156613329793822566495651
1.2.826.0.1.3680043.10.511.3.12718116375608495830041119776887887
1.2.826.0.1.3680043.10.511.3.13386724401829265460622415500801368
1.2.826.0.1.3680043.10.511.3.14042734131864468280344737986870899
1.2.826.0.1.3680043.10.511.3.17374765903080083648409690755539184
1.2.826.0.1.3680043.10.511.3.17429002643681869326389465422353495
1.2.826.0.1.3680043.10.511.3.20359930476040698387716730891020638
1.2.826.0.1.3680043.10.511.3.28397033639127902823368316410884210
1.2.826.0.1.3680043.10.511.3.28425539132321749931109935391487352
1.2.826.0.1.3680043.10.511.3.34574227972763695321794092913087775
1.2.826.0.1.3680043.10.511.3.36216094237641867532902805456135029
1.2.826.0.1.3680043.10.511.3.39533936694797964318706337783276378
1.2.826.0.1.3680043.10.511.3.39900930856460689132625586523683939
1.2.826.0.1.3680043.10.511.3.41633795217567037218184715094985555
1.2.826.0.1.3680043.10.511.3.42218106649761752724553401155203874
1.2.826.0.1.3680043.10.511.3.49098870621170235412220976183110770
1.2.826.0.1.3680043.10.511.3.50064322235999800062455171235601125
1.2.826.0.1.3680043.10.511.3.50905421517530127976832505410705816
1.2.826.0.1.3680043.10.511.3.62935684444056080516153739948364303
1.2.826.0.1.3680043.10.511.3.73572792121235596011940904319511291
1.2.826.0.1.3680043.10.511.3.74494366757564543824303304482444570
1.2.826.0.1.3680043.10.511.3.79988146996803179892075404247166692
1.2.826.0.1.3680043.10.511.3.80004293150506819482091023564947091
1.2.826.0.1.3680043.10.511.3.82774274518897141254234567300292686
1.2.826.0.1.3680043.10.511.3.84202416467561501610598853920808906
1.2.826.0.1.3680043.10.511.3.86214492184712627544696209982376598
1.2.826.0.1.3680043.10.511.3.90193069664920622990317347485104073
1.2.826.0.1.3680043.10.511.3.95666157880521064637011880609274546
1.2.826.0.1.3680043.10.511.3.96676982370873257329281821215166082
1.2.826.0.1.3680043.10.511.3.98258035017480972315346136181769675
New pathology collections
New analysis results
Revised radiology collections
Cancer Moonshot Biobank (CMB) radiology images were updated to fix incorrect values assigned to PatientID
(see details on the collection pages linked above). The updated images have different DICOM Study/Series/SOPInstanceUIDs.
Revised analysis results
New clinical metadata tables
New radiology collections
New analysis results
Revised radiology collections
(starred collections are revised due to new or revised analysis results)
Revised pathology collections
(starred collections are revised due to new or revised analysis results)
Also added missing instance SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.3459553143.523311062.1687086765943.9.0
Removed corrupted instances
SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.2164023716.1899467316.1685791236516.37.0
SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.2411736851.773458418.1686038949651.37.0
SOPInstanceUID: 1.3.6.1.4.1.5962.99.1.2411736851.773458418.16860389
TCGA-DLBC (No description page)
New clinical metadata tables
Notes
The deprecated columns tcia_api_collection_id
and idc_webapp_collection_id
have been removed from the auxiliary_metadata
table in the idc_v18
BQ dataset. These columns were duplicates of columns collection_name
and collection_id
respectively.
New radiology collections
New analysis results
Collections analyzed:
Revised radiology collections
New clinical metadata tables
New radiology collections
New pathology collections
Revised radiology collections
New analysis results
New clinical metadata tables
New radiology collections
New pathology collections
Revised radiology collections
Revised pathology collections
New analysis results
Revised analysis results
New clinical metadata tables
New analysis results collection:
New clinical data collections:
New collections:
Updated collections:
Other:
Metadata corresponding to "limited" access collections is removed.
New clinical data collections:
Other clinical data updates:
Limited access collections are removed. Clinical metadata for the COVID-19-NY-SUB and ACRIN 6698/I-SPY2 Breast DWI collections now includes information ingested from the data dictionaries associated with these collections. In v11, the string value 'NA' was changed to null during the ETL process for some columns/collections; this is fixed in v12 and the value 'NA' is preserved.
This release introduces clinical data ingested for a subset of collections, and now available via a dedicated BigQuery dataset.
New collections:
New collections:
Updated collections:
CPTAC, TCGA and NLST collections have been reconverted due to a technical issue identified with a subset of images included in v9.
TCGA-DLBC
TCGA-KIRP: PatientID
TCGA-5P-A9KA, StudyInstanceUID
2.25.191236165605958868867890945341011875563
TCGA-BRCA: PatientID
TCGA-OL-A66H, StudyInstanceUID
2.25.82800314486527687800038836287574075736
The affected files will be included in IDC when the infrastructure limitation is addressed.
Collection access level change:
This data release introduces the concept of differential licenses to IDC: some of the collections maintained by IDC contain items that are covered by different licenses. As an example, the radiology component of the TCGA-GBM collection is covered by the TCIA limited access license, and is not available in IDC, while the digital pathology component is covered by CC-BY. With this release, we complete sharing in full of the digital pathology component of the datasets released by the CPTAC and TCGA programs.
New collections:
Updated collections:
The main highlight of this release is the addition of the NLST and TCGA Slide Microscopy imaging data. New TCGA content includes TCGA collections that are new to IDC and contain only a slide microscopy component, as well as the addition of a slide microscopy component to those IDC collections that were available earlier and included only a radiology component.
New collections
TCGA-DLBC (TCGA-DLBC collection does not have a description page)
Updated collections
The main highlight of this release is the addition of the Slide Microscopy imaging component to the remaining CPTAC collections.
New collections
Updated collections
Original collections:
Analysis results collections:
New collections:
New analysis results collections:
Updated collections:
1) CT images available as any other imaging collection (via IDC Portal, BigQuery metadata tables, and storage buckets);
3) One instance is missing from patient/study/series:
126153/1.2.840.113654.2.55.319335498043274792486636919135185299851/1.2.840.113654.2.55.262421043240525317038356381369289737801
4) Three instances are missing from patient/study/series:
215303/1.3.6.1.4.1.14519.5.2.1.7009.9004.337968382369511017896638591276/1.3.6.1.4.1.14519.5.2.1.7009.9004.180224303090109944523368212991
The DICOM Slide Microscopy (SM) images included in the collections above in IDC are not available in TCIA. TCIA only includes images in the vendor-specific SVS format!
New original collections:
New analysis results collections:
Original collections included:
Analysis collections included:
DICOM Radiotherapy Structure Sets (RTSS, or RTSTRUCT) define regions of interest by a set of planar contours.
RTSS objects can be identified by the RTSTRUCT
value assigned to the Modality
attribute, or by SOPClassUID
= 1.2.840.10008.5.1.4.1.1.481.3
.
As always, you get the most power when exploring IDC metadata using the SQL interface. As an example, the query below will select a random study that contains an RTSTRUCT series, and return a URL to open that study in the viewer:
RTSTRUCT relies on unstructured text to describe the semantics of the individual regions segmented. This information is stored in the StructureSetROISequence.ROIName
attribute. The following query will return the list of all distinct values of ROIName
and their frequency.
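A hedged sketch of such a query (the table name and the representation of StructureSetROISequence as a queryable repeated field are assumptions based on the dicom_all table described earlier in this documentation):

```python
# Tabulate distinct ROIName values across RTSTRUCT objects (illustrative sketch).
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # replace with your GCP project ID
sql = """
SELECT roi.ROIName AS ROIName, COUNT(*) AS occurrences
FROM `bigquery-public-data.idc_current.dicom_all`
CROSS JOIN UNNEST(StructureSetROISequence) AS roi
WHERE Modality = 'RTSTRUCT'
GROUP BY ROIName
ORDER BY occurrences DESC
"""
for row in client.query(sql).result():
    print(row.ROIName, row.occurrences)
```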
DICOM Segmentation object (SEG) can be identified by SOPClassUID
= 1.2.840.10008.5.1.4.1.1.66.4
Unlike most "original" image objects that you will find in IDC, SEG belongs to the family of enhanced multiframe image objects, which means that it stores all of the frames (slices) in a single object. SEG can contain multiple segments, a segment being a separate label/entity being segmented, with each segment containing one or more frames (slices). All of the frames for all of the segments are stored in the PixelData
attribute of the object.
We recommend you use one of the following tools to interpret the content of the DICOM SEG and convert it into alternative representations:
Tools referenced above can be used to 1) extract volumetrically reconstructed mask images corresponding to the individual segments stored in DICOM SEG; 2) extract segment-specific metadata describing its content; 3) generate standard-compliant DICOM SEG objects from research formats.
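As a hedged illustration of point 2 above, segment-level metadata can be inspected with plain pydicom (the file name is a placeholder; volumetric reconstruction of the mask is better left to the tools listed above):

```python
# List segments and the coded structures they represent in a DICOM SEG (illustrative sketch).
import pydicom

seg = pydicom.dcmread("segmentation.dcm")  # hypothetical DICOM SEG file
print("Number of frames:", int(seg.NumberOfFrames))
for segment in seg.SegmentSequence:
    prop = segment.SegmentedPropertyTypeCodeSequence[0]
    print(
        f"Segment {segment.SegmentNumber}: {segment.SegmentLabel} "
        f"({prop.CodeValue}, {prop.CodingSchemeDesignator}, {prop.CodeMeaning})"
    )
```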
IDC relies on DICOM for data modeling, representation and communication. Most of the data stored in IDC is in DICOM format. If you want to use IDC, you (hopefully!) do not need to become a DICOM expert, but you do need to have a basic understanding of how DICOM data is structured, and how to transform DICOM objects into alternative representations that can be used by the tools familiar to you.
This section is not intended to be a comprehensive introduction to the standard, but rather a very brief overview of some of the concepts that you will need to understand to better use IDC data.
Value Multiplicity (VM) defines the number of items of the prescribed VR that can be contained in a given data element.
It is critical to recognize that while all of the DICOM files at the high level are structured exactly in the same way and follow the same syntax and encoding rules, interpretation of the content of an individual file is dependent on the specific type of object it encodes!
When you use the IDC portal to build your cohort, unique identifiers for the object classes are mapped to their names, which are available under the "Object class" group of facets in the search interface.
We differentiate between the original and derived DICOM objects in the IDC portal and discussions of the IDC-hosted data. By Original objects we mean DICOM objects that are produced by image acquisition equipment - MR, CT, or PET images fall into this category. By Derived objects we mean those objects that were generated by means of analysis or annotation of the original objects. Those objects can contain, for example, volumetric segmentations of the structures in the original images, or quantitative measurements of the objects in the image.
Most of the images stored in IDC are saved as objects that store individual slices of the image in separate instances of a series, with the image stored in the PixelData
attribute.
As of the production release, IDC contains both radiology and digital pathology images. The following publication can serve as a good introduction to the use of DICOM for digital pathology.
Open source libraries such as DCMTK, GDCM, ITK, and pydicom can be used to parse such files and load pixel data of the individual slices. Recovering geometry of the individual slices (spatial location and resolution) and reconstruction of the individual slices into a volume requires some extra consideration.
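For example, a hedged pydicom sketch that loads a single-frame image series from a local folder and orders the slices geometrically before stacking them (the directory name is a placeholder, and decoding compressed pixel data may require additional pydicom plugins):

```python
# Load the slices of a single-frame CT/MR series and sort them along the slice normal (illustrative sketch).
import glob

import numpy as np
import pydicom

slices = [pydicom.dcmread(f) for f in glob.glob("series_dir/*.dcm")]  # hypothetical folder

# Project ImagePositionPatient onto the normal of ImageOrientationPatient to sort geometrically.
orientation = np.array(slices[0].ImageOrientationPatient, dtype=float)
normal = np.cross(orientation[:3], orientation[3:])
slices.sort(key=lambda s: float(np.dot(normal, s.ImagePositionPatient)))

volume = np.stack([s.pixel_array for s in slices])
print(volume.shape)
```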
DICOM and TIFF are two different image file formats that share many similar characteristics, and are capable of encoding exactly the same pixel data, whether uncompressed or compressed with common lossy schemes (including JPEG and JPEG 2000). This allows the pixel data to be losslessly transformed from one format to the other and back.
The DICOM file format was also deliberately designed to allow the two formats (TIFF and DICOM) to peacefully co-exist in the same file, sharing the same pixel data without expanding the file size significantly. This is achieved by leaving some unused space at the front of the DICOM file ("preamble"), which allows for the presence of a TIFF format recognition code ("magic number") and a pointer to its Image File Directory (IFD), which in turn contains pointers into the shared DICOM Pixel Data element.
The dual-personality mechanism supports both traditional strip-based TIFF organization, such as might be used to encode a single frame image, as well as the tile-based format, which is commonly used for Whole Slide Images (WSI), and which is encoded in DICOM with each tile as a frame of a "multi-frame" image.
Unlike TIFF files, which allow multiple different sized images to be encoded in the same file, DICOM does not, so there are limits to this approach. For example, though an entire WSI pyramid can be encoded in a TIFF file, the DICOM WSI definition requires each pyramid layer to be in a separate file, and all frames (tiles) within the same file to be the same size.
Most of the structural metadata that describes the organization and encoding of the pixel data is similar in DICOM and TIFF. It is copied into the tags (data elements) encoded in the respective format "headers". Biomedical-specific information, such as patient, specimen and anatomical identifiers and descriptions, as well as the acquisition technique, is generally only encoded in the DICOM data elements, there being no corresponding standard TIFF tags for it. Limited spatial information (such as physical pixel size) can be encoded in TIFF tags, but more complex multi-dimensional spatial location is standardized only in the DICOM data elements.
The dictionary of TIFF tags can be extended with application-specific entries. This has been done for various non-medical and medical applications (e.g., GeoTIFF, DNG, DEFF). Other tools have used alternative mechanisms, such as embedding text strings (Leica/Aperio SVS) or structured metadata in other formats (such as XML for OME) within a TIFF string tag (e.g., ImageDescription). This approach can be used with DICOM-TIFF dual-personality files as well, since DICOM does not restrict the content of the TIFF tags; it does, however, require updating or crafting the textual metadata to actually reflect the characteristics of the encoded pixel data.
It is hoped that the dual-personality approach may serve to mitigate the impact of limited support of one format or the other in different clinical and research tools for acquisition, analysis, storage, indexing, distribution, viewing and annotation.
One of the fundamental principles of DICOM is the use of controlled terminologies, or lexicons, or coding schemes (for the purposes of this guide, these can be used interchangeably). While using the DICOM data stored in IDC, you will encounter various situations where the data is captured using coded terms.
Controlled terminologies define a set of codes, and sometimes their relationships, that are carefully curated to describe entities for a certain application domain. Consistent use of such terminologies helps with uniform data collection and is critical for harmonization of activities conducted by independent groups.
When codes are used in DICOM, they are saved as triplets that consist of
CodeValue: unique identifier for a term
CodingSchemeDesignator: code for the authority that issued this code
CodeMeaning: human-readable code description
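For example, pydicom represents such triplets with its Code helper; the sketch below constructs the coded concept for "Breast" (the specific code value is shown for illustration):

```python
# A coded concept expressed as a (CodeValue, CodingSchemeDesignator, CodeMeaning) triplet.
from pydicom.sr.coding import Code

breast = Code(value="76752008", scheme_designator="SCT", meaning="Breast")
print(breast.value, breast.scheme_designator, breast.meaning)
```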
The following white papers are intended to provide explanation and clarification into applying DICOM to encoding specific types of data.
Components on the left side of the page give you controls for configuring your selection:
Panels on the right side will automatically update based on what you select on the left side!
Selection configuration reflects the active search scope/filters in the Cohort Filters section. You can download all of the studies that match your filters. Below you will see the Cart section. Cart is helpful when selecting data by individual filters is too imprecise, and you want to have more granular control over your selection by selecting specific collections/patients/studies/series.
Filtering results section consists of the tables containing matching content that you can navigate following IDC Data model: first table shows the matching collections, selecting a collection will list matching cases (patients), selection of a case will populate the next table listing matching studies for the patient, and finally selecting a study will expand the final table with the list of series included in the study.
In the following sections of the documentation you will learn more about each of the items we just discussed.
The DICOM data model is implicit, and is not defined in a machine-readable structured form by the standard!
In this section we discuss derived DICOM objects, including annotations, that are stored in IDC. It is important to recognize that, in practice, annotations are often shared in non-standard formats. When IDC ingests a dataset where annotations are available in such a non-standard representation, those need to be harmonized into a suitable DICOM object to be available in IDC. Due to the complexity of this task, we are unable to perform such harmonization for all of the datasets. If you want to check if there are annotations in non-DICOM format available for a given collection, you should locate the original source of the data, and examine the accompanying documentation for available non-DICOM annotations.
Non-standard annotations are not searchable, usually cannot be visualized in off-the-shelf tools, and require custom code to interpret and parse. The situation is different for the DICOM derived objects that we discuss in the following sections.
In IDC we define "derived" DICOM objects as those that are obtained by analyzing or post-processing the "original" image objects. Examples of derived objects include annotations of the images that define image regions or describe findings about those regions, and voxel-wise parametric maps calculated from the original images.
Although the DICOM standard provides a variety of mechanisms that can be used to store specific types of derived objects, most of the image-derived objects currently stored in IDC fall into the following categories:
The type of the object is defined by the object class unique identifier stored in the SOPClassUID
attribute of each DICOM object. In the IDC Portal we allow the user to define the search filter based on the human-readable name of the class instead of the value of that identifier.
You can find detailed descriptions of these objects applied to specific TCIA datasets in the following open access publications:
The open source DCMTK toolkit can be used to render the content of the DICOM SR tree in a human-readable form (you can see one example of such rendering ). Reconstructing this content using tools that operate with DICOM content at the level of individual attributes can be tedious. We recommend the tools referenced above, which also provide capabilities for reading and writing SR-TID1500 content:
: high-level DICOM abstractions for the Python programming language
: open source DCMTK-based C++ library and command line converters that aim to help with the conversion between imaging research formats and the standard DICOM representation for image analysis results
: C++ library that provides API abstractions for reading and writing SR-TID1500 documents
SR-TID1500-specific metadata attributes are available in the table views maintained by IDC. See details .
Data hosted by IDC is ingested from several sources, including , , and .
Please refer to the license and terms of use, which are defined in the license_url
and source_doi
columns of the IDC BigQuery . You can filter the data by license type in the .
Collections analyzed:
Collections analyzed:
WARNING: After the release of v20, it was discovered that a mistake had been made during data conversion that affected the newly-released segmentations accompanying the "RMS-Mutation-Prediction" collection. Segmentations released in v20 for this collection have the segment labels for alveolar rhabdomyosarcoma (ARMS) and embryonal rhabdomyosarcoma (ERMS) switched in the metadata relative to the correct labels. Thus segment 3 in the released files is labelled in the metadata (the SegmentSequence) as ARMS but should correctly be interpreted as ERMS, and conversely segment 4 in the released files is labelled as ERMS but should correctly be interpreted as ARMS. We apologize for the mistake and any confusion that it has caused, and will be releasing a corrected version of the files in the next release as soon as possible.
Collections analyzed:
Collections analyzed:
Collections analyzed:
Collections analyzed:
* Collections analyzed:
** Collections analyzed:
(revisions only to clinical data)
**
(fix PatientAges > 090Y)
(fix PatientAges > 090Y)
*
(All TCGA revisions are to correct multiple manufacturer values within same series)
Collections analyzed:
(TCIA description: “Repair of DICOM tag (0008,0005) to value "ISO_IR 100" in 79 series”)
(Revised because results from CPTAC-CRCC-Tumor-Annotations were added)
(Revised because results from CPTAC-UCEC-Tumor-Annotations were added)
(Revised because results from CPTAC-PDA-Tumor-Annotations were added)
(ICDC-Glioma radiology added in a previous version)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Radiology modality data cleanup to remove extraneous scans.”)
(TCIA description: “Added DICOM version of MED_ABD_LYMPH_MASKS.zip segmentations that were previously available”)
(Revised because QIBA-VolCT-1B analysis results were added)
(Revised because analysis results from nnU-Net-BPR-Annotations were revised)
(Revised because analysis results from nnU-Net-BPR-Annotations were revised)
(11 pathology-only patients removed at request of data owner)
(1 pathology-only patient removed at request of data owner)
(Analysis of NLST and NSCLC-Radiomics)
(Annotations of NLST and NSCLC-Radiomics radiology)
This release does not introduce any new data, but changes the bucket organization and introduces replication of IDC files in Amazon AWS storage buckets, as described in .
In this release we introduce the new HTAN program, currently including three collections released by the .
*
*
Note that the TCGA-KIRP and TCGA-BRCA collections (marked with an asterisk in the list above) are currently missing SM high resolution layer files/instances due to a limitation of Google Healthcare that makes it impossible to ingest datasets that exceed certain internal limits. Specifically, the following patient/studies are affected:
is now available as a public access collection
The following collections became limited access due to the , which is the original source of those collections.
Outcome Prediction in Patients with Glioblastoma by Using Imaging, Clinical, and Genomic Biomarkers: Focus on the Nonenhancing Component of the Tumor ()
DICOM-SEG Conversions for TCGA-LGG and TCGA-GBM Segmentation Datasets ()
is added. The data included consists of the following components:
2) a subset of clinical data available in the BigQuery tables starting with nlst_
under the idc_v4
dataset, as documented in the section.
The following radiology collections were updated to include DICOM Slide Microscopy (SM) images converted from the original vendor-specific representation into .
Listed below are all of the and collections currently hosted by IDC, with links to the Digital Object Identifiers (DOIs) of those collections.
Listed below are all of the and collections currently hosted by IDC, with links to the Digital Object Identifiers (DOIs) of those collections.
(only items corresponding to the LIDC-IDRI original collection are included)
(only items corresponding to the ISPY1 original collection are included)
If you use the IDC Portal, you can select cases that include RTSTRUCT objects by selecting "Radiotherapy Structure Set" in the "Modality" section under the "Original" tab (). Here is an example study that contains an RTSS series.
We recommend tool for converting planar contours of the individual structure sets into volumetric representation.
If you use the IDC Portal, you can select cases that include SEG objects by selecting "Segmentations" in the "Modality" section () under the "Original" tab. Here is an example study that contains a SEG series.
You can further explore segmentations available in IDC via the "Derived" tab of the Portal by filtering them by specific types and anatomic locations. As an example, such a filter will select cases that contain segmentations of a nodule.
Metadata describing the segments is contained in the SegmentSequence
of the DICOM object, and is also available in the BigQuery table view maintained by IDC. That table contains one row per segment, and for each segment includes metadata such as the algorithm type and the structure segmented.
: open source DCMTK-based C++ library and command line converters that aim to help with the conversion between imaging research formats and the standard DICOM representation for image analysis results
: high-level DICOM abstractions for the Python programming language
: C++ library that provides API abstractions for reading and writing SEG objects
SEG-specific metadata attributes are available in the table views maintained by IDC. See details .
Digital Imaging and Communications in Medicine (DICOM): A Practical Introduction and Survival Guide, 2nd Edition by Pianykh, Oleg S. published by Springer (2011).
As discussed in , the main mechanism for accessing the data stored in IDC is by using the storage buckets that contain individual files indexed through other interfaces. Each of the files in the IDC-maintained storage buckets encodes a DICOM object. Each DICOM object is a collection of data elements or attributes. Below is an example of a subset of attributes in a DICOM object, as generated by the IDC OHIF Viewer (which can be toggled by clicking the "Tag browser" icon in the IDC viewer toolbar):
The standard defines constraints on what kind of data each of the attributes can contain. Every single attribute defined by the standard is listed in the , which defines those constraints:
Value Representation (VR) defines the type of data that a data element can contain. There are 27 DICOM VRs, and they are defined in .
What attributes are included in a given object is determined by the type of object (or, to follow the DICOM nomenclature, Information Object). is dedicated to the definitions (IODs) of those objects.
How do you know what object is encoded in a given file (or instance of the object, using the DICOM lingo)? For this purpose there is an attribute SOPClassUID
that uniquely identifies the class of the encoded object. The content of this attribute is not easy to interpret, since it is a unique identifier. To map it to the specific object class name, you can consult the complete list of object classes available in .
A somewhat related attribute that hints at the type of object is Modality
, which is defined by the standard as "Type of equipment that originally acquired the data used to create the images in this Series", and is expected to take one of the values from . However, Modality
is not equivalent to SOPClassUID
, and should not be used as a substitute. As an example, data derived from images of the original modality may be saved as a different object class while keeping the Modality value unchanged.
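A hedged pydicom sketch of the distinction (the file name is a placeholder; the UID-to-name mapping relies on pydicom's built-in UID dictionary):

```python
# Compare the object class (SOPClassUID) with the Modality attribute (illustrative sketch).
import pydicom
from pydicom.uid import UID

ds = pydicom.dcmread("instance.dcm", stop_before_pixels=True)  # hypothetical file
print("SOP Class:", UID(ds.SOPClassUID).name)  # e.g. "Segmentation Storage"
print("Modality :", ds.Modality)               # e.g. "SEG"
```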
Herrmann, M. D., Clunie, D. A., Fedorov, A., Doyle, S. W., Pieper, S., Klepeis, V., Le, L. P., Mutter, G. L., Milstone, D. S., Schultz, T. J., Kikinis, R., Kotecha, G. K., Hwang, D. H., Andriole, K. P., John Lafrate, A., Brink, J. A., Boland, G. W., Dreyer, K. J., Michalski, M., Golden, J. A., Louis, D. N. & Lennerz, J. K. Implementing the DICOM standard for digital pathology. J. Pathol. Inform. 9, 37 (2018).
: command-line tool to convert neuroimaging data from the DICOM format to the NIfTI format
: open source software for image computation, which includes
: python library providing API and command-line tools for converting DICOM images into NIfTI format
: python interface to the , includes .
Clunie, D. A. Dual-Personality DICOM-TIFF for Whole Slide Images: A Migration Technique for Legacy Software. J. Pathol. Inform. 10, 12 (2019).
The IDC Portal provides a web-based interactive interface to browse the data hosted by IDC, visualize images, build manifests describing selected cohorts, and download images defined by the manifests.
DICOM relies on various sources of codes, all of which are listed in of the standard.
As an example, if you query the view with the following query in the BQ console:
You will see columns that contain coded attributes of the segment. In the example below, the value of AnatomicRegion
corresponding to the segment is assigned the value (T-04000, SRT, Breast), where "SRT" is the coding scheme designator corresponding to the SNOMED-RT coding scheme.
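As a hedged sketch (the segmentations view name follows the idc_current dataset naming, and the column list is an assumption apart from AnatomicRegion mentioned above):

```python
# Peek at coded segment-level attributes in the IDC segmentations view (illustrative sketch).
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # replace with your GCP project ID
sql = """
SELECT SOPInstanceUID, AnatomicRegion
FROM `bigquery-public-data.idc_current.segmentations`
LIMIT 10
"""
for row in client.query(sql).result():
    print(row.SOPInstanceUID, row.AnatomicRegion)
```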
As another example, quantitative and qualitative measurements extracted from the SR-TID1500 objects are stored in the and views, respectively. If we query those views to see the individual measurements, they also show up as coded items. Each of the quantitative measurements includes a code describing the quantity being measured, the actual numeric value, and a code describing the units of measurement:
Comments and questions regarding those white papers are welcomed from the community! Please ask any related questions on , or by adding comments directly in the documents referenced below:
, 2020
Search scope allows you to limit your search to just the specific programs, collections and analysis results (as discussed in the documentation of the ).
Search configuration gives you access to a small set of metadata attributes to select DICOM studies (where "DICOM studies" fit into IDC data model is also discussed in the page) that contain data that meets the search criteria.
DICOM defines its own model to map relevant entities from the real world. That model, as , is shown in the figure below.
DICOM data model entities do not always map to DICOM objects! In fact, every DICOM object you will ever encounter in IDC will contain attributes describing various properties of the entries at different levels of the real world data model. Such objects are called Composite Information Objects. The of the Composite Information Object Definitions is shown below, and covers all of the composite objects defined by the standard.
As can be observed from this diagram, "each Composite Instance IOD [Entity-Relationship] Model requires that all Composite Instances that are part of a specific Study shall share the same context. That is, all Composite Instances within a specific Patient Study share the same Patient and Study information; all Composite Instances within the same Series share the same Series information; etc." ().
Each of the boxes in the diagram above corresponds to an Information Entity (IE), which in turn is composed of Information Modules. Information Modules group attributes that are related. As an example, the Patient IE includes the Patient Module, which in turn includes such attributes as PatientID, PatientName, and PatientSex.
Make sure you complete the IDC notebooks to get introduced to IDC data organization, download, visualization and other first-order topics.
As an example, the collection is available in IDC. If you mouse over the name of that collection in the IDC Portal, the tooltip will provide the overview of the collection and the link to the source.
You will also find the link to the source in the .
Finally, if you select data using SQL, you can use the source_DOI
and/or the source_URL column to identify the source of each file in the subset you selected (learn more about source_DOI
, licenses and attribution in part 3 of our ).
For the collection in question, the source DOI is , and on examining that page you will see a pointer to the CSV file with the coordinates of the bounding boxes defining regions containing lesions.
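As a hedged illustration of the SQL route (column names follow the dicom_all table and the license_url column mentioned elsewhere in this documentation; the collection_id value is a placeholder):

```python
# List the source DOI/URL and license for every file of a collection (illustrative sketch).
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # replace with your GCP project ID
sql = """
SELECT DISTINCT source_DOI, source_URL, license_url
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = 'example_collection'  -- hypothetical collection_id
"""
for row in client.query(sql).result():
    print(row.source_DOI, row.source_URL, row.license_url)
```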
voxel segmentations stored as DICOM Segmentation objects
segmentations defined as a set of planar regions stored as DICOM Radiotherapy Structure Sets
quantitative measurements and qualitative evaluations for the regions defined by DICOM Segmentations; these are stored as a specific type of DICOM Structured Report object that follows the DICOM SR TID1500 template (SR-TID1500)
Fedorov, A., Clunie, D., Ulrich, E., Bauer, C., Wahle, A., Brown, B., Onken, M., Riesmeier, J., Pieper, S., Kikinis, R., Buatti, J. & Beichel, R. R. DICOM for quantitative imaging biomarker development: a standards based approach to sharing clinical data and structured PET/CT analysis results in head and neck cancer research. PeerJ 4, e2057 (2016).
Fedorov, A., Hancock, M., Clunie, D., Brochhausen, M., Bona, J., Kirby, J., Freymann, J., Pieper, S., J W L Aerts, H., Kikinis, R. & Prior, F. DICOM re-encoding of volumetrically annotated Lung Imaging Database Consortium (LIDC) nodules. Med. Phys. (2020).
The OHIF and Slim viewers do not support 32-bit browsers.
The main functions of the viewer are available via the toolbar controls shown below.
The functionality supported by those tools should be self-explanatory, or can be discovered via quick experimentation.
IDC Viewer supports visualization of DICOM Segmentation objects (SEG) and DICOM Radiotherapy Structure Sets (RTSTRUCT). When available in a given study, you will see those modalities labeled as such in the left-hand panel of the viewer, as shown below. To see a specific SEG or RTSTRUCT, double-click on the corresponding thumbnail. After that you can open the RTSTRUCT/SEG panel in the upper right corner to jump to the locations of the specific structure sets or segments, and to control their individual visibility.
Note that certain modalities, such as Segmentation (SEG) and Real World Value Mapping (RWVM) objects, cannot be selected for visualization from the IDC Portal. SEG can only be viewed in the context of the image series segmented, and RWVM series are not viewable and will not show up in the left panel of the viewer.
Below is an example of series objects that are not viewable at the series level.
The IDC pathology viewer allows for interactive visualization of digital slide microscopy (SM) images.
Here are some specific examples, taken from the IDC Portal dashboard:
You can share the viewer URLs if you want to refer to visualizations of the specific items from IDC. You can also use this functionality if you want to visualize specific items from your notebook or a custom dashboard (e.g., a Google DataStudio dashboard).
If you want to visualize your own images, or if you would like to combine IDC images with the analysis results or annotations you generated, you do have several options:
Numbers in the grayed ovals next to the search filters indicate the total number of cases (patients) that have the specific attribute
Click on the "i" button to toggle information panel about the individual items in the search panels
Cohort filters panel: get the shareable URL for the current selection by clicking "URL" button in the Cohort Filters panel
Get the manifest for downloading all of the matching studies by clicking "Manifest" button in the Cohort Filters panel
Search results are updated dynamically based on the search configuration. At any time you can expand the items on the right to explore the selected collections, cases, studies and series.
Studies and series tables include the button to open those in the browser-based image viewer.
Our DICOMWeb endpoint should only be used when data access needs cannot be satisfied using other mechanisms (e.g., when accessing individual frames of the microscopy images without having to download the entire binary file).
Egress of data via the DICOMweb interface is capped at a non-disclosed limit that is tracked per IP. It is not acceptable to “IP hop” in an attempt to circumvent individual daily quotas, since there is also a global daily cap to prevent full egress of the imaging collection. Note that if this global cap is hit, all other users of the site would be unable to use the viewers for the rest of the day (using the UTC clock). Thus, IP hopping against the proxy that causes the global quota to be hit will be considered a denial-of-service attack.
If you reach your daily quota, but feel you have a compelling cancer imaging research use case to request an exception to the policy and an increase in your daily quota, please reach out to us at support@canceridc.dev to discuss the situation.
We are continuously monitoring the usage of the proxy. Depending on the actual costs and usage, this policy may be revisited in the future to restrict access via the DICOMweb interface for any uses other than IDC viewers.
The IDC Portal offers a lot of flexibility in selecting items to download. In all cases, download of data from the IDC Portal is a two-step process:
Select items and export a manifest corresponding to your selection.
"IDC manifest" is a text file that contains URLs to the files in cloud buckets that correspond to your selection. It will contain one line for each DICOM series, as IDC files are organized in series-level folders in the cloud storage.
You will see "Cart" icon in the search results collections/cases/studies/series tables. Any of the items in these tables can be added to the cart for subsequent downloading of the corresponding files.
Get the manifest for the cart content using the "Manifest" button in the Cart panel.
Clicking the "Manifest" button in the "Cohort Filters" panel will give you the manifest for all of the studies that match your current selection criteria.
The studies table contains a button for downloading a manifest that references the files in the given study. To download a single series, no manifest is needed: you will see the command line to run to perform the download.
If you would like to download the entire study, or the specific image you see in the image viewer, you can use the download button in the viewer interface.
The version of the viewer is available from the "About" menu for the OHIF (radiology) viewer, and the "Get app info" menu for the Slim (pathology) viewer. Both of those menus are in the upper right corner of the window.
The final OHIF v2 published version is 4.12.45. Upstream changes based on v2 will be accessible through the v2-legacy branch (will not be published to NPM).
Main highlights from v2-legacy since 4.12.45:
Fix high and critical dependency issues reported by dependabot
Update SEG tolerance popup and update SEG thumbnail warning: Jump to first segment item image and show warning message only once on onChange events
Update to issues and PR templates
Address segmentation visibility toggle applied to all segmentations instead of the active one only
Update dcmjs version so it throws 'Failed to find the reference image in the source data. Cannot load this segmentation' error instead of logging a warning to console
Address eye icon for segment not shown when segment name is long
Change message for segmentation when it fails to load due to orientation tolerance
Main highlights of this release include:
Handle missing ReferencedInstanceSequence attribute: Update parsing logic to consider attribute as optional.
Main highlights of this release include:
Remove unused code from DICOM SR parsing: Remove referencedImages attribute from SR display sets. Within TID 1500, sub-template TID 1600 (Image Library) is not required when parsing SR for image references for annotations and planar measurements. The same information is obtained from sub-template TID 1501 > TID 300 > TID 320.
Main highlights of this release include:
Update message for segmentation error loading due to orientation tolerance
Main highlights of this release include:
Correct Parsing Logic for Qualitative Instance Level SR
Main highlights of this release include:
Fix 2d MPR rendering issue for the sagittal view
New Features
Support configuration of multiple origin servers for different types of DICOM objects (SOP Storage Classes)
Enhancements
Improved error handling
Check Pyramid UID (if available) when grouping images into digital slides
Bug Fixes
Use Acquisition UID (if available) to group images into digital slides
Main highlights of this release include:
New features
Add new tool to go to specific slide position;
Show mouse position in slide coordinate system.
Enhancements
Improve performance of translations between image and slide coordinates;
Automatically adjust size of overview image to size of browser window.
Bug fixes
Fix rendering of label image;
Show error message when creation of viewer fails;
Fix resolution of overview image;
Fix styling of point annotations;
Ensure bounding box annotations are axis aligned;
Add missing keyboard shortcut for navigation tool.
Main highlights of this release include:
Fix parsing of qualitative slice annotation;
Disable measurements panel interactions in MPR mode;
Fix parsing of segmentation when orientation values are close to zero;
Raise error if a frame's StudyInstanceUID, SeriesInstanceUID or SOPInstanceUID does not conform to the UID (DICOM UI VR) character repertoire;
Implement runtime tolerance for SEG loading retries;
Fix popup notification behavior;
Update cornerstoneWADOImageLoader.
Main highlights of this release include:
New features
Add panel for clinical trial information to case viewer;
Sort digital slides by Container Identifier attribute.
Enhancements
Reset style of optical paths to default when deactivating presentation state.
Bug fixes
Fix rendering of ROI annotations by upgrading to React version 1;
Correctly update UIDs of visible/active optical paths;
Fix type declarations of DICOMweb search resources.
Main highlights of this release include:
Add support for SR qualitative annotation per instance.
Main highlights of this release include:
New features
Support DICOM Advanced Blending Presentation State to parametrize the display of multiplexed IF microscopy images;
Add key bindings for annotations tools;
Enable configuration of tile preload;
Enable configuration of annotation geometry type per finding;
Expose equipment metadata in user interface.
Enhancements
Improve default presentation of multiplexed IF microscopy images in the absence of presentation state instances;
Correctly configure DCM4CHEE Archive to use reverse proxy URL prefix for BulkDataURI in served metadata;
Enlarge display settings interfaces and add input fields for opacity, VOI limits, and colors;
Update dicom-microscopy-viewer version to use web workers for frame decoding/transformation operations;
Add button for user logout;
Disable optical path selection when a presentation state has been selected.
Bug fixes
Fix parsing of URL path upon redirect after successful authentication/authorization;
Fix configuration of optical path display settings when switching between presentation states;
Fix caching of presentation states and for selection via drop-down menu.
Security
Update dependencies with critical security issues.
Main highlights of this release include:
Enhancements
Make overview panel collapsible and hide it entirely if lowest-resolution image is too large.
Bug fixes
Fix update of optical path settings when switching between slides.
Main highlights of this release include:
Fix regression in logic for finding a segmentation's referenced source image;
Fix segmentations loading issues;
Fix thumbnail series type for unsupported SOPClassUID;
Fix toolbar error when getDerivedDatasets finds no referenced series.
Main highlights of this release include:
New features
Display of analysis results stored as DICOM Segmentation, Parametric Map, or Microscopy Bulk Simple Annotations instances;
Dynamic selection of DICOMweb server by user (can be enabled by setting AppConfig.enableServerSelection to true);
Dark app mode for fluorescence microscopy (can be enabled by setting App.mode to "dark");
Support display of segments stored in DICOM Segmentation instances;
Support display of parameter mappings stored in DICOM Parametric Map instances;
Support display of annotation groups stored in DICOM Microscopy Bulk Simple Annotations instances;
Implement color transformations using ICC Profiles to correct color images client side in a browser-independent manner;
Implement grayscale transformations using Palette Color Lookup Tables to pseudo-color grayscale images.
Improvements
Unify handling of optical paths for color and grayscale images;
Add loading indicator;
Improve styling of overview map;
Render specimen metadata in a more compact form;
Improve fetching of WASM library code;
Improve styling of slide viewer sidebar;
Sort slides by Series Number;
Work around common standard compliance issues;
Update docker-compose configuration;
Upgrade dependencies;
Show examples in README;
Decode JPEG, JPEG 2000, and JPEG-LS compressed image frames client side in a browser-independent manner;
Improve performance of transformation and rendering operations using WebGL for both grayscale as well as color images;
Optimize display of overview images and keep overview image fixed when zooming or panning volume images;
Optimize HTTP Accept header field for retrieval of frames to work around issues with various server implementations.
Bug fixes
Ensure ROI annotations are re-rendered upon modification;
Clean up memory and recreate viewers upon page reload;
Fix selection of volume images;
Fix color space conversion during decoding of JPEG 2000 compressed image frames;
Fix unit of area measurements for ROI annotations;
Publish events when bulkdata loading starts and ends.
Main highlights of this release include:
Improve logic for finding a segmentation's referenced source image;
Improve debug dialog: fix text overflow and add the active viewport's referenced SEG and RTSTRUCT series.
Main highlights of this release include:
Fix fail to load SEG related to geometry assumptions;
Fix fail to load SEG related to tolerance;
Add initial support for SR planar annotations.
Main highlights of this release include:
Bug fixes
Fix selection of VOLUME or THUMBNAIL images with different Photometric Interpretation.
Main highlights of this release include:
Fix RTSTRUCT right panel updates;
Fix SEG loading regression.
Main highlights of this release include:
Fix handling of datasets with unsupported modalities;
Fix backward fetch of images for the current active series.
Fix tag browser slider.
Main highlights of this release include:
Bug fixes
Rotate box in overview map outlining the extent of the current view together with the image.
Main highlights of this release include:
Fix segmentation/rtstruct menu badge update when switching current displayed series;
Add a link icon to series thumbnails if they are connected to any annotation (segmentation, etc.);
Fix problems opening series when the study includes many series;
Fix segments visibility handler.
Main highlights of this release include:
Improvements
Include images with new flavor THUMBNAIL in image pyramid;
Properly fit overview map into HTML element and disable re-centering of overview map when user navigates main map;
Allow drawing of ROIs that extend beyond the slide coordinate system (i.e., allow negative ROI coordinates).
Bug fixes
Prevent display of annotation marker when ROI is deactivated
Main highlights of this release include:
Fix issues with segmentation orientations;
Fix display of inconsistencies warning for segmentation thumbnails;
Fix throttle thumbnail progress updates.
Main highlights of this release include:
Bug fixes
Set PUBLIC_URL in Dockerfile.
Main highlights of this release include:
Improvements
Add button to display information about application and environment;
Add ability to include logo;
Verify content of SR documents before attempting to load annotations;
Improve re-direction after authentication;
Add retry logic and error handlers for DICOMweb requests;
Improve documentation of application configuration in README;
Add unit tests.
Bug fixes
Disable zoom of overview map;
Fix pagination of worklist;
Prevent delay in tile rendering.
Main highlights of this release include:
Handle uncaught exception for non-TID 1500 SR;
Added display of badge numbers in the segmentation / rtstruct panel tabs;
Study prefetcher with loading bar.
Main highlights of this release include:
New features
Support for multiplexed immunofluorescence slide microscopy imaging;
Client-side additive blending of multiple channels using WebGL;
Client-side decoding of compressed frame items using WebAssembly based on Emscripten ports of libjpeg-turbo, openjpeg, and charls C/C++ libraries.
Improvements
Continuous integration testing pipeline using circle CI;
Deploy previews for manual regression testing.
Major changes
Introduce new configuration parameter renderer.
Main highlights of this release include:
Add exponential backoff and retry after 500 error;
Update to HTML SR viewport to display missing header tags.
Initial Release.
Main highlights of this release include:
Add disable server cache feature;
Additional improvements on series inconsistencies report UI.
Main highlights of this release include:
Add acquisition storage SR sopclass to SR html ext;
Fix missing items in the segmentation combobox items at loading;
Fix slices not being sorted in geometrical order;
Extend series inconsistencies checks to segmentation and improve UI.
Main highlights of this release include:
Add new log service to be used by debugger extension;
Add UI to communicate to the users inconsistencies within a single series;
Add time in the dates of the items of the segmentation combobox list;
Order segmentation combobox list in reverse time order;
Fix failure to load a valid SEG object because of incorrect expectations about ReferencedSegmentNumber;
Fix RTSTRUCT menu visibility when loading a series;
Fix image load slowness regression;
Fix choppy scrolling in 2D mode;
Fix failure to load segmentations when filtering study with '?seriesInstanceUID=' syntax.
Main highlights of this release include:
Replace instance dropdown to slider for dicom tag browser;
Add error page and not found pages if failed to retrieve study data.
Main highlights of this release include:
Add UI error report for MPR buffer limit related errors;
Add UI error report for hardware acceleration turned off errors;
Add IDC funding acknowledgment;
Fix RTSTRUCT menu panel undefined variables;
Fix RTSTRUCT menu visibility when loading a series;
Fix segments visibility control (SEG menu) bugs.
Main highlights of this release include:
Visualize overlapping segments;
Use runtime value configuration to get pkg version;
Fix navigation issues in the crosshair tool.
Main highlights of this release include:
Add MPR crosshair tool.
The IDC API is based on IDC Data Model concepts. Several of these concepts have been previously introduced in the context of the IDC Portal. We discuss these concepts here with respect to the IDC API.
As described previously, IDC data is versioned such that searching an IDC version according to some criteria (some filter set as described below) will always identify exactly the same set of DICOM objects.
The GET /versions API endpoint returns a list of the current and previous IDC data versions.
An original collection is a set of DICOM data provided by a single source. (We usually just use "collection" to mean an original collection.) Such collections are comprised primarily of DICOM image data that was obtained from some set of patients. However, some original collections also include annotations, segmentations or other analyses of the image data in the collection. Typically, the patients in a collection are related by a common cancer type, though this is not always the case.
The GET /collections endpoint returns a list of the original collections, in the current IDC version. Some metadata about each collection is provided.
Analysis results are comprised of DICOM data that was generated by analyzing data in one or more original collections. Typically such analysis is performed by a different entity than that which provided the original collection(s) on which the analysis is based. Examples of data in analysis collections include segmentations, annotations and further processing of original images.
Because a DICOM instance in an analysis result is "in" the same series and study as the DICOM instance data of which it is an analysis result, it is also "in" the same patient, and therefore is considered to be "in" the same collection.
Specifically, each instance in IDC data has an associated collection_id. An analysis result will have the same collection_id as the original collection of which it is an analysis result.
The GET /analysis_results endpoint returns a list of the analysis results, with some metadata, in the current IDC version.
A filter set selects some set of DICOM objects in IDC hosted data, and is a set of conditions, where each condition is defined by an attribute and an array of values. An attribute identifies a field (column) in some data source (BQ table). Each filter set also includes the IDC data version upon which it operates.
Filter sets are JSON encoded. Here is an example filter set:
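(The following is a reconstruction, consistent with the selection described below; the ebtwe suffix, denoting an inclusive range, is an illustrative choice.)
{
  "collection_id": ["TCGA_KIRC", "TCGA_LUAD"],
  "Modality": ["CT", "MR"],
  "age_at_diagnosis_ebtwe": [65, 75]
}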
A filter set selects a DICOM instance if, for every attribute in the filter set, the instance's corresponding value satisfies one or more of the values in the associated array of values. This is explained further below.
For example, the (attribute, [values]) pair ("Modality", ["MR", "CT"]) is satisfied if an instance "has" a Modality of MR or CT.
Note that if a filter set includes more than one (attribute, [values]) pair having the same attribute, then only the last such (attribute, [values]) pair is used. Thus if a filter group includes the (attribute, [values]) pairs ("Modality", ["MR"]) and ("Modality", ["CT"]), in that order, only ("Modality", ["CT"]) is used.
The filter set above will select any instance in the current IDC version that is in the TCGA_KIRC or TCGA_LUAD collections. To be selected by the filter, an instance must also have a Modality of CT or MR, and an age_at_diagnosis value between 65 and 75.
Because of the hierarchical nature of DICOM, if a filter set selects an instance, it implicitly selects the series, study, patient and collection which contain that instance. A manifest can be configured to return data about some or all of these entities.
Note that when defining a cohort through the API, the IDC version is always the current IDC version.
IDC maintains a set of GCP BigQuery (BQ) tables containing various types of metadata that together describe IDC data.
In the context of the API, a data source (or just source) is a BQ table that contains some portion of the metadata against which a filter set is applied. An API query to construct a manifest is performed against one or more such tables as needed.
Both the IDC Web App and API expose selected fields against which queries can be performed. The /filters endpoint returns the available filter attributes. The /filters/values/{filter} endpoint returns a list of the values which a specified Categorical String or Categorical Numeric filter attribute will match. Each attribute has a data type, one of:
String: An attribute with data type String may have an arbitrary string value. For example, the possible values of a StudyDescription attribute are arbitrary. An object is selected if its String attribute matches any of the values in the values array. Matching is insensitive to the case (upper case, lower case) of the characters in the strings. Thus ("StudyDescription", ["PETCT Skull-Thigh"]) will match a StudyDescription containing the substring "PETCT SKULL-THIGH", or "petct skull-thigh" etc. Pattern matching in String attributes is also supported. The ("StudyDescription", ["%SKULL%", "ABDOMEN%", "%Pelvis"]) filter will match any StudyDescription that contains "SKULL", "skull", "Skull", etc., starts with "ABDOMEN", "abdomen", etc., or ends with "Pelvis", "PELVIS", etc.
Categorical String An attribute with data type Categorical String will have one of a defined set of string values. For example, Modality is a Categorical String attribute that has possible values 'CT', 'MR', 'PT', etc. Categorical String attributes have the same matching semantics as Strings. The /filters/values/{filter} endpoint returns a list of the values accepted for a specified Categorical String attribute (filter).
Categorical Numeric An attribute with data type Categorical Numeric has one of a defined set of numeric values. The corresponding value array must have a single numeric value. The (attribute, value array) pair for a Categorical Numeric is satisfied if the attribute is equal to the value in the value array. The /filters/values/{filter} endpoint returns a list of the values accepted for a Categorical Numeric attribute (filter).
Ranged Integer An attribute with data type Ranged Integer will have an integer value. For example, age_at_diagnosis is an attribute of data type Ranged Integer. In order to enable relative numeric queries, the API exposes nine variations of each Ranged Integer attribute as filter attribute names. These variations are the base attribute name with one of the suffixes: eq, gt, gte, btw, btwe, ebtw, ebtwe, lte, or lt, e.g. age_at_diagnosis_eq. The value array of the btw, btwe, ebtw, and ebtwe variations must contain exactly two integer values, in numeric order (least value first). The value array of the eq, gt, gte, lte, and lt variations must contain exactly one integer value. The (attribute, value array) pair for a Ranged Integer attribute is satisfied according to the suffix as follows (see the example after this list):
eq: If an attribute is equal to the value in the value array
gt: If an attribute is greater than the value in the value array
gte: If an attribute is greater than or equal to the value in the value array
btw: if an attribute is greater than the first value and less than the second value in the value array
ebtw: if an attribute is greater than or equal to the first value and less than the second value in the value array
btwe: if an attribute is greater than the first value and less than or equal to the second value in the value array
ebtwe: if an attribute is greater than or equal to the first value and less than or equal to the second value in the value array
lte: If an attribute is less than or equal to the value in the value array
lt: If an attribute is less than the value in the value array
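For example, assuming the age_at_diagnosis attribute, the following illustrative (attribute, [values]) pairs select ages greater than 65, and ages in the inclusive range 65 to 75, respectively:
("age_at_diagnosis_gt", [65])
("age_at_diagnosis_ebtwe", [65, 75])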
Ranged Number An attribute with data type Ranged Number will have a numeric (integer or float) value. For example, diameter is an attribute of data type Ranged Number. In order to enable relative numeric queries, the API exposes nine variations of each Ranged Number attribute as filter attribute names. These variations are the base attribute name with one of the suffixes: eq, gt, gte, btw, btwe, ebtw, ebtwe, lte, or lt, e.g. diameter_eq. The value array of the btw, btwe, ebtw, and ebtwe variations must contain exactly two numeric values, in numeric order (least value first). The value array of the eq, gt, gte, lte, and lt variations must contain exactly one numeric value. The (attribute, value array) pair for a Ranged Number attribute is satisfied according to the suffix as follows:
eq: If an attribute is equal to the value in the value array
gt: If an attribute is greater than the value in the value array
gte: If an attribute is greater than or equal to the value in the value array
btw: if an attribute is greater than the first value and less than the second value in the value array
ebtw: if an attribute is greater than or equal to the first value and less than the second value in the value array
btwe: if an attribute is greater than the first value and less than or equal to the second value in the value array
ebtwe: if an attribute is greater than or equal to the first value and less than or equal to the second value in the value array
lte: If an attribute is less than or equal to the value in the value array
lt: If an attribute is less than the value in the value array
The API supports defining and saving cohorts, as well as accessing the user's previously saved cohorts, whether defined through the portal or the API. Through the API, the user can obtain information about their previously defined cohorts, including the definition of each cohort in terms of a filter set and IDC version. The user can also obtain a manifest of the objects in the cohort. The data in the manifest is highly configurable and can be used, with suitable tools, to obtain DICOM files from cloud storage. A manifest returned by the API can include values from a large set of fields.
The POST /cohorts API endpoint creates and saves a cohort as defined by a set of filters and other cohort metadata. Here is an example JSON encoded cohort definition:
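(A reconstruction with illustrative values; the filters component follows the filter set structure described above.)
{
  "name": "mycohort",
  "description": "Example cohort",
  "filters": {
    "collection_id": ["tcga_luad"],
    "Modality": ["CT", "MR"]
  }
}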
Note that the cohort definition does not include an idc_data_version, because the cohort's version is implicitly the current IDC version when defining a new cohort.
The new cohort is saved under the IDC account of the caller of the API endpoint. The GET /cohorts API endpoint returns a list of the currently saved cohorts of the caller.
The DELETE /cohorts/{cohort_id} endpoint deletes a cohort as specified by its cohort_id. The DELETE /cohorts API endpoint deletes zero or more cohorts as specified by a list of cohort_ids. A user may only delete their own cohorts.
This API is designed for use by developers of image analysis and data mining tools to directly query the public resources of the IDC and retrieve information into their applications. The API complements the IDC web application but eliminates the need for users to visit the IDC web pages to perform cohort creation, manifest export, and transfer of image data to some local file system.
The API is a RESTful interface, accessed through web URLs. There is no software that an application developer needs to download in order to use the API. The application developer can build their own access routines using just the API documentation provided. The interface employs a set of predefined query functions that access IDC data sources.
The IDC API is intended to enable exploration of IDC hosted data without the need to understand and use the Structured Query Language (SQL). To this end, data exploration capabilities through the IDC API are limited. However, IDC data is hosted using the standard capabilities of the Google Cloud Platform (GCP) Storage (GCS) and BigQuery (BQ) components. Therefore, all of the capabilities provided by GCP to access GCS storage buckets and BQ tables are available for more advanced interaction with that data.
The API will return collection metadata for the current IDC data version. The request can be run by clicking on the ‘Execute’ button.
The Swagger UI submits the request and shows the curl command that was submitted. The ‘Response body’ section will display the response to the request. The expected format of the response to this API request is shown below:
The actual JSON formatted response can be downloaded by selecting the ‘Download’ button.
Some of the API calls require authentication. This is denoted by a small lock symbol. Authentication can be performed by clicking on the ‘Authorize’ button at the top right of the page.
The syntax for all of the API data structures is detailed at the bottom of the UI page.
API Endpoints
The API can be accessed from the command line using curl or wget. Here we discuss using curl for this purpose.
You access an API endpoint by sending an HTTP request to the IDC API server. The server replies with a response that either contains the data you requested, or a status indicator. An API request URL has the following structure:
<BaseURL><API version><QueryEndpoint>?<QueryParameters>.
The <BaseURL> of the IDC API is https://api.imaging.datacommons.cancer.gov.
For example, this curl command requests metadata on all IDC collections from the V2 API:
curl -X GET "https://api.imaging.datacommons.cancer.gov/v2/collections" -H "accept: application/json"
Note, also, that the HTTP method defaults to GET. However, a POST or DELETE HTTP method must be specified with the -X parameter.
The IDC API UI displays the curl commands which it issues and thus can be a good reference when constructing your own curl commands.
Some of the API endpoints, such as /collections and /cohorts/preview, can be accessed without authorization. APIs that access user-specific data, such as saved cohorts, necessarily require account authorization.
To access those APIs that require IDC authorization, you will need to generate a credentials file. To obtain your credentials:
Execute the idc_auth.py script, e.g.:
$ python ./idc_auth.py
Refer to the idc_auth.py file for detailed instructions.
$ TOKEN=$(more ~/.idc_credentials | jq -r '.["token_response"]["id_token"]')
The extracted token can then be used to authenticate to the API, e.g. to get a list of your cohorts:
$ curl -X GET "https://api.imaging.datacommons.cancer.gov/v2/cohorts" -H "accept: application/json" -H "Authorization: Bearer $TOKEN"
If you pipe the result to jq:
$ curl -X GET "https://api.imaging.datacommons.cancer.gov/v2/cohorts" -H "accept: application/json" -H "Authorization: Bearer $TOKEN" | jq
Then you should see something like this:
In Python, we can issue the following request to obtain a list of the collections in the current IDC version:
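(A minimal sketch using the requests package; the exact shape of the response body should be checked against the API documentation.)
import requests

response = requests.get(
    "https://api.imaging.datacommons.cancer.gov/v2/collections",
    headers={"accept": "application/json"},
)
response.raise_for_status()
collections = response.json()  # a JSON object describing the original collections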
The /cohorts/manifest/preview and /cohorts/manifest/{cohort_id} endpoints are paged. That is, several calls of the API may be required to return all the data resulting from such a query. Each endpoint accepts a page_size parameter in the manifestBody or manifestPreviewBody that specifies the maximum number of rows that the client wants the server to return. The returned data from each of these APIs includes a next_page value. next_page is null if there is no more data to be returned; if it is non-null, more data is available.
In the case that the returned next_page value is not null, the /cohorts/manifest/nextPage or /cohorts/manifest/preview/nextPage endpoint can be accessed, passing the next_page token returned by the previous call.
The manifest endpoints may return an HTTP 202 status. This indicates that the request was accepted but that processing timed out before it was completed. In this case, the client should resubmit the request, including the next_page token that was returned with the 202 response.
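A paging loop might therefore look like the following sketch. The body values are illustrative, and whether the next_page token is passed as a query parameter or in the request body should be verified against the SwaggerUI.
import requests

BASE = "https://api.imaging.datacommons.cancer.gov/v2"
body = {
    "cohort_def": {
        "name": "mycohort",
        "description": "Example cohort",
        "filters": {"collection_id": ["tcga_luad"], "Modality": ["CT", "MR"]},
    },
    "fields": ["SeriesInstanceUID"],
    "page_size": 1000,
}

rows = []
page = requests.post(f"{BASE}/cohorts/manifest/preview", json=body).json()
rows.extend(page["manifest"]["manifest_data"])
# keep requesting pages until next_page comes back null
while page.get("next_page"):
    page = requests.get(
        f"{BASE}/cohorts/manifest/preview/nextPage",
        params={"next_page": page["next_page"], "page_size": 1000},
    ).json()
    rows.extend(page["manifest"]["manifest_data"])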
The version of the portal is shown at the bottom of the portal page. The semantics of the version is the following:
canceridc.<date of webapp deployment in YYYYMMDDHHMM>.<first 6 characters of the commit hash>.
on the Explore Images page the IDC internal id for each collection can now be copied from the Collections table by clicking the corresponding copy icon
on the Explore Images page the IDC case id can now be copied from the Selected Cases table by clicking the corresponding copy icon
Main highlights of this release include:
add a choice of several viewers (OHIF v2, OHIF v3, VolView, Slim) for viewing image files
Main highlights of this release include:
s5cmd file manifests can now be generated from the Explore images page for individual studies and series
Main highlights of this release include:
The file manifest for a filter can be downloaded without logging into the portal and creating a persistent cohort
Main highlights of this release include:
Three new Original Image attributes Max Total Pixel Matrix Columns, Max Total Pixel Matrix Rows, and Min Pixel Spacing are added.
Two new Quantitative Analysis attributes Sphericity (Quant) and Volume of Mesh are added.
Default attribute value order is changed from alphanumeric (by value name) to value count.
Main highlights of this release include:
As limited access collections have been removed from IDC, the portal is now simplified by removing the option of selecting different access levels. All collections in the portal are public.
A warning message appears on the cohort browser page when a user views a cohort that used the Access filter attribute. That attribute is no longer applied if the user migrates the cohort to the current version.
On the explorer page the reset button has been moved to improve viewability.
This was primarily a data release. There were no significant changes to the portal.
Main highlights of this release include:
User control over how selection of multiple filter modalities defines the cohort. Previously, when multiple modalities were selected, the cohort would include the cases that had ANY of the selected modalities. Now the user can choose whether the cohort includes the cases that contain ANY of the selected modalities or only those that have ALL of the selected modalities.
Main highlights of this release include:
Ability to select specific Analysis Results collections with segmentation and radiomic features
Text boxes added to the slider panels to allow the user to input upper and lower slider bounds
Pie chart tooltips updated to improve viewability
Main highlights of this release include:
Eleven new collections added
Number of cases, studies, and series in a cohort are reported in the filter definition
On the Exploration page the Access attribute is placed in the Search Scope
On the Exploration page users are warned when they create a cohort that includes Limited Access collections
Series Instance UID is reported in the Selected Series table
Main highlights of this release include:
The BigQuery query string corresponding to a cohort can now be displayed in user-readable format by pressing a button on either the cohort or cohort list pages
On the exploration page collections can now be sorted alphabetically or by the number of cases. Selected cases are ordered at the top of the collection list
Table rows can be selected by clicking anywhere within the row, not just on the checkbox
The BigQuery export cohort manifest includes the IDC data version as an optional column
Main highlights of this release include:
Collections which have limited access are now denoted as such in the Collection tab on the Exploration page
Links to image files belonging to limited collections have been removed from the Studies and Series tables on the Exploration page
The quota of image file data that can be served per user per day has been reduced from 137 to 40 GB
Main highlights of this release include:
New attributes including Manufacturer, Manufacturer Model Name, and Slice Thickness added
Checked attribute values are now shown at the top of the attribute value lists
Ability to search by CaseID added to the Selected Cases table
Ability to search by StudyID added to the Selected Studies table
Study Date added to the Studies Table
Changed the persistence of the StudyID tooltip in the tables so that the StudyID can be copied from the tooltip
Specific columns can now be selected in the BigQuery cohort export
The Imaging Data Commons Explore Image Data portal is a platform that allows users to explore, filter, create cohorts, and view image studies and series using cutting-edge technology viewers.
Main highlights of this release include:
Support for slide microscopy series from the CPTAC-LSCC and CPTAC-LUAD collections is now included.
Search boxes are included for every attribute to search for specific attribute values by name.
Main highlights of this release include:
112 data collections are now included
Cohort data version is reported
Cohort statistics, i.e. the number of cases, studies, and series per cohort, are now reported
Mechanism included to update a cohort to a new data version
Species Attribute is included
Checkbox and plus/minus icons are now used to select table rows
Main highlights of this release include:
The user details page will no longer return a 500 error when selected
Sorting of studies panel is now active for all fields
Re-sending of an unreceived verification email is now more clearly explained.
IDC identity login header and column selection is disabled for the export of a cohort manifest to BigQuery
Detailed information panel added to efficiently describe why some pie charts have multiple facets even when a filter is selected
Cohort manifest export popup can be scrolled down
Use of Shift or Control (Command for Mac) selection of studies will now behave as expected: Shift-select for a contiguous series of rows, Control/Command-select for individual rows.
All filter selections are now sorted alphabetically
Main highlights of this release include:
A consistent number of files is now returned between the portal and BigQuery
When the user clicks a non-gov link a popup will appear
Cohort manifest export information now has clickable URLs to take you to the BigQuery console
The collections list displays 100 entries by default
Any empty search criterion is now highlighted in grey and no data will be listed
The user will no longer need to scroll to see search criteria in the left search configuration panel
Portal footer is now in compliance with NCI requirements
Check/uncheck control in the collections panel added for the TCGA collection
Main highlights of this release include:
Case-level table is added to the portal
Cohorts can now be exported into BigQuery tables using the Export Cohort Manifest button
Cohorts with fewer than 650k rows can now be downloaded as a multipart file. Larger cohorts can only be exported to BigQuery (for users that are logged in with Google Accounts)
Quantitative filter ranges are updated dynamically with the updates to filter selection
Pie charts will display "No data available" message when zero cases are returned for the given filter selection
RTPLAN and Real World Mapping Attribute values are now disabled at the series level, since they cannot be visualized in the IDC Viewer
Various bug fixes in both the IDC Portal and IDC Viewer
Main features in this initial release include:
The ability to search for data in BigQuery and Solr
The ability to search by multiple attributes:
Collection
Original attributes e.g., Modality
Derived attributes e.g., Segmentations
Qualitative analysis e.g., Lobular pattern
Quantitative analysis e.g., Volume
Related attributes e.g., Country
Display of collections results in a tabular format with the following information:
Collection Name
Total Number of Cases
Number of Cases (this cohort)
Display of the Selected Studies results in tabular format with the following information:
Project Name
Case ID
Study ID
Study Description
Display of the Selected Series results in tabular format with the following information:
Study ID
Series Number
Modality
Body Part Examined
Series Description
The ability to hide attributes with zero cases present
The ability to save cohorts
The ability to download the manifest of any cohort created
The ability to promote, filter, and load multiple series instances in the OHIF viewer
IDC integrates two different viewers, which are used depending on the type of images being opened. Visualization of radiology images uses the open-source OHIF Viewer v3. The Slim viewer is used for visualization of pathology and slide microscopy images. We customized both of those viewers slightly to add features specific to IDC. You can find all of those modifications in the respective forks of the OHIF and Slim viewers under the IDC GitHub organization. IDC Viewer is opened every time you click the "eye" icon in the study or series table of the IDC Portal.
IDC Viewer is a "zero-footprint" client-side viewer: before you can see the image in the viewer, it has to be downloaded to your browser from the IDC DICOM stores. IDC Viewer receives this data through a proxy, via the DICOMweb interface implemented in GCP.
Currently, the IDC Viewer proxy limits the amount of data that can be downloaded in one day to 137 GB per IP address, and enforces a total daily quota over all of the IP addresses. If the quota is exhausted, you will not be able to see any images in IDC Viewer until the limit is reset. We may adjust the current proxy limits in the future, and you are welcome to provide your feedback on the appropriateness of the current quota.
If you want to report a problem related to visualization of a specific study in the IDC Viewer, please use the "Debug Info" tool to collect debugging information, and include the entire content of that debugging information in your report to help us investigate the issue.
You can use IDC Viewer to visualize any of the suitable data in IDC. To configure the IDC Viewer URL, simply append the StudyInstanceUID of a study available in IDC to the viewer URL prefix (one prefix for the radiology viewer, another for the digital pathology viewer). This will open the entire study in the viewer. You can also configure the URL to open specific series of the study, as defined by the list of SeriesInstanceUID items. When you open the IDC Viewer from the IDC Portal, the URLs of the pages will be populated following those conventions.
open the entire study with the StudyInstanceUID 1.3.6.1.4.1.14519.5.2.1.6279.6001.224985459390356936417021464571
open the specified subset of series from the study above
Digital pathology viewer uses a slightly different convention, as should be evident from this example URL:
You can use Google Firebase to deploy v2 radiology or microscopy viewers as web applications, without having to use virtual machines or docker, and for free!
You can also visualize images inside a Colab/Jupyter notebook; see the corresponding documentation section for details.
You can use the open-source VolView zero-footprint viewer to visualize and volume render any image series by simply pointing it to the cloud bucket with the files; see the corresponding documentation section for details.
You can copy identifiers of the individual collections, cases, studies or series to the clipboard; those can be used to download the corresponding files, as discussed in the download section, using the command-line download tool or the 3D Slicer IDC extension.
TL;DR: if you want to download images from IDC, you can do it without charge, limits or sign-ins from our cloud storage buckets. See the instructions in the download section below.
The primary mechanism for accessing data from IDC is by searching the metadata using the idc-index python package or BigQuery tables, and downloading the binary files from public cloud buckets, as discussed below. There is no limit, quota or fee associated with downloading IDC files from the buckets.
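For example, a minimal idc-index sketch (method and column names per the idc-index package documentation at the time of writing; verify against the current release):
from idc_index import index

client = index.IDCClient()
# query the package's built-in metadata index with SQL
df = client.sql_query("SELECT SeriesInstanceUID FROM index WHERE Modality = 'CT' LIMIT 1")
# download all files of the selected series to the current directory
client.download_from_selection(seriesInstanceUID=df["SeriesInstanceUID"].tolist(), downloadDir=".")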
Effective March 2024, as a pilot project, IDC also provides access to the DICOM data via a dedicated, read-only DICOMweb endpoint. It routes the requests to the Google Healthcare API DICOM store containing IDC data.
Use the command-line python tool or the 3D Slicer IDC browser extension to download the files for your selection, as discussed in the download section.
The OHIF Viewer is a zero-footprint medical image viewer provided by the Open Health Imaging Foundation (OHIF). It is a configurable and extensible progressive web application with out-of-the-box support for image archives which support DICOMweb.
Slim is a lightweight server-less single-page application for interactive visualization of digital slide microscopy (SM) images and associated image annotations in standard DICOM format. The application is based on the dicom-microscopy-viewer library and can simply be placed in front of a compatible Image Management System (IMS), Picture Archiving and Communication System (PACS), or Vendor Neutral Archive (VNA).
Filter sets were introduced previously. In this section we describe how filter sets are specified to the API.
The SwaggerUI can be used to see details about the syntax for each call, and also provides an interface to test requests. Each endpoint is also documented in the corresponding section.
For a quick demonstration of the syntax of an API call, test the GET /collections request. You can experiment with this endpoint by clicking the ‘Try it out’ button, and then the 'Execute' button.
This section describes version 2 of the IDC REST API. The documentation for the version 1 API can be found in the corresponding section.
The IDC API conforms to the OpenAPI specification, which "defines a standard, language-agnostic interface to RESTful APIs which allows both humans and computers to discover and understand the capabilities of the service without access to source code, documentation, or through network traffic inspection."
If you have feedback about the desired features of the IDC API, please let us know via the IDC support forum.
SwaggerUI is a web-based interface that allows users to try out APIs and easily view their documentation. You can access the IDC API SwaggerUI online.
This serves as an interactive tutorial for accessing the IDC API using Python.
The SwaggerUI can be used to see details about the syntax for each call, and also provides an interface to test requests.
For a quick demonstration of the syntax of an API call, test the GET /collections request. You can experiment with this endpoint by clicking the ‘Try it out’ button.
Clone the idc_auth.py script to your local machine.
The jq utility is useful when dealing with JSON in the command line context. Assuming jq is installed, and that idc_auth.py has created the credentials file ~/.idc_credentials (the default location), then the following will extract the id token to a variable:
We expect that most API access will be programmed access, and, moreover, that most programmed access will be from within a Python script. This usage is covered in detail (along with details on each of the IDC API endpoints) in the Google Colab notebook. Here we provide just a brief overview.
where the revision hash corresponds to that of the deployed commit.
The Export Cohort Manifest popup now includes options to download manifests that can be used by s5cmd to download image files from IDC's s3 buckets in GCP or AWS. Instructions are provided for using s5cmd with these manifests
The Slim viewer is now configured to view slide microscopy series
A manifest is a table of access methods and other metadata of the objects in some cohort. There are two manifest endpoints. The POST /cohorts/manifest/{cohort_id} API endpoint returns a manifest of some previously defined cohort. Parameters are sent to the endpoint in the request body. The JSON schema of the manifestBody can be seen on the IDC API v2 UI page. Here is an example:
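(A reconstruction with illustrative values; the field list matches the example manifest discussed later in this section.)
{
  "fields": ["Modality", "SliceThickness", "age_at_diagnosis", "aws_bucket", "crdc_series_uuid"],
  "counts": false,
  "group_size": false,
  "sql": false,
  "page_size": 1000
}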
The fields parameter of the body indicates the fields whose values are to be included in the returned manifests. The /fields API endpoint returns a list of the fields that can be included in a manifest.
The counts, group_size, sql and page_size parameters will be described in subsequent sections.
Every row in the returned manifest will include one value for each of the above fields.
The POST /cohorts/manifest/preview API accepts both a fields list, and a cohort definition in the manifestPreviewBody. Here is an example manifestPreviewBody:
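(A reconstruction with illustrative values.)
{
  "cohort_def": {
    "name": "mycohort",
    "description": "Example cohort",
    "filters": {
      "collection_id": ["tcga_luad"],
      "Modality": ["CT", "MR"]
    }
  },
  "fields": ["Modality", "SliceThickness", "age_at_diagnosis", "aws_bucket", "crdc_series_uuid"],
  "counts": false,
  "group_size": false,
  "page_size": 1000
}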
This endpoint behaves like the following API sequence:
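POST /cohorts
POST /cohorts/manifest/{cohort_id}
DELETE /cohorts/{cohort_id}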
That is, it behaves as if a cohort is created, a manifest for that cohort is returned and the new cohort is deleted.
The /cohorts/manifest/{cohort_id} endpoint returns a manifestResponse JSON object and the /cohorts/manifest/preview endpoint returns a manifestPreviewResponse JSON object. Here is an example manifestResponse:
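(A reconstruction of the general shape; the component names follow the descriptions below, and all values are illustrative.)
{
  "cohort_def": {
    "cohort_id": 123,
    "name": "mycohort",
    "description": "Example cohort",
    "filters": {"collection_id": ["tcga_luad"], "Modality": ["CT", "MR"]},
    "idc_data_version": "..."
  },
  "user_email": "user@example.com",
  "manifest": {
    "manifest_data": [
      {"Modality": "CT", "SliceThickness": "1.0", "age_at_diagnosis": 65,
       "aws_bucket": "idc-open-data", "crdc_series_uuid": "..."}
    ],
    "rowsReturned": 626,
    "totalFound": 626
  },
  "next_page": null
}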
The cohort definition is included so that the manifest is self-documenting. The manifest_data component of the manifest component contains a row for each distinct combination of the requested fields in the cohort. The idc_data_version in the cohort_def is the IDC version when the cohort was created. To generate the manifest, the cohort's filter is applied against the data in that IDC version.
The structure of the manifestPreviewResponse returned by the /cohorts/manifest/preview API endpoint is identical to the manifestResponse except that it does not have a cohort_id or user_email component.
Because the /cohorts/manifest/preview API endpoint is always applied against the current IDC version, the idc_data_version in the cohort_def is always that of the current IDC version.
The next_page value is described in the next section.
We use the term group to indicate the set of all instances in the cohort having the values of some row in the manifest. Thus the values of the first row above:
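{"Modality": "CT", "SliceThickness": "1.0", "age_at_diagnosis": 65, "aws_bucket": "idc-open-data", "crdc_series_uuid": "..."}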
implicitly define a group of instances in the cohort, each of which has those values.
When the group_size parameter in the manifestBody or manifestPreviewBody is true, the resulting manifest includes the total size in bytes of the instances in the corresponding group. Following is a fragment of the manifest for the same cohort above, but with group_size enabled:
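(Reconstructed; only the group_size value is taken from the discussion below.)
{"Modality": "CT", "SliceThickness": "1.0", "age_at_diagnosis": 65, "aws_bucket": "idc-open-data", "crdc_series_uuid": "...", "group_size": 2690320}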
Here we see that the instances in the group corresponding to the first result row have a total size of 2,690,320 bytes.
The totalFound value at the end of the manifest tells us that there are 626 rows in the manifest, meaning the manifest contains 626 different combinations of Modality, SliceThickness, age_at_diagnosis, aws_bucket, and crdc_series_uuid. (The group size does not add to the combinatorics.) The rowsReturned value indicates that all the rows in the manifest were returned in the first "page". If not all the rows had been returned, we could ask for additional "pages" as described in the next section.
The group_size parameter is optional and defaults to false.
If the counts parameter is true, the resulting manifest will selectively include counts of the instances, series, studies, patients and collections in each group. Which counts are included in a manifest is determined by the manifest's granularity, which, in turn, is determined by certain of the possible fields in the fields parameter list of the manifestBody or manifestPreviewBody.
For example, if the fields parameter list includes the SOPInstanceUID field, there will be one group per instance in the manifest. Thus the manifest has instance granularity. A manifest has one of instance, series, study, patient, collection or version granularity.
For a given manifest granularity, and when counts is True, counts of the "lower level" objects are reported in the manifest. Thus, if a cohort has series granularity, then the count of all instances in each group is reported. If a cohort has study granularity, then the count of all instances in each group and of all series in each group is reported. And so on. This is described in detail in the remainder of this section.
In the following, manifest examples are based on this filterSet:
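(A reconstruction with illustrative values, consistent with the manifest fragments shown later in this section.)
{
  "collection_id": ["tcga_luad"],
  "Modality": ["CT", "MR"]
}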
A manifest will have instance granularity if the fields parameter list includes one or both of the fields:
SOPInstanceUID
crdc_instance_uuid
Both of these fields are unique to each instance. Therefore the resulting manifest will include one row for each instance in the specified cohort. For example, the following fields list will result in a manifest having a row per instance:
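["SOPInstanceUID", "Modality", "SliceThickness"]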
Each row will include the SOPInstanceUID, Modality and SliceThickness of the corresponding instance.
The counts parameter is ignored because there are no objects at a 'lower level' than instances.
A manifest will have series granularity if it does not have instance granularity and the fields parameter list includes one or more of these fields:
SeriesInstanceUID
crdc_series_uuid
Both of these fields are unique to each series, and therefore the resulting manifest will include at least one row per series in the specified cohort. For example, the following fields list will result in a manifest having one or more rows per series:
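["SeriesInstanceUID", "Modality", "SliceThickness"]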
Because the SeriesInstanceUID is unique to each series in a cohort (more accurately, all instances in a series have the same SeriesInstanceUID), there will be at least one row per series in the resulting manifest. However, SliceThickness is not necessarily unique across all instances in a series. Therefore, the resulting manifest may have multiple rows for a given series: rows in which the SeriesInstanceUID is the same but the SliceThickness values differ. DICOM Modality should always be the same for all instances in a series; therefore it is not expected to result in multiple rows per series.
If the counts parameter is true, each row of the manifest will have:
an instance_count value that is the count of instances in the group corresponding to the row
If the fields list is as above, then this is a fragment of the series granularity manifest of our example cohort:
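(Reconstructed; only the instance_count value is taken from the discussion below.)
{"SeriesInstanceUID": "...", "Modality": "CT", "SliceThickness": "1.0", "instance_count": 151}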
This tells us that the group of instances corresponding to the first row of the manifest results has 151 members.
A manifest will have study granularity if it does not have series or instance granularity and the fields list includes one or more of the fields:
StudyInstanceUID
crdc_study_uuid
Both of these fields are unique to each study, and therefore the resulting manifest will include at least one row per study in the specified cohort. For example, the following fields list will result in a manifest having one or more rows per study:
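["StudyInstanceUID", "Modality", "SliceThickness"]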
Similarly, SliceThickness can vary not only among the instances in a series, but among the series in a study. Therefore, the resulting manifest may have multiple rows for a study, which differ from each other in both SliceThickness and Modality.
If the counts parameter is true, each row of the manifest will have:
an instance_count value that is the count of instances in the group corresponding to the row
a series_count value that is the count of series in the group corresponding to the row
If the fields list is as above, then this is a fragment of the study granularity manifest of our example cohort:
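(Reconstructed; only the counts are taken from the discussion below.)
{"StudyInstanceUID": "...", "Modality": "CT", "SliceThickness": "1.0", "instance_count": 212, "series_count": 2}
...
{"StudyInstanceUID": "...", "Modality": "MR", "SliceThickness": null, "instance_count": 2, "series_count": 1}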
This tells us that the group of instances corresponding to the first row of the manifest results has 212 members, divided among two series. The group of instances corresponding to the third row of the manifest results has two members in a single series.
A manifest will have patient granularity if it does not have study, series or instance granularity and the fields list includes the field PatientID. This field is unique to each patient, and therefore the resulting manifest will include at least one row per patient in the specified cohort. For example, the following fields list will result in a manifest having one or more rows per patient:
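(An illustrative choice of fields:)
["PatientID", "Modality", "SliceThickness"]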
Because the PatientID is unique to each patient in a cohort (more accurately, all instances associated with a patient have the same PatientID), there will be at least one row per patient in the resulting manifest. It is common for a patient's series to examine different body parts. Therefore, the resulting manifest may well have more than one row per patient.
If the counts parameter is true, each row of the manifest will have:
an instance_count value that is the count of instances in the group corresponding to the row
a series_count value that is the count of series in the group corresponding to the row
a study_count value that is the count of studies in the group corresponding to the row
If the fields list is as above, then this is a fragment of the patient granularity manifest of our example cohort:
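(Reconstructed; only the counts are taken from the discussion below.)
{"PatientID": "...", "Modality": "CT", "SliceThickness": "1.0", "instance_count": 212, "series_count": 2, "study_count": 1}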
This tells us that the group of instances corresponding to the first row of the manifest results has 212 members divided among two series, and both in a single study.
A manifest will have collection granularity if it does not have patient, study, series or instance granularity and the fields parameter list includes the field collection_id. This field is unique to each collection, and therefore the resulting manifest will include at least one row per collection in the specified cohort. For example, the following fields list will result in a manifest having one or more rows per collection:
Because the collection_id is unique to each collection in a cohort (more accurately, all instances in a collection have the same collection_id), there will be at least one row per collection in the resulting manifest. It is common for a collection to have patients of different ages. Therefore, the resulting manifest may well have more than one row per collection.
If the fields list is as follows:
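(An illustrative choice of fields, consistent with the discussion above:)
["collection_id", "age_at_diagnosis"]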
then this is a fragment of the collection granularity manifest of our example cohort:
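(Reconstructed with illustrative values and counts.)
{"collection_id": "tcga_luad", "age_at_diagnosis": 65, "instance_count": 1320, "series_count": 42, "study_count": 18, "patient_count": 9}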
A manifest will have version granularity if it does not have collection, patient, study, series or instance granularity. At this granularity level, the rows in the manifest return the combinations of queried values across all collections, patients, studies, series and instances in the cohort.
When the fields list is as follows:
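["Modality", "SliceThickness"]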
then this is a fragment of the version granularity manifest of our example cohort:
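(Reconstructed; the instance count and totalFound value are taken from the discussion below.)
{"Modality": "CT", "SliceThickness": null, "instance_count": 212}
...
"totalFound": 87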
Row one of the results tells us that the cohort has 212 instances having a Null SliceThickness and modality="CT". Also, there are apparently 87 different combinations of Modality and SliceThickness in the cohort as shown by the totalFound value.
The IDC API is based on several IDC Data Model Concepts.
In IDC, a cohort is a set of subjects (DICOM patients) that are identified by applying a Filter Set to the Data Sources of some IDC data version. Because a cohort is defined with respect to an IDC data version, the set of subjects in the cohort, as well as all metadata associated with those subjects, is exactly and repeatably defined.
Over time, the set of data hosted by the IDC will change. For the most part, such changes will be due to new data having been added. The totality of IDC hosted data resulting from any such change is represented by a unique IDC data version ID. That is, each time that the set of publicly available data changes, a new IDC version is created that exactly defines the revised data set.
The IDC data version is intended to enable the reproducibility of research results. For example, consider a patient in the DICOM data model. Over time, new studies might be performed on a patient and become associated with that patient, and the corresponding DICOM instances will then be added to the IDC hosted data. Moreover, additional patients might well be added to the IDC data set over time. This means that the set of subjects defined by some filtering operation will change over time. Thus, for purposes of reproducibility, we define a cohort in terms of a set of filter groups and an IDC data version.
Note that on occasion some data might be removed from a collection, though this is expected to be rare. Such a removal will result in a new IDC data version which excludes that data. Such removed data will, however, continue to be available in any previous IDC data version in which it was available. There is one exception: data that is found to contain Personally Identifiable Information (PII) or Protected Health Information (PHI) will be removed from all IDC data versions.
Note: currently a cohort is always defined in terms of a single filter group and an IDC Data Version. In the future we may add support for multiple filter groups.
A filter group selects some set of subjects in the IDC hosted data, and is a set of conditions, where each condition is defined by an attribute and an array of values. An attribute identifies a field (column) in some data source (BQ table). Each filter group also specifies the IDC data version upon which it operates.
A filter group selects a subject if, for every attribute in the filter group, some datum associated with the subject satisfies one or more of the values in the associated array of values. A datum satisfies a value if it is equal to, less than, less than or equal to, between, greater than or equal to, or greater than, as required by the attribute. This is explained further below.
For example, the (attribute, [values]) pair (Modality, [MR, CT]) is satisfied if a subject "has" a Modality of MR or CT in any data associated with that subject. Thus, this (attribute, [values]) pair would be satisfied, for example, by a subject who has one or more MR series but no CT series.
Note that if a filter group includes more than one (attribute, [values]) pair having the same attribute, then only the last such (attribute, [values]) pair is used. Thus if a filter group includes the (attribute, [values]) pairs (Modality, [MR]) and (Modality, [CT]), in that order, only (Modality, [CT]) is used.
Here is an example filter group:
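(A reconstruction consistent with the description below; the _btw suffix is one plausible encoding of the age range.)
{
  "collection_id": ["TCGA-LUAD", "TCGA-KIRC"],
  "Modality": ["CT", "MR"],
  "race": ["WHITE"],
  "age_at_diagnosis_btw": [53, 69]
}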
This filter group will select any subject in the TCGA-LUAD or TCGA-KIRC collections, if the subject has any DICOM instances having a modality of CT or MR, the subject's race is WHITE, and the subject's age at diagnosis is between 53 and 69.
A collection is a set of DICOM data provided by a single source. Collections are further categorized as Original collections or Analysis collections. Original collections are comprised primarily of DICOM image data that was obtained from some set of patients. Typically, the patients in an Original collection are related by a common disease.
Analysis collections are comprised of DICOM data that was generated by analyzing other (typically Original) collections. Typically such analysis is performed by a different entity than that which provided the original collection(s) on which the analysis is based. Examples of data in analysis collections include segmentations, annotations and further processing of original images. Note that some Original collections include such data, though most of the data in Original collections are original images.
A data source is a BQ table that contains some part of the IDC metadata complement. API queries are performed against one or more such tables that are joined (in the relational database model sense). Data sources are classified as being of type Original, Derived or Related. Original data sources contain DICOM metadata from the DICOM objects in TCIA Original and TCIA Analysis collections. Derived data sources contain processed data: in general this is analytical data that has been processed to enable easier SQL searches. Related data sources contain ancillary data that may be specific to some set of collections. For example, TCGA biospecimen and clinical data are maintained in such tables.
Data sources are versioned. That is, when the data in a data source changes, a new version of that set of data is defined. An IDC data version is defined in terms of a specific version of each data source. Note that over time, new data sources may be added (or, less likely, removed). Thus two IDC data versions may have a different number of data sources.
Both the IDC Web App and API expose selected fields in the various data sources against which queries can be performed. Each attribute has a data type, one of:
Continuous Numeric An attribute with data type Continuous Numeric will have a numeric (float) value. For example, age_at_diagnosis is an attribute of data type Continuous Numeric. In order to enable relative numeric queries, the API exposes nine variations of each Continuous Numeric attribute as filter set attribute names. These variations are the base attribute name with no suffix, as well as the base attribute name with one of the suffixes: _gt, _gte, _btw, _btwe, _ebtw, _ebtwe, _lte, _lt. The value array of the _btw, _btwe, _ebtw, and _ebtwe variations must contain exactly two numeric values, in numeric order (least value first). The value array of the other variations must contain exactly one numeric value. The (attribute, value array) pair for a Continuous Numeric attribute is satisfied according to the suffix as follows:
<no suffix>: if the attribute is equal to the value in the value array
_gt: if the attribute is greater than the value in the value array
_gte: if the attribute is greater than or equal to the value in the value array
_btw: if the attribute is greater than the first value and less than the second value in the value array
_ebtw: if the attribute is greater than or equal to the first value and less than the second value
_btwe: if the attribute is greater than the first value and less than or equal to the second value
_ebtwe: if the attribute is greater than or equal to the first value and less than or equal to the second value
_lte: if the attribute is less than or equal to the value in the value array
_lt: if the attribute is less than the value in the value array
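For example, some hypothetical filter fragments using the age_at_diagnosis attribute mentioned above:

```python
# Each (attribute, [values]) pair below is a hypothetical filter fragment.
{"age_at_diagnosis": [65]}            # equal to 65
{"age_at_diagnosis_gte": [53]}        # greater than or equal to 53
{"age_at_diagnosis_btw": [53, 69]}    # greater than 53 and less than 69
{"age_at_diagnosis_ebtwe": [53, 69]}  # between 53 and 69, inclusive
```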
Categorical Numeric An attribute with data type Categorical Numeric has one of a defined set of numeric values. The corresponding value array must have a single numeric value.
A manifest is a list of access methods, and other metadata, for the data objects in some cohort. There are two types of access methods:
GUID
>> curl https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/bd68332e-521f-4c45-9a88-e9cc426f5a8d
{ "access_methods":[{ "access_id":"gs", "access_url":{ "url":"gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm" }, "region":"", "type":"gs" } ], "aliases":[ ], "checksums":[ { "checksum":"9a63c81a4b3b4bc3950678a4e9acc930", "type":"md5" } ], "contents":[ ], "created_time":"2021-08-27T21:15:02.385181", "description":null, "form":"object", "id":"dg.4DFC/bd68332e-521f-4c45-9a88-e9cc426f5a8d", "mime_type":"application/json", "name":"", "self_uri":"drs://nci-crdc.datacommons.io/dg.4DFC/bd68332e-521f-4c45-9a88-e9cc426f5a8d", "size":528622, "updated_time":"2021-08-27T21:15:02.385185", "version":"faf7385b" }
Resolving such a GUID returns a DrsObject. The access methods in the returned DrsObject include one or more URLs at which corresponding DICOM entities can be accessed. GUID manifests are recommended for long term archival and reference.
In the above, we can see that the returned DrsObject includes the GCS URL gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm.
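The same resolution can be scripted; a minimal sketch using Python's requests package (the GUID is the one from the example above):

```python
import requests

# Resolve a CRDC GUID to a DrsObject and extract its GCS URL.
DRS_SERVER = "https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/"
guid = "bd68332e-521f-4c45-9a88-e9cc426f5a8d"

drs_object = requests.get(DRS_SERVER + guid).json()
gcs_url = next(
    m["access_url"]["url"]
    for m in drs_object["access_methods"]
    if m["type"] == "gs"
)
print(gcs_url)  # gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm
```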
URL
The URLs in a URL based manifest can be used to directly access a DICOM instance in Google Cloud Storage. URLs are structured as follows:
gs://<GCS bucket>/<GUID>.dcm
This is a typical URL:
gs://idc-open/bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm
Though this is rare, the URL of an object can change over time. In such a case, the corresponding DrsObject will be updated with the new URL. However, the original URL will then be "stale".
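For example, here is a minimal sketch that fetches a single instance from such a URL using the google-cloud-storage client (this assumes, as described elsewhere in this documentation, that the bucket is publicly readable):

```python
from google.cloud import storage

# IDC buckets are public, so an anonymous client can read them
# without credentials.
client = storage.Client.create_anonymous_client()
blob = client.bucket("idc-open").blob("bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm")
blob.download_to_filename("bd68332e-521f-4c45-9a88-e9cc426f5a8d.dcm")
```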
Additional values can optionally be included in the returned manifest. See the manifest API descriptions for more details.
Some of the API calls require authentication. This is denoted by a small lock symbol. Authentication can be performed by clicking on the ‘Authorize’ button at the top right of the page.
The API will return collection metadata for the current IDC data version. The request can be run by selecting ‘Execute’.
The Swagger UI submits the request and shows the curl code that was submitted. The ‘Response body’ section will display the response to the request. The expected format of the response to this API request is shown below:
The actual JSON formatted response can be downloaded by selecting the ‘Download’ button.
The syntax for all of the API data structures is detailed at the bottom of the UI page.
This section describes v1 of the IDC REST API. This API is designed for use by developers of image analysis and data mining tools to directly query the public resources of the IDC and retrieve information into their applications. The API complements the IDC web application but eliminates the need for users to visit the IDC web pages to perform cohort creation, manifest export, and transfer of image data to a local file system.
The API is a RESTful interface, accessed through web URLs. There is no software that an application developer needs to download in order to use the API. The application developer can build their own access routines using just the API documentation provided. The interface employs a set of predefined query functions that access IDC data sources.
The IDC API is intended to enable exploration of IDC hosted data without the need to understand and use the Structured Query Language (SQL). To this end, data exploration capabilities through the IDC API are limited. However, IDC data is hosted using the standard capabilities of the Google Cloud Platform (GCP) Storage (GCS) and BigQuery (BQ) components. Therefore, all of the capabilities provided by GCP to access GCS storage buckets and BQ tables are available for more advanced interaction with that data.
This page provides details on each of the IDC API endpoints.
The following characteristics apply to all IDC APIs:
You access a resource by sending an HTTP request to the IDC API server. The server replies with a response that either contains the data you requested, or a status indicator.
An API request URL has the following structure: <BaseURL><API version><QueryEndpoint>?<QueryParameters>. For example, this curl command is a request for metadata on all IDC collections:
curl -X GET "https://api.imaging.datacommons.cancer.gov/v1/collections" -H "accept: application/json"
Authorization
Some of the APIs, such as /collections and /cohorts/preview, can be accessed without authorization. APIs that access user specific data, such as cohorts, necessarily require account authorization.
To access these APIs that require IDC authorization, you will need to generate a credentials file. To obtain your credentials:
Execute the idc_auth.py script, either from the command line or from within Python. Refer to the idc_auth.py file for detailed instructions.
Several IDC APIs, specifically /cohorts/manifest/preview, /cohorts/manifest/{cohort_id}, /cohorts/query/preview, /cohorts/query/{cohort_id}, and /dicomMetadata, are paged. That is, several calls of the API may be required to return all the data resulting from such a query. Each accepts a page_size query parameter that specifies the maximum number of objects the client wants the server to return. The returned data from each of these APIs includes a next_page value. next_page is null if there is no more data to be returned; if next_page is non-null, then more data is available.
There are corresponding /cohorts/manifest/nextPage, /cohorts/query/nextPage, and /dicomMetadata/nextPage endpoints that each accept two query parameters: next_page and page_size. When the returned next_page value is not null, the corresponding ../nextPage endpoint is accessed, passing the next_page token returned by the previous call.
The manifest and query endpoints may return an HTTP 202 status. This indicates that the request was accepted but processing timed out before it was completed. In this case the client should resubmit the request, including the next_page token that was returned with the 202 response.
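As a sketch of this paging pattern in Python (the endpoint paths and next_page semantics follow the description above; the placeholder token and cohort ID, and the exact shape of the response payload, are assumptions):

```python
import requests

BASE_URL = "https://api.imaging.datacommons.cancer.gov/v1"
token = "<ID token obtained via idc_auth.py>"  # placeholder
headers = {"Authorization": f"Bearer {token}"}
cohort_id = 123  # hypothetical cohort ID

# First page: request the cohort manifest with a page size of 1000.
page = requests.get(
    f"{BASE_URL}/cohorts/manifest/{cohort_id}",
    params={"page_size": 1000},
    headers=headers,
).json()
pages = [page]

# Keep calling the corresponding nextPage endpoint until next_page is null.
# (On an HTTP 202, resubmit the request with the returned next_page token.)
while page.get("next_page"):
    page = requests.get(
        f"{BASE_URL}/cohorts/manifest/nextPage",
        params={"next_page": page["next_page"], "page_size": 1000},
        headers=headers,
    ).json()
    pages.append(page)
```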
Use the IDC-provided Looker Studio template to build a custom dashboard for your cohort
Step 1: Prepare the manifest BigQuery table
Export the cohort manifest as a BigQuery table, and take note of the location of the resulting table.
Step 2: Duplicate the template
When prompted, do not change the default options, and click "Copy Report".
Step 3: Configure data source
Select "Resource > Manage added data sources"
Select "Edit" action:
Update the custom query as instructed. This will select all of the DICOM metadata available for the instances in your cohort.
For example, if the location of your manifest table is canceridc-user-data.user_manifests.manifest_cohort_101_20210127_213746, the custom query that will join your manifest with the DICOM metadata will be similar to the following:
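The exact query text is provided with the template's instructions; as a rough sketch of its shape (the SOPInstanceUID join column and the bigquery-public-data.idc_current.dicom_all table path are assumptions), shown here as a Python string, though in Looker Studio you paste the SQL itself into the custom query field:

```python
# Hypothetical custom query joining the manifest table with IDC DICOM
# metadata; follow the template's instructions for the exact text.
CUSTOM_QUERY = """
SELECT dicom_all.*
FROM `bigquery-public-data.idc_current.dicom_all` AS dicom_all
JOIN `canceridc-user-data.user_manifests.manifest_cohort_101_20210127_213746` AS manifest
  ON dicom_all.SOPInstanceUID = manifest.SOPInstanceUID
"""
```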
Once you have updated the query, click "Reconnect" in the upper right corner.
Make sure you select a valid Billing Project that you can use to support the queries!
Accept the prompt below, if shown (you may also be notified about changes to the schema of the table, so the message may differ).
Click "Done" on the next screen:
Click "Close" on the next screen:
You are done! The dashboard for your cohort is now live: you can "View" it to interact with the content, you can edit it to explore additional attributes of the cohort, and you can choose to keep it private or share it with a link!
Topic-specific dashboards
Program- and Collection-specific dashboards
In this section you can learn how to quickly make a custom Looker Studio dashboard to explore the content of your cohort, and find some additional examples of using Looker Studio for analyzing the content of IDC.
Follow these steps:
Once you have created a VM and your setup is complete, it’s very easy to connect to your VMs through SSH or the web desktop interface.
It is free for academics!
You can do a lot with the basic credit allocation! Entry-level allocations can be on the order of 100,000 SUs, while the burn rate is, for example, 8 SUs/hour for a medium-sized VM (8 CPUs/30 GB RAM). As a reference:
it takes about 1 hour to build the Slicer application from scratch on a medium-sized VM using 7 threads
Geared to help you save! Unlike the VMs you get from the commercial providers, JetStream VMs can be shelved. Once a VM is shelved, you spend zero SUs for keeping it around (in comparison, you will keep paying for the disk storage of your GCP VMs even when they are turned off).
Customer support is excellent! We received responses within 1-2 days. On some occasions, we observed glitches with Web Desktop, but those could often be resolved by restarting the VM.
Google Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources including GPUs.
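For example, a minimal sketch of getting started with GCP services from inside a Colab notebook (the project ID is a placeholder you would replace with your own billing project):

```python
# Inside a Colab notebook: authenticate once, after which GCP client
# libraries (e.g., BigQuery) can be used with your own billing project.
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
client = bigquery.Client(project="my-project")  # replace with your project
```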
Potentially interesting sources of example notebooks:
A GCP VM you want to use for code development is up and running
Run the gcloud compute config-ssh command to populate SSH config files with host entries for each VM instance you have running
If the previous step completed successfully, you should see the running VMs in the Remote Explorer of VS Code, as in the screenshot below, and should be able to open a new session to those remote VMs.
Note that the SSH configuration may change if you restart your VM. In this case you will need to re-run the configuration step above.
"In statistics, marketing and demography, a cohort is a group of who share a defining characteristic (typically subjects who experienced a common event in a selected time period, such as birth or graduation)." ()
String An attribute with data type String may have an arbitrary string value. For example, the possible values of a StudyDescription attribute are arbitrary. When the values array of a (String attribute, [values]) pair contains a single value, an SQL LIKE operator is used, and standard SQL syntax and semantics are supported. Thus ("StudyDescription", ["%SKULL%"]) will match any StudyDescription that contains "SKULL". When the values array of a (String attribute, [values]) pair contains more than one value, an SQL UNNEST operator is used, and standard SQL syntax and semantics are supported. See the documentation for details.
Categorical String An attribute with data type Categorical String will have one of a defined set of string values. For example, Modality is an attribute, and has possible values 'CT', 'MR', 'SR', etc. In this case, the values are defined by the DICOM specification. The defined values of other Categorical String attributes may be established by other entities. When the values array of a (Categorical String attribute, [values]) pair contains a single value, an SQL LIKE operator is used, and standard SQL syntax and semantics are supported. When the values array contains more than one value, an SQL UNNEST operator is used, and standard SQL syntax and semantics are supported. See the documentation for details.
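For example, using the attributes and values discussed above:

```python
# Single value: matched with SQL LIKE, so % wildcards are supported.
{"StudyDescription": ["%SKULL%"]}

# Multiple values: matched as a set of alternatives (SQL UNNEST).
{"Modality": ["CT", "MR"]}
```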
A GUID is a persistent identifier that can be resolved to the current location of the corresponding data. GUID persistence ensures that the data which the GUID represents can continue to be located and accessed even if it has been moved to a different hosting site. A GUID identifies a particular version of an IDC data object, and there is a GUID for every version of every DICOM instance and series in IDC hosted data. GUIDs are issued by the NCI Cancer Research Data Commons. This is a typical CRDC GUID: dg.4DFC/83fdfb25-ad87-4879-b0f3-b9850ef0b216. A GUID can be resolved at the CRDC DRS server (https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/) by appending the UUID to that URL, e.g. (formatting added to the curl response for clarity):
The Swagger UI can be used to see details about the syntax of each call, and it also provides an interface to test requests.
For a quick demonstration of the syntax of an API call, test a simple request such as /collections. You can experiment with this endpoint by clicking the ‘Try it out’ button.
The IDC API conforms to the OpenAPI specification, which "defines a standard, language-agnostic interface to RESTful APIs which allows both humans and computers to discover and understand the capabilities of the service without access to source code, documentation, or through network traffic inspection."
If you have feedback about the desired features of the IDC API, please let us know via the IDC support forum.
Swagger UI is a web-based interface that allows users to try out APIs and easily view their documentation. The IDC API Swagger UI is available online.
This notebook serves as an interactive tutorial for accessing the IDC API using Python.
Clone the repository to your local machine.
Example usage of the generated authorization is demonstrated by code in the Google Colab notebook.
You can use Looker Studio to build a custom dashboard for your own cohort, which will look like the screenshot below, in three relatively simple steps.
Open the dashboard template link, and click "Use template" to make a copy of the dashboard.
(see details in the documentation)
Looker Studio is a free tool that turns your data into informative, easy to read, easy to share, and fully customizable dashboards and reports.
ACCESS is a program supported by the US National Science Foundation (NSF) to provide educators with free and convenient access to advanced computational resources.
If you have a university email account, you can complete a relatively easy application process to receive an allocation of free credits, which you can then use to create pre-configured, GPU-enabled, cloud-based Linux virtual machines with a desktop interface available via the browser. You can use those machines, for example, for convenient access to an interactive application instance for experimenting with AI models, or for training DL networks.
Create an account and request an ACCESS allocation at this page: . There are 4 different levels, with each giving you a different number of “credits” that you can use to create your VM instances. Each of these levels requires you to submit a different application. For the Explore ACCESS allocation (lowest tier), you need to write a simple abstract to justify why you need these resources. Other tiers require more lengthy descriptions of what you’ll do with the ACCESS resources. In our experience, applications can be approved within a few days of submission. You can be a PI and have multiple Co-PIs with you on the project, so you can all access the Jetstream2 resources.
Once you get approved, your allocation is valid for a 12 month period, and you get half of the credits to start. To start using these credits you exchange them for Service Units (SUs) on different platforms. We experimented with the one called JetStream2, which provides an easy interface to cloud-based computing resources. If you want to use JetStream2, you will need to exchange your ACCESS credit allocation for JetStream2 SUs here: . Usually this exchange is approved within a few days, if not less.
Once you get the SUs, you can access the JetStream interface to configure and create VMs here: (you can learn more about available configurations from this documentation page: ).
Very easy to set up. As of writing, there is no similar product available from Google Cloud that would provide desktop access to a VM with comparable ease. AWS provides a similar offering, but we have yet to evaluate it.
it took ~7 days and ~5000 SUs to train the model (see summary in the slides) using a g3.large VM configuration
JetStream2:
ACCESS:
This section contains various pointers that may be helpful when working with Google Colab.
IDC Colab example notebooks are maintained in this repository:
Notebook demonstrating deployment and application of an abdominal structures segmentation tool to IDC data, developed for the course:
, contributed by , Mayo Clinic
, contributed by , Mayo Clinic
Notebooks contributed by , ISB-CGC, demonstrating the utility of BigQuery in correlative analysis of radiomics and genomics data:
Colab limitations:
Transferring data between Colab and Google Drive:
Google Colab Tips for Power Users:
Mounting GCS bucket using gcsfuse:
Almost-free Jupyter Notebooks on Google Cloud:
Get started with using BigQuery with IDC data following our tutorial.
This repository contains various examples of using BigQuery for searching DICOM metadata in IDC BQ tables (see also the example after this list).
Temporary tables:
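As a minimal sketch of such a metadata search from Python (the google-cloud-bigquery client is assumed to be installed and authenticated, and "my-project" is a placeholder for your billing project):

```python
from google.cloud import bigquery

# Search IDC DICOM metadata in the public dicom_all table.
client = bigquery.Client(project="my-project")
query = """
SELECT SeriesInstanceUID, Modality, collection_id
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality = 'MR'
LIMIT 10
"""
for row in client.query(query).result():
    print(row.SeriesInstanceUID, row.collection_id)
```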
Visual Studio Code has a useful feature that allows you to develop code on a remote VM from the convenience of your desktop. You can follow the steps below to configure your development environment for this task.
Visual Studio Code (with the Remote - SSH extension) installed on your computer
Google Cloud SDK (gcloud) installed on your computer
IDC API v1 has been released with the IDC Production release (v4).
Our current experience in using NCI Cloud Resources for cancer image analysis is summarized in the following preprint:
A motivation for using desktop applications like 3D Slicer on a VM is that they put computing power close to the data, so heavy network operations such as storage bucket or DICOM store access may be significantly faster than accessing the same resources from a remote machine. VMs are also highly configurable, so you can easily allocate the number of cores or the amount of memory needed for a given task. Note that you can even change these configurations so that, for example, you can shut down the machine, add a GPU and more memory, and then boot the same instance and pick up where you left off.
In addition, these desktops are persistent in the sense that you can start a task such as labeling data for a machine learning task, disconnect your ssh session, and reconnect later to pick up where you left off without needing to restart applications or reload data. This can be convenient when tending long-running computations, accessing your work from different computers, or working on a network that sometimes disconnects.
The instructions here are just a starting point. There are many cloud options available to manage access scopes for the service accounts, allocate disks, and configure other options.
You can launch a VM with a GPU in your project with a command like this in your local terminal (replace vm-name with a name for your machine):
Once it boots in about 90 seconds you can type:
You can launch a VM without a GPU in your project with a command like this in your local terminal (replace vm-name with a name for your machine):
Once it boots in about 90 seconds you can type:
On the remote machine run:
Each time you reboot the machine, run this:
This section contains various recipes that might be useful in utilizing GCP Compute Engine (GCE).
You are also encouraged to review the slides in the following presentation, which provides an introduction to GCE and shares some best practices for its usage.
Most of the same Linux commands, scripts, pipelines/workflows, imaging software packages and Docker containers that you run on your local machine can be executed on virtual machines on Google Cloud with some experimentation and fine-tuning.
A good way to estimate costs for running a workflow/pipeline on large data sets is to test them first on a small subset of data.
Example use-cases:
Broad’s popular variant calling pipeline, GATK, was also designed to be able to run on preemptible VMs.
By default, each virtual machine instance has a single boot persistent disk that contains the operating system. The default size is 10GB but can be adjusted up to 64TB in size. (Be careful! High costs here, spend wisely!)
Persistent disks are restricted to the zone where your instance is located.
Use persistent disks if you are running analyses that require low latency and high-throughput.
Unlike persistent disks, Cloud Storage buckets are not restricted to the zone where your instance is located.
Additionally, you can read and write data to a bucket from multiple instances simultaneously.
You can mount a GCS bucket to your VM instance when latency is not a priority or when you need to share data easily between multiple instances or zones. An example use-case: You want to slice thousands of bam files and save the resulting slices to share with a collaborator who has instances in another zone to use for downstream statistical analyses.
Once a manifest has been created, typically the next step is to load the files onto a VM for analysis. The easiest way to do this is to create your manifest in a BigQuery table and then use that table to direct the file loading onto a VM. This guide shows how this can be done.
You also need to ensure the machine has enough disk space. One of the checks in the script provided below is to calculate the total file load size. You might want to run that portion of the script and resize the disk as needed before actually doing the load.
Performs a query on the specified BigQuery manifest table and creates a local manifest file on your VM.
Performs a query that maps the GCS URLs of each file into DICOM hierarchical directory paths, and writes this out as a local TSV file on your VM.
Performs a query that calculates the total size of all the downloads, and reports back if there is sufficient space on the filesystem to continue.
Uses a multi-threaded bucket reader to pull the files from the GCS buckets and places them in the appropriate DICOM hierarchical directory.
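The script itself lives in the referenced repository; as a minimal sketch of the final download step (the manifest file name and the flat output layout here are illustrative, not the script's actual behavior):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from google.cloud import storage

client = storage.Client()  # uses the VM's service account credentials

def download(gcs_url: str, dest_dir: str = "dicom") -> None:
    # gs://<bucket>/<GUID>.dcm -> local file; the real script would map
    # each file into a DICOM hierarchical directory instead.
    bucket_name, blob_name = gcs_url[len("gs://"):].split("/", 1)
    dest = Path(dest_dir) / blob_name
    dest.parent.mkdir(parents=True, exist_ok=True)
    client.bucket(bucket_name).blob(blob_name).download_to_filename(str(dest))

# manifest.txt: one gs:// URL per line (produced by the manifest query step)
with open("manifest.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Multi-threaded bucket reads, as described in the steps above.
with ThreadPoolExecutor(max_workers=16) as pool:
    pool.map(download, urls)
```

Alternatively, gsutil -m cp can perform a similar multi-threaded copy from a list of URLs.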
To install the code on your VM and then set up the environment:
You then need to customize the settings in the script:
Finally, run the script:
The NCI Cloud Resources are components of the Cancer Research Data Commons (CRDC) that bring data and computational power together to enable cancer research and discovery.
Thiriveedhi, V. K., Krishnaswamy, D., Clunie, D., Pieper, S., Kikinis, R. & Fedorov, A. Cloud-based large-scale curation of medical imaging data using AI segmentation. Research Square (2024).
These instructions provide a reference example of how you can start up a traditional workstation desktop on a VM instance to run interactive applications like 3D Slicer and access the desktop via a conventional web browser. Two options are shown, either with or without a GPU. Note that GPUs are significantly more expensive, so only enable one if needed. For 3D Slicer the main benefit of a GPU is for rendering, so operations like DICOM processing and image segmentation are quite usable without a GPU. Even volume rendering is fairly usable if you choose the CPU rendering option. Other operations, such as training machine learning models, may benefit from an appropriate GPU.
Then you can open the desktop URL in your browser to get to your desktop.
Then you can open the desktop URL in your browser to get to your desktop.
This effort is a work in progress with a minimal desktop environment. Further refinement is expected and community contributions would be welcome! A description of the background and possible evolution of this work is available.
See the IDC API endpoint details in the Swagger UI.
W. Longabaugh. Introduction to Google Cloud Platform. Presented at MICCAI 2021.
The basics and best practices on how to launch virtual machines (VMs) are described in our documentation. NOTE: When launching VMs, please maintain the default firewall settings.
Compute Engine instances can run the public images for Linux and Windows Server that Google provides, as well as private custom images that you can create or import. Be careful as you spin up a machine, as larger machines cost you more. If you are not using a machine, shut it down. You can always restart it easily when you need it. Example use-case: You would like to run a Windows-only genomics software package on the TCGA data. You can create a Windows-based VM instance.
More details on how to deploy Docker containers on VMs are described in Google’s documentation:
There are different VM types depending on the sort of jobs you wish to execute. By default, when you create a VM instance, it remains active until you either stop it or delete it. The costs associated with VM instances are detailed here:
If you plan on running many short compute-intensive jobs (for example indexing and sorting thousands of large bam files), you can execute your jobs on preemptible virtual machines. They are 80% cheaper than regular instances.
Using preemptible VMs, researchers were able to quantify transcript levels on over 11K TCGA RNAseq samples for a total cost of $1,065.49. Tatlow PJ, Piccolo SR. A cloud-based workflow to quantify transcript-expression levels in public cancer compendia. Scientific Reports 6, 39259
The costs of Google Cloud computing can be estimated using the Google Cloud pricing calculator.
Because it is possible to see the history of GitHub postings, if a password or bearer token is part of software code (e.g., a notebook or Colaboratory), it will be permanently available on GitHub. This is a security risk! Do not put bearer tokens or other passwords into workbooks; instead, refer to them in the code and place them in a location not posted to GitHub (if you do post a secret to GitHub, it immediately becomes public, usable, and able to be stolen and used maliciously by others). If you do accidentally post one to GitHub: 1) immediately change passwords on your systems to remove the exposure created by the exposed password, 2) let those involved in the security of your system and data know, and 3) remedy your code base so future saves to GitHub do not include passwords or tokens.
The Google Cloud Platform offers a number of different storage options for your virtual machine instances:
Google Cloud Storage (GCS) buckets are the most flexible and economical storage option.
You can save objects to GCS buckets including images, videos, blobs and unstructured data. A comparison table detailing the current pricing of Google’s storage options can be found here:
The first step is to export your cohort manifest to a BigQuery table. You will want to copy this table into the project where you are going to run your VM. Do this using the Google BQ console, since the exported table can be accessed only using your personal credentials provided by your browser. The table copy living in the VM project will be readable by the service account running your VM.
Start up your VM. If you have many files, you will want to speed up the loading process by using a VM with multiple CPUs. Google describes the various machine types, but is not very specific about ingress bandwidth. However, in terms of published egress bandwidth, the larger machines certainly have more. Experimentation showed that an n2-standard-8 (8 vCPUs, 32 GB memory) machine could load 20,000 DICOM files in 2 minutes and 32 seconds, using 16 threads on 8 CPUs. That configuration reached a peak throughput of 68 MiB/s.
The script performs the following steps:
dicom_all_view is a BQ view, as indicated by the icon to the left of the table name. The dicom_all table is the result of running the query that defines the dicom_all_view.
Get links to the IDC API Swagger UI and IDC documentation
Returns a list of IDC data versions and activation dates
Returns a list of collections, and associated metadata, in the current IDC data version.
Returns a list of the analysis results, and associated metadata, in the current IDC data version
Returns a list of 'filters', organized by data source (BQ table), for the current IDC data version. An IDC cohort is defined by a 'filterset', a set of (filter,[values]) pairs, and the IDC version against which the filterset is applied. The returned data is grouped by source (the BQ table that contains the corresponding filter values). For each filter, its data type and units, when available, are reported.
Return a list of the values accepted for a 'categorical filter'. A categorical filter is a filter having a data type of 'Categorical String' or 'Categorical Number'.
Categorical filter whose values are to be returned
Return a list of queryable manifest fields.
IDC data version whose data is to be returned. If the version is 'current', the fields of the current IDC version are returned.
Returns a list of the user's cohorts and associated metadata. Authorization is required in order to access this endpoint.
Delete a specified cohort. Authorization is required in order to access this endpoint.
ID of cohort to be deleted.
Returns the next page of a /cohorts/manifest/preview request, when additional data is available.
The next_page token returned by a previous access of the /cohorts/manifest/preview endpoint. The token identifies the next page to be retrieved
The maximum number of rows to be returned. If the manifest contains additional rows, another 'next_page' token is returned.
Returns the next page of a /cohorts/manifest request, when additional data is available. Authorization is required in order to access this endpoint.
The next_page token returned by a previous access of the /cohorts/manifest endpoint. The token identifies the next page to be retrieved
The maximum number of rows to be returned. If the manifest contains additional rows, another 'next_page' token is returned.
Retrieve user's account information. Authorization is required in order to access this endpoint.
Create a cohort as defined by a specified 'filterset' and IDC version. Authorization is required in order to access this endpoint.
""
""
Delete a list of the user's cohorts. Authorization is required in order to access this endpoint.
Returns a manifest of a 'previewed cohort' as defined by a specified filterset. The filterset is always applied to the current IDC version. The metadata to be returned in the manifest is configurable. A previewed cohort is not saved in the user's IDC account.
If True, return counts of DICOM objects. Default: False.
If True, return size in bytes of instances in group. Default: False.
If True, return the BQ SQL for this query. Default: False.
Maximum number of rows to return. Default: 1000.
Returns a manifest of a cohort that was previously saved under the user's IDC account. The metadata to be returned in the manifest is configurable. Authorization is required in order to access this endpoint.
IDC Cohort ID
If True, return counts of DICOM objects. Default: False.
If True, return size in bytes of instances in group. Default: False.
If True, return the BQ SQL for this query. Default: False.
Maximum number of rows to return. Default: 1000.