Files and metadata
We gratefully acknowledge Google Public Data Program and the AWS Open Data Sponsorship Program that support public hosting of IDC-curated content, and cover out-of-cloud egress fees!
Let's start with the overall principles of how we organize data in IDC.
IDC brings you (as of v18) over 60 TB of publicly available DICOM images and image-derived content. We share those with you as DICOM files, and those DICOM files are available in cloud-based storage buckets - both in Google and AWS.
Sharing just the files, however, is not particularly helpful. With that much data, it is no longer practical to just download all of those files to later sort through them to select those you need.
Think of IDC as a library, where each file is a book. With that many books, it is not feasible to read them all, or even open each one to understand what is inside. Libraries are of little use without a catalog!
To provide you with a catalog of our data, along with the files, we maintain metadata that makes it possible to understand what is contained within files, and select the files that are of interest for your project, so that you can download just the files you need. We make that metadata available in BigQuery tables searchable using standard SQL.
In the following we describe organization of both the metadata catalog and the buckets containing the files. As you go over this documentation, please consider completing our "Getting started" tutorial - it will give you the opportunity to apply the knowledge you gain by reading this article while interacting with the data, and should help better understand this content.
BigQuery Tables and Views
Google BigQuery (BQ) is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using standard SQL queries.
IDC utilizes BigQuery tables to organize metadata accompanying the files we host. If you have never worked with BigQuery before, you need to understand the basics of data organization in BQ.
BQ tables are organized in BQ datasets. BQ datasets are not unlike folders on your computer, but contain tables related to each other instead of files. BQ datasets, in turn, are organized under Google Cloud projects. GCP projects can be thought of as containers that are managed by a particular organization. To continue with the file system analogy, think about projects as hard drives that contain folders.
This may be a good time for you to complete Part 1 of the IDC "Getting started" tutorial series, so that you are able to open the tables and datasets we will be discussing in the following paragraphs!
Let's map the aforementioned project-dataset-table hierarchy to the concrete locations that contain IDC data.
IDC BigQuery datasets
All of the IDC tables are stored under the bigquery-public-data
project. That project is managed by Google Public Datasets Program, and contains many public BQ datasets, beyond those maintained by IDC.
All of the IDC tables are organized into datasets by data release version. If you complete the tutorial mentioned above, open the BQ console, and scroll down the list of datasets, you will find those that are named starting with the idc_v
prefix - those are IDC datasets.
Following the prefix, you will find the number that corresponds to the IDC data release version. IDC data releases version numbers start from 1 and are incremented by one for each subsequent release. As of writing this, the most recent version of IDC is 16, and you can find dataset idc_v16
corresponding to this version.
In addition to idc_v16
you will find a dataset named idc_v16_clinical
. That dataset contains clinical data accompanying IDC collections. We started clinical data ingestion in IDC v11. If you want to learn more about the organization and searching of clinical data, take a look at the clinical data documentation.
Finally, you will also see two special datasets: idc_current
and idc_current_clinical
. Those two datasets are essentially aliases, or links, to the versioned datasets corresponding to the latest release of IDC data.
If you want to explore the latest content of IDC - use current
datasets.
If you want to make sure your queries and data selection are reproducible - always use the version numbered datasets!
IDC BigQuery tables
Before we dive into discussing the individual tables maintained by IDC, there is just one more BigQuery-specific concept you need to learn: the view. BigQuery view is a table that is defined by an SQL query that is run every time you query the view (you can read more about BQ views in this article).
BQ views can be very handy when you want to simplify your queries by factoring out the part of the query that is often reused. But a key disadvantage of BQ views over tables is the reduced performance and increased cost due to re-running the query each time you query the view.
As we will discuss further, most of the tables maintained by IDC are created by joining and/or post-processing other tables. Because of this we rely heavily on BQ views to improve transparency of the provenance of those "derived" tables. BQ views can be easily distinguished from the tables in a given dataset by a different icon. IDC datasets also follow a convention that all views in the versioned datasets include suffix _view
in the name, and are accompanied by the result of running the query used by the view in a table that has the same name sans the _view
suffix. See the figure below for an illustration of this convention.
If you are ever curious (and you should be, at least once in a while!) about the queries behind individual views, you can click on the view in the BQ console, and see the query in the "Details" tab. Try this out yourself to check the query for dicom_all_view
Now that we reviewed the main concepts behind IDC tables organization, it is time to explain the sources of metadata contained in those tables. Leaving _clinical
datasets aside, IDC tables are populated from one of the two sources:
DICOM metadata extracted from the DICOM files hosted by IDC, and various derivative tables that simplify access to specific DICOM metadata items;
collection-level and auxiliary metadata, which is not stored in DICOM tags, but is either received by IDC from other sources, or is populated by IDC as part of data curation (these include Digital Object Identifiers, description of the collections, hashsums, etc).
The set of BQ tables and views has grown over time. The enumeration below documents the BQ tables and views as of IDC v14. Some of these tables will not be found in earlier IDC BigQuery datasets.
dicom_metadata
dicom_metadata
Table in BigQuery: dicom_metadata
Each row in the dicom_metadata
table holds the DICOM metadata of an instance in the corresponding IDC version. There is a single row for each DICOM instance in the corresponding IDC version, and the columns correspond to the DICOM attributes encountered in the data across all of the ingested instances.
IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions of how DICOM attributes of various types are converted into BQ form are covered in the Understanding the BigQuery DICOM schema Google Healthcare API documentation article.
dicom_metadata
table contains DICOM metadata extract from the files included in the given IDC data release. The amount and variety of the DICOM files grows with the new releases, and the schema of this table reflects the organization of the metadata in each IDC release. Non-sequence attributes, such as Modality
or SeriesInstanceUID
, once encountered in any one file will result in the corresponding column being introduced to the table schema (i.e., if we have column X
in IDC release 11, in all likelihood it will also be present in all of the subsequent releases).
Sequence DICOM attributes, however, may have content that is highly variable across different DICOM instances (especially in Structured Reports). Those attributes will map to STRUCT
BQ SQL type, and it is not unusual to see drastic differences in the corresponding columns of the table between different releases.
dicom_metadata
can be used to conduct detailed explorations of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal. Note that the dicom_all
table, described below, is probably a better choice for such explorations.
Due to the existing limitations of Google Healthcare API, not all of the DICOM attributes are extracted and are available in BigQuery tables. Specifically:
sequences that have more than 15 levels of nesting are not extracted (see https://cloud.google.com/bigquery/docs/nested-repeated) - we believe this limitation does not affect the data stored in IDC
sequences that contain around 1MiB of data are dropped from BigQuery export and RetrieveMetadata output currently. 1MiB is not an exact limit, but it can be used as a rough estimate of whether or not the API will drop the tag (this limitation was not documented as of writing this) - we know that some of the instances in IDC will be affected by this limitation. The fix for this limitation is targeted for sometime in 2021, according to the communication with Google Healthcare support.
auxiliary_metadata
auxiliary_metadata
Table in BigQuery: auxiliary_metadata
This table defines the contents of the corresponding IDC version. There is a row for each instance in the version. We group the attributes for convenience:
Collection attributes:
tcia_api_collection_id:
The ID, as accepted by the TCIA API, of the original data collection containing this instance (will be Null for collections not sourced from TCIA)idc_webapp_collection_id:
The ID, as accepted by the IDC web app, of the original data collection containing this instancecollection_id:
The ID, as accepted by the IDC web app. Duplicate ofidc_webapp_collection_id
collection_timestamp:
Datetime when the IDC data in the collection was last revisedcollection_hash
: md5 hash of the of this version of the collection containing this instancecollection_init_idc_version:
The IDC version in which the collection containing this instance first appearedcollection_revised_idc_version:
The IDC version in which this version of the collection containing this instance first appeared
Patient attributes:
submitter_case_id:
The Patient ID assigned by the submitter of this data. This is the same as the DICOM PatientIDidc_case_id:
IDC generated UUID that uniquely identifies the patient containing this instanceThis is needed because DICOM PatientIDs are not required to be globally unique
patient_hash
: md5 hash of this version of the patient/case containing this instancepatient_init_idc_version:
The IDC version in which the patient containing this instance first appearedpatient_revised_idc_version:
The IDC version in which this version of the patient/case containing this instance first appeared
Study attributes:
StudyInstanceUID:
DICOM UID of the study containing this instancestudy_uuid:
IDC assigned UUID that identifies a version the the study containing this instance.study_instances:
The number instances in the study containing this instancestudy_hash
: md5 hash of the data in this version of the study containing this instancestudy_init_idc_version:
The IDC version in which the study containing this instance first appearedstudy_revised_idc_version:
The IDC version in which this version of the study containing this instance first appeared
Series attributes:
SeriesInstanceUID:
DICOM UID of the series containing this instanceseries_uuid:
IDC assigned UUID that identifies the version of the series containing this instancesource_doi:
The DOI of an information page corresponding to the original data collection or analysis results that is the source of this instancesource_url:
The URL of an information page that describes the original collection or analysis result that is the source of this instanceseries_instances:
The number of instances in the series containing this instanceseries_hash
: md5 hash of the data in the this version of the series containing this instanceaccess:
Collection access status: 'Public' or 'Limited'. (Currently all data is 'Public')series_init_idc_version:
The IDC version in which the series containing this instance first appearedseries_revised_idc_version:
The IDC version in which this version of the series containing this instance first appeared
Instance attributes:
SOPInstanceUID:
DICOM UID of this instance.instance_uuid:
IDC assigned UUID that identifies the version of this instance.gcs_url:
The GCS URL of a file containing the version of this instance that is identified by thisseries_uuid/instance_uuid
aws_url:
The AWS URL of a file containing the version of this instance that is identified by thisseries_uuid/instance_uuid
instance_hash
: the md5 hash of this version of this instanceinstance_size:
the size, in bytes, of this version of this instanceinstance_init_idc_version:
The IDC version in which this instance first appearedinstance_revised_idc_version:
The IDC version in which this version of this instance first appearedlicense_url:
The URL of a web page that describes the license governing this version of this instancelicense_long_name:
A long form name of the license governing this version of this instancelicense_short_name:
A short form name of the license governing this version of this instance
mutable_metadata
mutable_metadata
Table in BigQuery: mutable_metadata
Some non-DICOM metadata may change over time. This includes the GCS and AWS URLs of instance data, the accessibility of each instance and the URL of an instance's associated description page. BigQuery metadata tables such as the auxiliary_metadata and dicom_all tables are never revised even when such metadata changes. However, tables in the datasets of previous IDC versions can be joined with the mutable_metadata table to obtain the current values of these mutable attributes.
The table has one row for each version of each instances:
crdc_instance_uuid
: The uuid of an instance versioncrdc_series_uuid
: The uuid of a series version that contains this instance versioncrdc_study_uuid
: The uuid of a study version that contains the series versiongcs_url
: URL to the Google Cloud Storage (GCS) object containing this instance versionaws_url
: URL to the Amazon Web Services (AWS) object containing this instance version`access: Current access status of this instance (Public or Limited)
source_url
: The URL of a page that describes the original collection or analysis result that includes this instancesource_doi
: The DOI of a page that describes the original collection or analysis result that includes this instance
original_collections_metadata
original_collections_metadata
Table in BigQuery: original_collections_metadata
This table is comprised of IDC data collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in this table at TCIA. One row per collection:
tcia_api_collection_id:
The collection ID as is accepted by the TCIA APtcia_wiki_collection_id:
The collection ID as on the TCIA wiki pageidc_webapp_collection_id:
The collection ID as accepted by the IDC web appProgram:
The program to which this collection belongsUpdated:
Most recent update date reported by the collection sourceStatus:
Collection status: "Ongoing" or "Complete"Access:
Collection access conditions: "Limited" or "Public"ImageType:
Enumeration of image types/modalities in the collectionSubjects:
Number of subjects in the collectionDOI:
DOI that can be resolved at doi.org to the TCIA wiki page for this collectionURL:
URL of an information page for this collectionCancerType:
Collection source(s) assigned cancer type of this collectionSupportingData:
Type(s) of additional data availableSpecies:
Species of collection subjectsLocation:
Body location that was studiedDescription:
Description of the collection (HTML format)license_url:
The URL of a web page that describes the license governing this collectionlicense_long_name:
A long form name of the license governing this collectionlicense_short_name:
A short form name of the license governing this collection
analysis_results_metadata
analysis_results_metadata
Table in BigQuery: analysis_results_metadata
Metadata for the TCIA analysis results hosted by IDC, for the most part corresponding to the content available in this table at TCIA. One row per analysis result:
ID:
Results IDTitle:
Descriptive titleDOI:
DOI that can be resolved at doi.org to the TCIA wiki page for this analysis resultCancerType:
TCIA assigned cancer type of this analysis resultLocation:
Body location that was studiedSubjects:
Number of subjects in the analysis resultCollections:
Original collections studiedAnalysisArtifactsonTCIA:
Type(s) of analysis artifacts generatedUpdated:
Data when results were last updatedlicense_url:
The URL of a web page that describes the license governing this collectionlicense_long_name:
A long form name of the license governing this collectionlicense_short_name:
A short form name of the license governing this collectiondescription:
Description of analysis result
version_metadata
version_metadata
Table in BigQuery: version_metadata
Metadata for each IDC version, one row per version:
idc_version
: IDC version numberversion_hash
: MD5 hash of hashes of collections in this versionversion_timestamp
: Version creation timestamp
The following tables and views consist of metadata derived from one or more other IDC tables tables for convenience of the user. For each such table, <table_name>
, there is also a corresponding view, <table_name>_view
, that, when queried, generates an equivalent table. These views are intended as a reference; each view's SQL is available to be used for further investigation.
Several of these tables/views are discussed more completely here.
dicom_all
, dicom_all_view
dicom_all
, dicom_all_view
Table in BigQuery: dicom_all
Table in BigQuery: dicom_all_view
All columns from dicom_metadata
together with selected date from the auxiliary_metadata
, original_collections_metadata
, and analysis_results_metadata
tables.
segmentations
, segmentations_view
segmentations
, segmentations_view
Table in BigQuery: segmentations
Table in BigQuery: segmentations_view
This table is derived from dicom_all
to simplify access to the attributes of DICOM Segmentation objects available in IDC. Each row in this table corresponds to one DICOM Segmentation instance segment.
measurement_groups
, measurement_groups_view
measurement_groups
, measurement_groups_view
Table in BigQuery: measurement_groups
Table in BigQuery: measurement_groups_view
This table is derived from dicom_all
to simplify access to the measurement groups encoded in DICOM Structured Report TID 1500 objects available in IDC. Specifically, this table contains measurement groups corresponding to the "Measurement group" content item in the TID 1500 Measurement report DICOM SR objects.
Each row corresponds to one TID1500 measurement group.
qualitative_measurements
, qualitative_measurements_view
qualitative_measurements
, qualitative_measurements_view
Table in BigQuery: qualitative_measurements
Table in BigQuery: qualitative_measurements_view
This table is derived from dicom_all
to simplify access to the qualitative measurements in DICOM SR TID1500 objects. It contains coded evaluation results extracted from the DICOM SR TID1500 objects. Each row in this table corresponds to a single qualitative measurement extracted.
quantitative_measurements
, quantitative_measurements_view
quantitative_measurements
, quantitative_measurements_view
Table in BigQuery: quantitative_measurements
Table in BigQuery: quantitative_measurements_view
This table is derived from dicom_all
to simplify access to the quantitative measurements in DICOM SR TID1500 objects. It contains quantitative evaluation results extracted from the DICOM SR TID1500 objects. Each row in this table corresponds to a single quantitative measurement extracted.
dicom_metadata_curated
, dicom_metadata_curated_view
dicom_metadata_curated
, dicom_metadata_curated_view
Table in BigQuery: dicom_metadata_curated
Table in BigQuery: dicom_metadata_curated_view
Curated values of DICOM metadata extracted from dicom_metadata
.
dicom_metadata_curated_series_level
, dicom_metadata_curated_series_level_view
dicom_metadata_curated_series_level
, dicom_metadata_curated_series_level_view
Table in BigQuery: dicom_metadata_curated_series_lev
Table in BigQuery: dicom_metadata_curated_series_level_view
Curated columns from dicom_metadata
that have been aggregated/cleaned up to describe content at the series level. Each row in this table corresponds to a DICOM instance in IDC. The columns are curated by defining queries that apply transformations to the original values of DICOM attributes.
idc_pivot_v<idc version>
idc_pivot_v<idc version>
A view that is the basis for the queries performed by the IDC web app.
Collection-specific BigQuery tables
Most clinical data is found in the idc_v<idc_version>_clinical datasets. However, a few tables of clinical data are found in the idc_v<idc_version> datasets.
TCGA
The following tables contain TCGA-specific metadata:
tcga_biospecimen_rel9:
biospecimen metadatatcga_clinical_rel9:
clinical metadata
NLST
IDC hosts a subset of the NLST clinical data, which was cleared for public sharing. If you need the full clinical data, please visit the Cancer Data Access System (CDAS) system.
The following tables contain NLST specific metadata. The detailed schema of those tables is available from the TCIA NLST collection page.
nlst_canc
: "Lung Cancer"nlst_ctab
: "SCT Abnormalities"nlst_ctabc
: "SCT Comparison Abnormalities"nlst_prsn
: "Participant"nlst_screen
: "SCT Screening"
Storage Buckets
All IDC DICOM file data for all IDC data versions across all of the collections hosted by IDC are maintained in Google Cloud Storage (GCS) and AWS S3 (S3) buckets. Currently all DICOM files are maintained in buckets that allow for free egress within or out of the cloud. This is enabled through the partnership of IDC with Google Public Data Program and the AWS Open Data Sponsorship Program.
Note that only (versions of) DICOM instances have associated files (as discussed in DICOM Data Model. There are no per-series or per-study files.
The object namespace is hierarchical, where, for each version of a DICOM instance having instance UUID <instance_uuid> in a version of a series version having UUID <series_uuid>, the file name is:
<series_uuid>/<instance_uuid>.dcm
Corresponding files have the same object name in GCS and S3, though the name of the containing buckets will be different.
UIDs and UUIDs explained with an example
Consider an instance in the CPTAC-CM collection that has this SOPInstanceUID
: 1.3.6.1.4.1.5962.99.1.171941254.777277241.1640849481094.35.0
\
It is in a series having this SeriesInstanceUID
:
1.3.6.1.4.1.5962.99.1.171941254.777277241.1640849481094.2.0
The instance and series were added to the IDC Data set in IDC version 7. At that point, the instance was assigned UUID:
5dce0cf0-4694-4dff-8f9e-2785bf179267
and the series was assigned this UUID:
e127d258-37c2-47bb-a7d1-1faa7f47f47a
In IDC version 10, a revision of this instance was added (keeping its original SOPInstanceUID
), and assigned this UUID:
21e5e9ce-01f5-4b9b-9899-a2cbb979b542
Because this instance was revised, the series containing it was implicitly revised. The revised series was thus issued a new UUID:
ee34c840-b0ca-4400-a6c8-c605cef17630
Thus, the initial version of this instance has this file name:
e127d258-37c2-47bb-a7d1-1faa7f47f47a/5dce0cf0-4694-4dff-8f9e-2785bf179267.dcm
and the revised version of the instance has the this file name:
ee34c840-b0ca-4400-a6c8-c605cef17630/21e5e9ce-01f5-4b9b-9899-a2cbb979b542.dcm
Both versions of the instance are in both AWS and GCS buckets.
Note that GCS and AWS bucket names are different. In fact, DICOM instance data is distributed across multiple buckets in both GCS and AWS. We will discuss obtaining GCS and AWS URLs more a little later.
It is possible that a series is revised, but one or more instances in the series are not revised. For example if a single instance in a series (assume the series has a uuid <series_uuid_old>) is revised, that instance gets a new UUID, and there is implicitly a new version of the series, which gets a new UUID (call it <series_uuid_new>). If an instance that is not revised has UUID <invariant_instance_uuid>, then its corresponding file in cloud storage will the have name:
<series_uuid_old>/<invariant_instance_uuid>.dcm
in the "old" series. But, because that same instance version is in the revised series, there must also be a file in cloud storage named:
<series_uuid_new>/<invariant_instance_uuid>.dcm
The result will be two distinct but identical files.
Utilities like gsutil, s3 and s5cmd "understand" the implied hierarchy in these file names. Thus the series UUID now acts like the name of a directory that contains all the instance versions in the series version:
and similarly for AWS buckets, thus making it easy to transfer all instances in a series from the cloud.
Because file names are more or less opaque, the user will not typically select files by listing the contents of a bucket. Instead, one should use either the IDC Portal or IDC BigQuery tables to identify items of interest and, then, generate a manifest of objects that can be passed to a utility like s5cmd.
DICOM Stores
IDC utilizes a single Google Healthcare DICOM store to host all of the instances in the current IDC version. That store, however, is primarily intended to support visualization of the data using the OHIF and Slim viewers. At this time, we do not support access of the hosted data via DICOMWeb interface by IDC users. See more details in the discussion here, and please comment about your use case if you have a need to access data via the DICOMweb interface.
Last updated