BigQuery tables
BQ tables are organized in BQ datasets. BQ datasets are not unlike folders on your computer, but contain tables related to each other instead of files. BQ datasets, in turn, are organized under Google Cloud projects. GCP projects can be thought of as containers that are managed by a particular organization. To continue with the file system analogy, think about projects as hard drives that contain folders.
This may be a good time for you to complete Part 1 of the IDC "Getting started" tutorial series, so that you are able to open the tables and datasets we will be discussing in the following paragraphs!
Let's map the aforementioned project-dataset-table hierarchy to the concrete locations that contain IDC data.
IDC BigQuery datasets
All of the IDC tables are stored under the bigquery-public-data project. That project is managed by the Google Public Datasets Program, and contains many public BQ datasets beyond those maintained by IDC.
All of the IDC tables are organized into datasets by data release version. If you complete the tutorial mentioned above, open the BQ console, and scroll down the list of datasets, you will find those with names starting with the idc_v prefix - those are IDC datasets.

Following the prefix, you will find the number that corresponds to the IDC data release version. IDC data release version numbers start from 1 and are incremented by one for each subsequent release. As of this writing, the most recent version of IDC is 16, and you can find the dataset idc_v16 corresponding to this version.
In addition to idc_v16, you will find a dataset named idc_v16_clinical. That dataset contains clinical data accompanying IDC collections. We started clinical data ingestion in IDC v11. If you want to learn more about the organization and searching of clinical data, take a look at the clinical data documentation.
Finally, you will also see two special datasets: idc_current and idc_current_clinical. Those two datasets are essentially aliases, or links, to the versioned datasets corresponding to the latest release of IDC data.
If you want to explore the latest content of IDC - use the current datasets.
If you want to make sure your queries and data selection are reproducible - always use the version numbered datasets!
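The project-dataset-table hierarchy described above can be sketched as a small helper that assembles a fully qualified table name. This is only an illustration: the function name idc_table_id and its parameters are hypothetical, and the dataset names follow the idc_v<version>, idc_current and _clinical conventions described in this section.

```python
from typing import Optional

PROJECT = "bigquery-public-data"  # project that hosts the IDC datasets


def idc_table_id(table: str, version: Optional[int] = None,
                 clinical: bool = False) -> str:
    """Return a `project.dataset.table` identifier for an IDC table.

    version=None selects the idc_current alias (latest release);
    an integer pins a specific release, which keeps queries reproducible.
    """
    dataset = "idc_current" if version is None else f"idc_v{version}"
    if clinical:
        dataset += "_clinical"
    return f"{PROJECT}.{dataset}.{table}"


# Exploring the latest content versus pinning a release:
latest = idc_table_id("dicom_all")
pinned = idc_table_id("dicom_all", version=16)
print(latest)
print(pinned)
```

Pinning the version in the dataset name is what makes a query return the same rows after a new IDC release is published.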
IDC BigQuery tables
Before we dive into discussing the individual tables maintained by IDC, there is just one more BigQuery-specific concept you need to learn: the view. A BigQuery view is a virtual table defined by an SQL query that is run every time you query the view (you can read more about BQ views in this article).
BQ views can be very handy when you want to simplify your queries by factoring out the part of the query that is often reused. But a key disadvantage of BQ views compared to tables is reduced performance and increased cost, because the underlying query is re-run each time you query the view.
As we will discuss further, most of the tables maintained by IDC are created by joining and/or post-processing other tables. Because of this we rely heavily on BQ views to improve transparency of the provenance of those "derived" tables. BQ views can be easily distinguished from the tables in a given dataset by a different icon. IDC datasets also follow a convention that all views in the versioned datasets include the suffix _view in the name, and are accompanied by the result of running the query used by the view in a table that has the same name sans the _view suffix. See the figure below for an illustration of this convention.

dicom_all_view is a BQ view, as indicated by the icon to the left of the table name. The dicom_all table is the result of running the query that defines dicom_all_view. If you are ever curious (and you should be, at least once in a while!) about the queries behind individual views, you can click on the view in the BQ console, and see the query in the "Details" tab. Try this out yourself to check the query for dicom_all_view.
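If you prefer to inspect view definitions with a query rather than in the console, the sketch below shows one possible approach. It relies on BigQuery's INFORMATION_SCHEMA.VIEWS metadata view, which exposes a view_definition column; whether your account is allowed to query it for a public dataset is an assumption you should verify. Both function names are hypothetical.

```python
def list_views_sql(dataset: str = "bigquery-public-data.idc_v16") -> str:
    """Build a query listing every view in a dataset with its defining SQL."""
    return "\n".join([
        "SELECT table_name, view_definition",
        f"FROM `{dataset}.INFORMATION_SCHEMA.VIEWS`",
        "ORDER BY table_name",
    ])


def materialized_name(view_name: str) -> str:
    """Apply the IDC naming convention: drop the `_view` suffix."""
    suffix = "_view"
    if not view_name.endswith(suffix):
        raise ValueError(f"{view_name!r} does not follow the _view convention")
    return view_name[: -len(suffix)]


print(list_views_sql())
print(materialized_name("dicom_all_view"))  # name of the accompanying table
```

The resulting string can be pasted into the BQ console, or submitted through a client library such as google-cloud-bigquery.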

Now that we have reviewed the main concepts behind the organization of IDC tables, it is time to explain the sources of metadata contained in those tables. Leaving the _clinical datasets aside, IDC tables are populated from one of two sources:
DICOM metadata extracted from the DICOM files hosted by IDC, and various derivative tables that simplify access to specific DICOM metadata items;
collection-level and auxiliary metadata, which is not stored in DICOM tags, but is either received by IDC from other sources, or is populated by IDC as part of data curation (these include Digital Object Identifiers, description of the collections, hashsums, etc).
The set of BQ tables and views has grown over time. The enumeration below documents the BQ tables and views as of IDC v14. Some of these tables will not be found in earlier IDC BigQuery datasets.
dicom_metadata
Each row in the dicom_metadata table holds the DICOM metadata of an instance in the corresponding IDC version. There is a single row for each DICOM instance in the corresponding IDC version, and the columns correspond to the DICOM attributes encountered in the data across all of the ingested instances.
IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. The conventions for converting DICOM attributes of various types into BQ form are covered in the "Understanding the BigQuery DICOM schema" article of the Google Healthcare API documentation.
The dicom_metadata table contains DICOM metadata extracted from the files included in the given IDC data release. The amount and variety of the DICOM files grows with new releases, and the schema of this table reflects the organization of the metadata in each IDC release. Non-sequence attributes, such as Modality or SeriesInstanceUID, once encountered in any one file, will result in the corresponding column being introduced to the table schema (i.e., if we have column X in IDC release 11, in all likelihood it will also be present in all of the subsequent releases).
Sequence DICOM attributes, however, may have content that is highly variable across different DICOM instances (especially in Structured Reports). Those attributes map to the STRUCT BQ SQL type, and it is not unusual to see drastic differences in the corresponding columns of the table between different releases.
dicom_metadata can be used to conduct detailed explorations of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal. Note that the dicom_all table, described below, is probably a better choice for such explorations.
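As a minimal sketch of such an exploration, the snippet below builds a query that counts series per modality. It uses only the Modality and SeriesInstanceUID columns mentioned above and assumes release 16 by default; the function name is hypothetical.

```python
def series_per_modality_sql(version: int = 16) -> str:
    """Build a query counting distinct series per DICOM Modality."""
    table = f"bigquery-public-data.idc_v{version}.dicom_metadata"
    return "\n".join([
        "SELECT",
        "  Modality,",
        # One row per instance, so series must be counted with DISTINCT.
        "  COUNT(DISTINCT SeriesInstanceUID) AS series_count",
        f"FROM `{table}`",
        "GROUP BY Modality",
        "ORDER BY series_count DESC",
    ])


print(series_per_modality_sql())
```

Because the table has one row per instance, a plain COUNT(*) would count instances rather than series, which is why the sketch uses COUNT(DISTINCT ...).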
Due to the existing limitations of Google Healthcare API, not all of the DICOM attributes are extracted and are available in BigQuery tables. Specifically:
sequences that have more than 15 levels of nesting are not extracted (see https://cloud.google.com/bigquery/docs/nested-repeated) - we believe this limitation does not affect the data stored in IDC
sequences that contain around 1MiB of data are dropped from BigQuery export and RetrieveMetadata output currently. 1MiB is not an exact limit, but it can be used as a rough estimate of whether or not the API will drop the tag (this limitation was not documented as of writing this) - we know that some of the instances in IDC will be affected by this limitation. The fix for this limitation is targeted for sometime in 2021, according to the communication with Google Healthcare support.
auxiliary_metadata
This table defines the contents of the corresponding IDC version. There is a row for each instance in the version. We group the attributes for convenience:
Collection attributes:
tcia_api_collection_id: The ID, as accepted by the TCIA API, of the original data collection containing this instance (will be Null for collections not sourced from TCIA)
idc_webapp_collection_id: The ID, as accepted by the IDC web app, of the original data collection containing this instance
collection_id: The ID, as accepted by the IDC web app. Duplicate of idc_webapp_collection_id
collection_timestamp: Datetime when the IDC data in the collection was last revised
collection_hash: md5 hash of this version of the collection containing this instance
collection_init_idc_version: The IDC version in which the collection containing this instance first appeared
collection_revised_idc_version: The IDC version in which this version of the collection containing this instance first appeared
Patient attributes:
submitter_case_id: The Patient ID assigned by the submitter of this data. This is the same as the DICOM PatientID
idc_case_id: IDC generated UUID that uniquely identifies the patient containing this instance. This is needed because DICOM PatientIDs are not required to be globally unique
patient_hash: md5 hash of this version of the patient/case containing this instance
patient_init_idc_version: The IDC version in which the patient containing this instance first appeared
patient_revised_idc_version: The IDC version in which this version of the patient/case containing this instance first appeared
Study attributes:
StudyInstanceUID: DICOM UID of the study containing this instance
study_uuid: IDC assigned UUID that identifies a version of the study containing this instance
study_instances: The number of instances in the study containing this instance
study_hash: md5 hash of the data in this version of the study containing this instance
study_init_idc_version: The IDC version in which the study containing this instance first appeared
study_revised_idc_version: The IDC version in which this version of the study containing this instance first appeared
Series attributes:
SeriesInstanceUID: DICOM UID of the series containing this instance
series_uuid: IDC assigned UUID that identifies the version of the series containing this instance
source_doi: The DOI of an information page corresponding to the original data collection or analysis results that is the source of this instance
source_url: The URL of an information page that describes the original collection or analysis result that is the source of this instance
series_instances: The number of instances in the series containing this instance
series_hash: md5 hash of the data in this version of the series containing this instance
access: Collection access status: 'Public' or 'Limited'. (Currently all data is 'Public')
series_init_idc_version: The IDC version in which the series containing this instance first appeared
series_revised_idc_version: The IDC version in which this version of the series containing this instance first appeared
Instance attributes:
SOPInstanceUID: DICOM UID of this instance
instance_uuid: IDC assigned UUID that identifies the version of this instance
gcs_url: The GCS URL of a file containing the version of this instance that is identified by this series_uuid/instance_uuid
aws_url: The AWS URL of a file containing the version of this instance that is identified by this series_uuid/instance_uuid
instance_hash: The md5 hash of this version of this instance
instance_size: The size, in bytes, of this version of this instance
instance_init_idc_version: The IDC version in which this instance first appeared
instance_revised_idc_version: The IDC version in which this version of this instance first appeared
license_url: The URL of a web page that describes the license governing this version of this instance
license_long_name: A long form name of the license governing this version of this instance
license_short_name: A short form name of the license governing this version of this instance
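To illustrate how these attributes can be combined, here is a minimal sketch of a query that summarizes each collection by series count, instance count and total size. It uses only columns listed above (collection_id, SeriesInstanceUID, instance_size); the function name and the default release number are assumptions made for the example.

```python
def collection_sizes_sql(version: int = 16) -> str:
    """Build a query summarizing each collection in auxiliary_metadata."""
    table = f"bigquery-public-data.idc_v{version}.auxiliary_metadata"
    return "\n".join([
        "SELECT",
        "  collection_id,",
        "  COUNT(DISTINCT SeriesInstanceUID) AS series_count,",
        # The table has one row per instance, so COUNT(*) counts instances.
        "  COUNT(*) AS instance_count,",
        # instance_size is documented as a size in bytes.
        "  SUM(instance_size) AS total_bytes",
        f"FROM `{table}`",
        "GROUP BY collection_id",
        "ORDER BY total_bytes DESC",
    ])


print(collection_sizes_sql())
```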
mutable_metadata
Some non-DICOM metadata may change over time. This includes the GCS and AWS URLs of instance data, the accessibility of each instance and the URL of an instance's associated description page. BigQuery metadata tables such as the auxiliary_metadata and dicom_all tables are never revised even when such metadata changes. However, tables in the datasets of previous IDC versions can be joined with the mutable_metadata table to obtain the current values of these mutable attributes.
The table has one row for each version of each instance:
crdc_instance_uuid: The uuid of an instance version
crdc_series_uuid: The uuid of a series version that contains this instance version
crdc_study_uuid: The uuid of a study version that contains the series version
gcs_url: URL to the Google Cloud Storage (GCS) object containing this instance version
aws_url: URL to the Amazon Web Services (AWS) object containing this instance version
access: Current access status of this instance (Public or Limited)
source_url: The URL of a page that describes the original collection or analysis result that includes this instance
source_doi: The DOI of a page that describes the original collection or analysis result that includes this instance
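The join described above might look like the following sketch, which looks up current storage URLs for the instances of an older release. Two things here are assumptions rather than documented facts: that mutable_metadata is read from the idc_current dataset, and that instance_uuid in auxiliary_metadata matches crdc_instance_uuid in mutable_metadata. The function name is hypothetical.

```python
def current_urls_sql(pinned_version: int) -> str:
    """Build a query joining a pinned release to mutable_metadata."""
    pinned = f"bigquery-public-data.idc_v{pinned_version}.auxiliary_metadata"
    # Assumption: the up-to-date mutable_metadata lives in idc_current.
    mutable = "bigquery-public-data.idc_current.mutable_metadata"
    return "\n".join([
        "SELECT",
        "  aux.SOPInstanceUID,",
        "  aux.instance_uuid,",
        "  mm.gcs_url AS current_gcs_url,",
        "  mm.aws_url AS current_aws_url",
        f"FROM `{pinned}` AS aux",
        f"JOIN `{mutable}` AS mm",
        # Assumption: both columns hold the same instance-version UUID.
        "  ON aux.instance_uuid = mm.crdc_instance_uuid",
    ])


print(current_urls_sql(11))
```

Keeping the versioned table on the left side of the join preserves the reproducible selection of instances, while only the mutable attributes are taken from the current data.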
original_collections_metadata
This table contains collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in this table at TCIA. One row per collection:
tcia_api_collection_id: The collection ID as accepted by the TCIA API
tcia_wiki_collection_id: The collection ID as on the TCIA wiki page
idc_webapp_collection_id: The collection ID as accepted by the IDC web app
Program: The program to which this collection belongs
Updated: Most recent update date reported by the collection source
Status: Collection status: "Ongoing" or "Complete"
Access: Collection access conditions: "Limited" or "Public"
ImageType: Enumeration of image types/modalities in the collection
Subjects: Number of subjects in the collection
DOI: DOI that can be resolved at doi.org to the TCIA wiki page for this collection
URL: URL of an information page for this collection
CancerType: Collection source(s) assigned cancer type of this collection
SupportingData: Type(s) of additional data available
Species: Species of collection subjects
Location: Body location that was studied
Description: Description of the collection (HTML format)
license_url: The URL of a web page that describes the license governing this collection
license_long_name: A long form name of the license governing this collection
license_short_name: A short form name of the license governing this collection
analysis_results_metadata
Metadata for the TCIA analysis results hosted by IDC, for the most part corresponding to the content available in this table at TCIA. One row per analysis result:
ID: Results ID
Title: Descriptive title
DOI: DOI that can be resolved at doi.org to the TCIA wiki page for this analysis result
CancerType: TCIA assigned cancer type of this analysis result
Location: Body location that was studied
Subjects: Number of subjects in the analysis result
Collections: Original collections studied
AnalysisArtifactsonTCIA: Type(s) of analysis artifacts generated
Updated: Date when results were last updated
license_url: The URL of a web page that describes the license governing this collection
license_long_name: A long form name of the license governing this collection
license_short_name: A short form name of the license governing this collection
description: Description of analysis result
version_metadata
Metadata for each IDC version, one row per version:
idc_version: IDC version number
version_hash: MD5 hash of hashes of collections in this version
version_timestamp: Version creation timestamp
The following tables and views consist of metadata derived from one or more other IDC tables for the convenience of the user. For each such table, <table_name>, there is also a corresponding view, <table_name>_view, that, when queried, generates an equivalent table. These views are intended as a reference; each view's SQL is available to be used for further investigation.
Several of these tables/views are discussed more completely here.
dicom_all, dicom_all_view
All columns from dicom_metadata together with selected data from the auxiliary_metadata, original_collections_metadata, and analysis_results_metadata tables.
segmentations, segmentations_view
This table is derived from dicom_all to simplify access to the attributes of DICOM Segmentation objects available in IDC. Each row in this table corresponds to one DICOM Segmentation instance segment.
measurement_groups, measurement_groups_view
This table is derived from dicom_all to simplify access to the measurement groups encoded in DICOM Structured Report TID 1500 objects available in IDC. Specifically, this table contains measurement groups corresponding to the "Measurement group" content item in the TID 1500 Measurement report DICOM SR objects.
Each row corresponds to one TID1500 measurement group.
qualitative_measurements, qualitative_measurements_view
This table is derived from dicom_all to simplify access to the qualitative measurements in DICOM SR TID1500 objects. It contains coded evaluation results extracted from the DICOM SR TID1500 objects. Each row in this table corresponds to a single qualitative measurement extracted.
quantitative_measurements, quantitative_measurements_view
This table is derived from dicom_all to simplify access to the quantitative measurements in DICOM SR TID1500 objects. It contains quantitative evaluation results extracted from the DICOM SR TID1500 objects. Each row in this table corresponds to a single quantitative measurement extracted.
dicom_metadata_curated, dicom_metadata_curated_view
Curated values of DICOM metadata extracted from dicom_metadata.
dicom_metadata_curated_series_level, dicom_metadata_curated_series_level_view
Curated columns from dicom_metadata that have been aggregated/cleaned up to describe content at the series level. Each row in this table corresponds to a DICOM instance in IDC. The columns are curated by defining queries that apply transformations to the original values of DICOM attributes.
idc_pivot_v<idc version>
A view that is the basis for the queries performed by the IDC web app.
Collection-specific BigQuery tables
Most clinical data is found in the idc_v<idc_version>_clinical datasets. However, a few tables of clinical data are found in the idc_v<idc_version> datasets.
TCGA
The following tables contain TCGA-specific metadata:
tcga_biospecimen_rel9: biospecimen metadata
tcga_clinical_rel9: clinical metadata
NLST
The following tables contain NLST specific metadata. The detailed schema of those tables is available from the TCIA NLST collection page.
nlst_canc: "Lung Cancer"
nlst_ctab: "SCT Abnormalities"
nlst_ctabc: "SCT Comparison Abnormalities"
nlst_prsn: "Participant"
nlst_screen: "SCT Screening"