IDC User Guide

Files and metadata

Last updated 2 years ago

Limited access content

As discussed in this community forum post, TCIA made the decision to pull a subset of data from public access collections to limited access. For now, IDC still keeps the files that were public before TCIA's decision, and the metadata for those files remains accessible in our BigQuery tables, but you cannot download the “Limited” access files referenced by gcs_url from IDC.

As discussed in this post, the issue manifests itself as an error when accessing a gcs_url that corresponds to a non-public file:

AccessDeniedException: 403 <user email> does not have storage.objects.list 
access to the Google Cloud Storage bucket.

The bigquery-public-data.idc_current.dicom_all table has a column named access, which takes the value Public or Limited to indicate whether the file corresponding to the instance can be accessed. For all practical purposes, if you interact with the IDC BigQuery tables, make sure you exclude “Limited” access items using the following clause in your query:

SELECT
  ...
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  access <> "Limited"
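
As a minimal sketch of applying this filter programmatically, the helper below assembles a query string that always carries the access clause. The function name `public_only_query` and its parameters are illustrative and not part of any IDC tooling, and `Modality` is assumed to be a column of dicom_all (it is a standard DICOM attribute). The resulting string can be passed to whichever BigQuery client you use.

```python
DEFAULT_TABLE = "bigquery-public-data.idc_current.dicom_all"


def public_only_query(columns, table=DEFAULT_TABLE, extra_where=None):
    """Build a SELECT over an IDC table that skips "Limited" access items.

    `columns` is a list of column names; `extra_where` is an optional
    additional condition that is ANDed with the access filter.
    """
    if not columns:
        raise ValueError("at least one column is required")
    conditions = ['access <> "Limited"']
    if extra_where:
        conditions.append(f"({extra_where})")
    lines = [
        "SELECT",
        "  " + ",\n  ".join(columns),
        "FROM",
        f"  `{table}`",
        "WHERE",
        "  " + "\n  AND ".join(conditions),
    ]
    return "\n".join(lines)


# Hypothetical usage: select file locations of CT instances only.
query = public_only_query(
    ["SOPInstanceUID", "gcs_url"], extra_where='Modality = "CT"'
)
print(query)
```

Keeping the access condition inside the helper means callers cannot forget it when they add their own conditions.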

In a future release of IDC, limited access items will be excluded by default from what you select in the portal, which should make portal selection more intuitive. If you access the data via BigQuery queries, however, you need to know that “Limited” items are not accessible and account for this in your queries.

Storage Buckets

Storage Buckets are basic containers in Google Cloud that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage documentation here).

All IDC DICOM file data for all IDC data versions and all of the collections hosted by IDC are maintained in Google Cloud Storage (GCS). Currently all DICOM files are maintained in GCS buckets that allow for free egress within or out of the cloud, enabled through the partnership of IDC with the Google Public Datasets Program.

The object namespace is flat: every object name consists of a standard-format CRDC UUID followed by the ".dcm" file extension, e.g. 905c82fd-b1b7-4610-8808-b0c8466b4dee.dcm. For example, that instance can be accessed using gsutil as gs://idc-open/905c82fd-b1b7-4610-8808-b0c8466b4dee.dcm
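
The naming rule above can be sketched in a few lines of Python. The function names are illustrative, and the bucket name `idc-open` is taken from the example in the text; other buckets may apply to other data.

```python
import uuid

IDC_BUCKET = "idc-open"  # bucket name from the example in the text


def gcs_url_for(crdc_uuid, bucket=IDC_BUCKET):
    """Return the gs:// URL of the DICOM file named after `crdc_uuid`."""
    # uuid.UUID raises ValueError for malformed input; str() yields the
    # canonical lowercase, hyphenated form used in the object names.
    canonical = str(uuid.UUID(crdc_uuid))
    return f"gs://{bucket}/{canonical}.dcm"


def uuid_from_gcs_url(url):
    """Recover the UUID from a URL of the form gs://<bucket>/<uuid>.dcm."""
    name = url.rsplit("/", 1)[-1]
    if not name.endswith(".dcm"):
        raise ValueError(f"not a .dcm object: {url}")
    return str(uuid.UUID(name[: -len(".dcm")]))


url = gcs_url_for("905c82fd-b1b7-4610-8808-b0c8466b4dee")
print(url)
```

Because the namespace is flat, the UUID alone is enough to locate a file once the bucket is known; no collection or study path components are involved.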

You can read about accessing GCP storage buckets from a Compute VM here.

Egress of IDC data out of the cloud is free, since IDC data is participating in the Google Public Datasets Program!

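To download the files listed in gcs_paths.txt (a text file with one GCS URL per line) using gsutil:

$ cat gcs_paths.txt | gsutil -m cp -I .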
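
The shell pipeline above can also be driven from Python. The sketch below is one possible wrapper, not an IDC-provided utility: the function names are hypothetical, it assumes gsutil is installed and on the PATH, and `download` is defined but not called here.

```python
import subprocess
from pathlib import Path


def parse_manifest(text):
    """Return the GCS URLs found in manifest text, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def gsutil_copy_command(dest="."):
    """Return the argv list equivalent to `gsutil -m cp -I <dest>`."""
    return ["gsutil", "-m", "cp", "-I", dest]


def download(manifest_path, dest="."):
    """Feed the URLs in `manifest_path` to gsutil on stdin, like the pipeline."""
    urls = parse_manifest(Path(manifest_path).read_text())
    subprocess.run(
        gsutil_copy_command(dest),
        input="\n".join(urls) + "\n",
        text=True,
        check=True,  # raise CalledProcessError if gsutil reports a failure
    )
```

Calling `download("gcs_paths.txt")` would then copy every listed object into the current directory.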

BigQuery Tables

The flat address space of IDC DICOM objects in GCS storage is accompanied by BigQuery tables that allow the researcher to reconstruct the DICOM hierarchy as it exists for any given version. There are also several BQ tables and views in which we keep copies of the metadata exposed via the TCIA interface at the time a version was captured and other pertinent information.

There is an instance of each of the following tables and views per IDC version. The set of tables and views corresponding to an IDC version is collected in a single BQ dataset per IDC version, bigquery-public-data.idc_v<idc_version_number>, where bigquery-public-data is the project in which the dataset is hosted. As an example, the BQ tables for IDC version 4 are in the bigquery-public-data.idc_v4 dataset.

In addition to the per-version datasets, the bigquery-public-data.idc_current dataset consists of a set of BQ views. There is a view for each table or view in the BQ dataset corresponding to the current IDC release. Each such view in bigquery-public-data.idc_current is named identically to some table or view in the bigquery-public-data.idc_v<idc_version_number> dataset of the current IDC release and can be used to access that table or view.
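
The naming convention can be captured in a small helper, sketched here under the assumption that version numbers are positive integers. The function names `idc_dataset` and `idc_table` are illustrative only.

```python
PROJECT = "bigquery-public-data"


def idc_dataset(version=None):
    """Return the dataset for an IDC version, or idc_current if none given."""
    if version is None:
        return f"{PROJECT}.idc_current"
    if not isinstance(version, int) or version < 1:
        raise ValueError("version must be a positive integer")
    return f"{PROJECT}.idc_v{version}"


def idc_table(name, version=None):
    """Return the fully qualified name of a table or view."""
    return f"{idc_dataset(version)}.{name}"


print(idc_table("dicom_all"))      # view tracking the current release
print(idc_table("dicom_all", 4))   # table pinned to IDC version 4
```

Pinning a query to a versioned dataset keeps its results reproducible, while idc_current follows whatever the latest release is.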

Several Google BigQuery (BQ) tables support searches against metadata extracted from the data files. Additional BQ tables define the composition of each IDC data version.

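We maintain several additional tables that curate non-DICOM metadata (e.g., attribution of a given item to a specific collection and DOI, collection-level metadata, etc.).

The columns of the auxiliary_metadata table, described further below, are grouped by the level of the DICOM hierarchy they refer to. Collection attributes: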

    • tcia_api_collection_id: The ID, as accepted by the TCIA API, of the original data collection containing this instance

    • idc_webapp_collection_id: The ID, as accepted by the IDC web app, of the original data collection containing this instance

    • collection_timestamp: Datetime when the IDC data in the collection was last revised

    • source_doi: A DOI of the TCIA wiki page corresponding to the original data collection or analysis results that is the source of this instance

    • collection_hash: The md5 hash of the sorted patient_hashes of all patients in the collection containing this instance

    • collection_init_idc_version: The IDC version in which the collection containing this instance first appeared

    • collection_revised_idc_version: The IDC version in which the collection containing this instance was most recently revised

    Patient attributes:

    • submitter_case_id: The ID assigned to the patient containing this instance by the submitter of the data to TCIA. This is the DICOM PatientID

    • idc_case_id: IDC-generated UUID that uniquely identifies the patient containing this instance

      This is needed because DICOM PatientIDs are not required to be globally unique

    • patient_hash: The md5 hash of the sorted study_hashes of all studies in the patient containing this instance

    • patient_init_idc_version: The IDC version in which the patient containing this instance first appeared

    • patient_revised_idc_version: The IDC version in which the patient containing this instance was most recently revised

    Study attributes:

    • StudyInstanceUID: DICOM UID of the study containing this instance

    • study_uuid: IDC-assigned UUID that identifies a version of the study containing this instance

    • study_instances: The number of instances in the study containing this instance

    • study_hash: The md5 hash of the sorted series_hashes of all series in the study containing this instance

    • study_init_idc_version: The IDC version in which the study containing this instance first appeared

    • study_revised_idc_version: The IDC version in which the study containing this instance was most recently revised

    Series attributes:

    • SeriesInstanceUID: DICOM UID of the series containing this instance

    • series_uuid: IDC-assigned UUID that identifies a version of the series containing this instance

    • source_doi: A DOI of the TCIA wiki page corresponding to the original data collection or analysis results that is the source of this instance

    • series_instances: The number of instances in the series containing this instance

    • series_hash: The md5 hash of the sorted instance_hashes of all instances in the series containing this instance

    • series_init_idc_version: The IDC version in which the series containing this instance first appeared

    • series_revised_idc_version: The IDC version in which the series containing this instance was most recently revised

    Instance attributes:

    • SOPInstanceUID: DICOM UID of this instance.

    • instance_uuid: IDC-assigned UUID that identifies a version of this instance

    • gcs_url: The GCS URL of a file containing the version of this instance that is identified by the instance_uuid

    • instance_hash: The md5 hash of the version of this instance that is identified by the instance_uuid

    • instance_size: The size, in bytes, of the version of this instance that is identified by the instance_uuid

    • instance_init_idc_version: The IDC version in which this instance first appeared

    • instance_revised_idc_version: The IDC version in which this instance was most recently revised

    • license_url: The URL of a web page that describes the license governing this instance

    • license_long_name: A long form name of the license governing this instance

    • license_short_name: A short form name of the license governing this instance
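
The hash columns above form a hierarchy: each level hashes the sorted hashes of the level below it. The sketch below illustrates that idea only. How IDC actually serializes the sorted hashes before hashing is not stated here, so joining the hex digests without a separator is an assumption, and the byte strings stand in for real DICOM file contents.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the md5 digest of `data` as a hex string."""
    return hashlib.md5(data).hexdigest()


def combined_hash(child_hashes):
    """md5 over the sorted child hashes.

    ASSUMPTION: the sorted hex digests are concatenated with no separator;
    the exact serialization used by IDC is not documented on this page.
    """
    return md5_hex("".join(sorted(child_hashes)).encode("ascii"))


# Hypothetical instance contents standing in for DICOM file bytes.
series_a = [b"instance-1", b"instance-2"]
series_b = [b"instance-3"]

series_hash_a = combined_hash(md5_hex(blob) for blob in series_a)
series_hash_b = combined_hash(md5_hex(blob) for blob in series_b)
study_hash = combined_hash([series_hash_a, series_hash_b])
print(study_hash)
```

Sorting before hashing makes each hash independent of the order in which children are enumerated, so a change in a hash at any level signals that something beneath it was revised.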

Due to existing limitations of the Google Healthcare API, not all DICOM attributes are extracted and made available in the BigQuery tables. Specifically:

  • sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and from the RetrieveMetadata output. 1 MiB is not an exact limit, but it can be used as a rough estimate of whether the API will drop the tag (this limitation was not documented at the time of writing). We know that some of the instances in IDC are affected by this limitation. According to communication with Google Healthcare support, the fix for this limitation was targeted for sometime in 2021.

Columns of the original_collections_metadata table, described further below (one row per collection):

    • tcia_api_collection_id: The collection ID as accepted by the TCIA API

    • tcia_wiki_collection_id: The collection ID as on the TCIA wiki page

    • idc_webapp_collection_id: The collection ID as accepted by the IDC web app

    • Program: The program to which this collection belongs

    • Updated: Most recent update date reported by TCIA

    • Status: Collection status: Ongoing or Complete

    • Access: Collection access conditions: Limited or Public

    • ImageType: Enumeration of image types/modalities in the collection

    • Subjects: Number of subjects in the collection

    • DOI: DOI that can be resolved at doi.org to the TCIA wiki page for this collection

    • CancerType: TCIA assigned cancer type of this collection

    • SupportingData: Type(s) of additional data available

    • Species: Species of collection subjects

    • Location: Body location that was studied

    • Description: TCIA description of the collection (HTML format)

    • license_url: The URL of a web page that describes the license governing this collection

    • license_long_name: A long form name of the license governing this collection

    • license_short_name: A short form name of the license governing this collection

Columns of the analysis_results_metadata table, described further below (one row per analysis result):

    • ID: Results ID

    • Title: Descriptive title

    • DOI: DOI that can be resolved at doi.org to the TCIA wiki page for this analysis result

    • CancerType: TCIA assigned cancer type of this analysis result

    • Location: Body location that was studied

    • Subjects: Number of subjects in the analysis result

    • Collections: Original collections studied

    • AnalysisArtifactsonTCIA: Type(s) of analysis artifacts generated

    • Updated: Date when results were last updated

    • license_url: The URL of a web page that describes the license governing this collection

    • license_long_name: A long form name of the license governing this collection

    • license_short_name: A short form name of the license governing this collection

  • bigquery-public-data.idc_v<idc_version_number>.version_metadata (also available via the bigquery-public-data.idc_current.version_metadata view for the current version of IDC data). Metadata for each IDC version, one row per version:

    • idc_version: IDC version number

    • version_hash: MD5 hash of hashes of collections in this version

    • version_timestamp: Version creation timestamp

Google BigQuery (BQ) is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using standard SQL queries.

Typically, the user would not interact with the storage buckets to select and copy files (unless the intent is to copy the entire content hosted by IDC). Instead, one should use either the IDC Portal or the IDC BigQuery tables containing file metadata to identify items of interest and define a cohort. The cohort manifest generated by the IDC Portal can include both the Google Storage URLs for the corresponding files in the bucket and the CRDC UUIDs, which can be resolved to the Google Storage URLs to access the files.

Assuming you have a list of GCS URLs in a file gcs_paths.txt, you can download the corresponding items using the gsutil command shown in the Storage Buckets section above, substituting $PROJECT_ID with a valid GCP Project ID wherever the command references it (see the complete example in this notebook).

The per-version tables whose columns are listed above are the following:

  • bigquery-public-data.idc_v<idc_version_number>.auxiliary_metadata (also available via the bigquery-public-data.idc_current.auxiliary_metadata view). This table defines the contents of the corresponding IDC version. There is a row for each instance in the version. Its columns are listed above, grouped into collection, patient, study, series and instance attributes.

  • bigquery-public-data.idc_v<idc_version_number>.dicom_metadata (also available via the bigquery-public-data.idc_current.dicom_metadata view for the current version of IDC data). DICOM metadata for each instance in the corresponding IDC version. IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. Conventions of how DICOM attributes of various types are converted into BQ form are covered in the Understanding the BigQuery DICOM schema Google Healthcare API documentation article. IDC users can access this table to conduct detailed exploration of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal. The schema is too large to document here; refer to the BQ table and the documentation referenced above. Note that sequences that have more than 15 levels of nesting are not extracted (see https://cloud.google.com/bigquery/docs/nested-repeated); we believe this limitation does not affect the data stored in IDC.

  • bigquery-public-data.idc_v<idc_version_number>.original_collections_metadata (also available via the bigquery-public-data.idc_current.original_collections_metadata view). Collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in this table at TCIA. One row per collection; the columns are listed above.

  • bigquery-public-data.idc_v<idc_version_number>.analysis_results_metadata (also available via the bigquery-public-data.idc_current.analysis_results_metadata view for the current version of IDC data). Metadata for the TCIA analysis results hosted by IDC, for the most part corresponding to the content available in this table at TCIA. One row per analysis result; the columns are listed above.

The following BigQuery views (virtual tables defined by queries) extract specific subsets of metadata, or combine attributes across different tables, for the convenience of users:

  • bigquery-public-data.idc_v<idc_version_number>.dicom_all (also available via the bigquery-public-data.idc_current.dicom_all view for the current version of IDC data): DICOM metadata together with selected auxiliary and collection metadata

  • bigquery-public-data.idc_v<idc_version_number>.segmentations (also available via the bigquery-public-data.idc_current.segmentations view for the current version of IDC data): Attributes of the segments stored in DICOM Segmentation objects

  • bigquery-public-data.idc_v<idc_version_number>.measurement_groups (also available via the bigquery-public-data.idc_current.measurement_groups view for the current version of IDC data): Measurement group sequences extracted from the DICOM SR TID1500 objects

  • bigquery-public-data.idc_v<idc_version_number>.qualitative_measurements (also available via the bigquery-public-data.idc_current.qualitative_measurements view for the current version of IDC data): Coded evaluation results extracted from the DICOM SR TID1500 objects

  • bigquery-public-data.idc_v<idc_version_number>.quantitative_measurements (also available via the bigquery-public-data.idc_current.quantitative_measurements view for the current version of IDC data): Quantitative evaluation results extracted from the DICOM SR TID1500 objects

The following tables contain TCGA-specific metadata:

  • tcga_biospecimen_rel9: biospecimen metadata

  • tcga_clinical_rel9: clinical metadata

Collection-specific BigQuery tables

Some of the collections are accompanied by BigQuery tables that have not been harmonized to a single data model. Those tables are available within the BigQuery dataset corresponding to a given release, and their names are prefixed with the short name of the collection. The list below discusses those collection-specific tables.

NLST

IDC hosts a subset of the NLST clinical data, which was cleared for public sharing. If you need the full clinical data, please visit the Cancer Data Access System (CDAS).

The following tables contain NLST-specific metadata. The detailed schema of those tables is available from the TCIA NLST collection page.

  • nlst_canc: "Lung Cancer"

  • nlst_ctab: "SCT Abnormalities"

  • nlst_ctabc: "SCT Comparison Abnormalities"

  • nlst_prsn: "Participant"

  • nlst_screen: "SCT Screening"

DICOM Stores

IDC utilizes a single Google Healthcare DICOM store to host all of the instances in the current IDC version. That store, however, is primarily intended to support visualization of the data using OHIF Viewer. At this time, we do not support access to the hosted data via the DICOMweb interface by IDC users. See more details in the discussion here, and please comment about your use case if you need to access data via the DICOMweb interface.

BigQuery tables external to IDC

In addition to the DICOM data, some of the image-related data hosted by IDC is stored in additional tables. These include the following:

  • BigQuery TCGA clinical data: isb-cgc:TCGA_bioclin_v0.clinical_v1. Note that this table is hosted under the ISB-CGC Google project, as documented here, and its location may change in the future!
