IDC User Guide

Organization of data in v1 (deprecated)

v1 of IDC followed a different data layout than subsequent versions. Since the corresponding items are still available, we document it here for reference.

Last updated 2 years ago

IDC's approach to storing and managing DICOM data relies on the Google Cloud Platform Healthcare API. We maintain three representations of the data, which are fully synchronized and correspond to the same dataset, but are intended to serve different use cases.

To access the resources listed below, you must first complete the "getting started" steps for accessing the Google Cloud console!

All of the resources listed below are accessible under the canceridc-data GCP project.

Storage Buckets

Storage buckets are basic containers in Google Cloud that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage documentation here).

Storage buckets are named using the format idc-tcia-<TCIA_COLLECTION_NAME>, where TCIA_COLLECTION_NAME corresponds to the collection name in the collections table here.

Within the bucket, DICOM files are organized using the following directory naming conventions:

dicom/<StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm

where each *InstanceUID corresponds to the value of the respective DICOM attribute in the stored DICOM file.
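
As a minimal sketch of the naming conventions above, the following Python helper assembles the gs:// URL of a single DICOM file from a collection name and the three UIDs. The function name `v1_gcs_url` and the sample values are hypothetical, and the collection name is assumed to be passed exactly as it appears in the bucket name (any normalization of the TCIA name, such as lowercasing, is not covered here).

```python
def v1_gcs_url(collection_name: str, study_uid: str, series_uid: str, sop_uid: str) -> str:
    """Build the gs:// URL of one DICOM file in the IDC v1 bucket layout."""
    # Bucket name format from the text: idc-tcia-<TCIA_COLLECTION_NAME>
    bucket = f"idc-tcia-{collection_name}"
    # Object path format from the text:
    # dicom/<StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm
    return f"gs://{bucket}/dicom/{study_uid}/{series_uid}/{sop_uid}.dcm"


# Hypothetical placeholder values, not real IDC identifiers.
print(v1_gcs_url("example-collection", "1.2.3", "1.2.3.4", "1.2.3.4.5"))
```

URLs built this way can be written one per line to a text file and used as the input list for a bulk download.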

You can read about accessing GCP storage buckets from a Compute VM here.

Egress of IDC data out of the cloud is free, since IDC data participates in the Google Public Datasets Program!

Assuming you have a list of GCS URLs in gcs_paths.txt, you can download the corresponding items using the command below, substituting $PROJECT_ID with a valid GCP Project ID (see the complete example in this notebook):

$ cat gcs_paths.txt | gsutil -u $PROJECT_ID -m cp -I .
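
The same download can be driven from Python, which may be convenient inside a notebook. The sketch below only wraps the gsutil command shown above; it assumes gsutil is installed and authenticated on the machine, and the function names `gsutil_cp_command` and `download_from_manifest` are hypothetical.

```python
import subprocess
from pathlib import Path


def gsutil_cp_command(project_id: str, dest: str = ".") -> list:
    """Return the gsutil argument list equivalent to the command above."""
    # -u sets the project to bill, -m enables parallel copying,
    # -I makes `cp` read the list of URLs from standard input.
    return ["gsutil", "-u", project_id, "-m", "cp", "-I", dest]


def download_from_manifest(manifest: str, project_id: str, dest: str = ".") -> None:
    """Feed the URLs listed in `manifest` (one per line) to gsutil on stdin."""
    urls = Path(manifest).read_text()
    subprocess.run(gsutil_cp_command(project_id, dest), input=urls, text=True, check=True)


# Example call (requires gsutil and a valid project ID):
# download_from_manifest("gcs_paths.txt", "my-project-id")
```

Passing `check=True` makes the call raise an exception if gsutil exits with an error, so a failed download does not go unnoticed.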

BigQuery Tables

Google BigQuery (BQ) is a massively parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using standard SQL queries.

IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. The conventions for converting DICOM attributes of various types into BQ form are covered in the Healthcare API documentation article "Understanding the BigQuery DICOM schema".

Due to existing limitations of the Google Healthcare API, not all DICOM attributes are extracted and available in the BigQuery tables. Specifically:

  • sequences that have more than 15 levels of nesting are not extracted (see https://cloud.google.com/bigquery/docs/nested-repeated). We believe this limitation does not affect the data stored in IDC.
  • sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and from RetrieveMetadata output. 1 MiB is not an exact limit, but it can be used as a rough estimate of whether the API will drop the tag (this limitation was not documented as of this writing). We know that some of the instances in IDC are affected by this limitation. According to communication with Google Healthcare support, the fix for this limitation is targeted for sometime in 2021.

IDC users can access this table to explore the metadata content in detail and to build cohorts using fine-grained controls not available in the IDC portal.

In addition to the DICOM metadata table, we maintain several additional tables that curate non-DICOM metadata (e.g., attribution of a given item to a specific collection and DOI, collection-level metadata, etc.).

  • canceridc-data.idc.dicom_metadata: DICOM metadata for all of the data hosted by IDC
  • canceridc-data.idc.data_collections_metadata: collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in this table at TCIA
  • canceridc-data.idc.analysis_collections_metadata: collection-level metadata for the TCIA analysis collections hosted by IDC, for the most part corresponding to the content available in this table at TCIA

In addition to the tables above, we provide the following BigQuery views (virtual tables defined by queries) that extract specific subsets of metadata, or combine attributes across different tables, for the convenience of users:

  • canceridc-data.idc_views.dicom_all: DICOM metadata together with the collection-level metadata
  • canceridc-data.idc_views.segmentations: attributes of the segments stored in DICOM Segmentation objects
  • canceridc-data.idc_views.measurement_groups: measurement group sequences extracted from the DICOM SR TID1500 objects
  • canceridc-data.idc_views.qualitative_measurements: coded evaluation results extracted from the DICOM SR TID1500 objects
  • canceridc-data.idc_views.quantitative_measurements: quantitative evaluation results extracted from the DICOM SR TID1500 objects

DICOM Stores

IDC MVP utilizes a single Google Healthcare DICOM store to host all of the collections. That store, however, is primarily intended to support visualization of the data using the OHIF Viewer. At this time, we do not support access to the hosted data via the DICOMweb interface by IDC users. See more details in the discussion here, and please comment about your use case if you need to access data via the DICOMweb interface.

BigQuery tables external to IDC

In addition to the DICOM data, some of the image-related data hosted by IDC is stored in additional tables. These include the following:

  • BigQuery TCGA clinical data: isb-cgc:TCGA_bioclin_v0.clinical_v1. Note that this table is hosted under the ISB-CGC Google project, as documented here, and its location may change in the future!
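
To illustrate how the BigQuery tables described on this page might be queried from Python, the sketch below builds a standard SQL query against canceridc-data.idc.dicom_metadata and, optionally, runs it. It rests on several assumptions: the google-cloud-bigquery package is installed and credentials are configured, the v1 table is still available, and the column names (StudyInstanceUID, SeriesInstanceUID, Modality) follow the DICOM attribute keywords as in the Healthcare API export schema. The function names `series_query` and `run_query` are hypothetical.

```python
def series_query(modality: str, limit: int = 10) -> str:
    """Return standard SQL listing distinct series of one modality."""
    # The value is interpolated into the SQL text, so restrict it to
    # alphanumeric codes such as "CT" or "MR".
    if not modality.isalnum():
        raise ValueError("modality must be alphanumeric, e.g. 'CT'")
    return (
        "SELECT DISTINCT StudyInstanceUID, SeriesInstanceUID\n"
        "FROM `canceridc-data.idc.dicom_metadata`\n"
        f"WHERE Modality = '{modality}'\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str, project_id: str) -> list:
    """Run the query, billing it to `project_id`, and return rows as dicts."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    return [dict(row.items()) for row in client.query(sql).result()]


print(series_query("CT", limit=5))
# rows = run_query(series_query("CT", limit=5), "my-project-id")
```

The query is billed to the project passed to `run_query`, so a GCP project with BigQuery enabled is needed even though the IDC tables themselves are public.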