IDC's approach to storing and managing DICOM data relies on the Google Cloud Platform Healthcare API. We maintain three representations of the data. They are fully synchronized and correspond to the same dataset, but each is intended to serve different use cases.
All of the resources listed below are accessible under the canceridc-data GCP project.
Storage buckets are named using the format
TCIA_COLLECTION_NAME corresponds to the collection name in the collections table here.
Within the bucket, DICOM files are organized using the following directory naming conventions:
*InstanceUIDs correspond to the respective value of the DICOM attributes in the stored DICOM files.
You can read about accessing GCP storage buckets from a Compute VM here.
All of the IDC buckets are requester-pays, which means that to download data from them you need to provide the Project ID of a GCP project that has billing set up.
Assuming you have a list of GCS URLs in gcs_paths.txt, you can download the corresponding items using the command below, substituting $PROJECT_ID with a valid GCP Project ID (see the complete example in this notebook):
$ cat gcs_paths.txt | gsutil -u $PROJECT_ID -m cp -I .
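The same download can be scripted from Python. The sketch below is a minimal example, not part of any IDC tooling: it only assembles the gsutil invocation shown above and pipes the URL list to it on standard input. The helper names (build_gsutil_command, read_gcs_urls, download_from_manifest) are hypothetical, and the sketch assumes gsutil is installed and authenticated on the machine.

```python
import subprocess
from pathlib import Path
from typing import List


def build_gsutil_command(project_id: str, dest_dir: str = ".") -> List[str]:
    """Return the argument list for `gsutil -u PROJECT_ID -m cp -I DEST`."""
    if not project_id:
        # Requester-pays buckets need a billing-enabled project to charge.
        raise ValueError("a billing-enabled GCP Project ID is required")
    return ["gsutil", "-u", project_id, "-m", "cp", "-I", dest_dir]


def read_gcs_urls(manifest_path: str) -> List[str]:
    """Read gs:// URLs from a text file, one per line, skipping blank lines."""
    lines = Path(manifest_path).read_text().splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    for url in urls:
        if not url.startswith("gs://"):
            raise ValueError(f"not a GCS URL: {url}")
    return urls


def download_from_manifest(
    manifest_path: str, project_id: str, dest_dir: str = "."
) -> None:
    """Feed the URLs to gsutil on stdin, as `cat file | gsutil ... -I` does."""
    urls = read_gcs_urls(manifest_path)
    cmd = build_gsutil_command(project_id, dest_dir)
    # check=True raises CalledProcessError if gsutil exits with an error.
    subprocess.run(cmd, input="\n".join(urls) + "\n", text=True, check=True)
```

Calling download_from_manifest("gcs_paths.txt", "my-billing-project") would then behave like the shell command, with "my-billing-project" standing in for your own Project ID.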
Google BigQuery (BQ) is a massively parallel analytics engine ideal for working with tabular data. IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. The conventions for converting DICOM attributes of various types into BQ form are covered in the Understanding the BigQuery DICOM schema article of the Healthcare API documentation.
IDC users can access this table to conduct detailed exploration of the metadata content, and build cohorts using fine-grained controls not accessible from the IDC portal.
In addition to the DICOM metadata tables, we maintain several tables that curate non-DICOM metadata (e.g., attribution of a given item to a specific collection and DOI, collection-level metadata, etc.).
canceridc-data.idc.dicom_metadata: DICOM metadata for all of the data hosted by IDC
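As an illustration of building a cohort from this table, the sketch below selects the distinct series of a given modality. It is a minimal example under stated assumptions: the column names (PatientID, StudyInstanceUID, SeriesInstanceUID, Modality) are assumed to follow the DICOM keyword naming used by the Healthcare API export, the helper names are hypothetical, and run_series_query requires the google-cloud-bigquery package plus a billing-enabled project of your own.

```python
from typing import Dict, List, Optional

DICOM_METADATA_TABLE = "canceridc-data.idc.dicom_metadata"


def build_series_query(limit: Optional[int] = None) -> str:
    """Build a Standard SQL query listing distinct series of one modality.

    The modality is passed separately as the named parameter @modality.
    """
    sql = (
        "SELECT DISTINCT PatientID, StudyInstanceUID, SeriesInstanceUID\n"
        f"FROM `{DICOM_METADATA_TABLE}`\n"
        "WHERE Modality = @modality"
    )
    if limit is not None:
        sql += f"\nLIMIT {int(limit)}"
    return sql


def run_series_query(
    project_id: str, modality: str, limit: Optional[int] = None
) -> List[Dict]:
    """Run the query, billed to project_id, and return rows as dicts."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("modality", "STRING", modality)
        ]
    )
    rows = client.query(build_series_query(limit), job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

For example, run_series_query("my-billing-project", "CT", limit=10) would return up to ten CT series, where "my-billing-project" is a placeholder for your own Project ID. Passing the modality as a query parameter rather than formatting it into the SQL string avoids quoting problems.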
In addition to the tables above, we provide the following BigQuery views (virtual tables defined by queries) that extract specific subsets of metadata, or combine attributes across different tables, for the convenience of users:
canceridc-data.idc_views.dicom_all: DICOM metadata together with the collection-level metadata
canceridc-data.idc_views.segmentations: attributes of the segments stored in DICOM Segmentation object
canceridc-data.idc_views.measurement_groups: measurement group sequences extracted from the DICOM SR TID1500 objects
canceridc-data.idc_views.qualitative_measurements: coded evaluation results extracted from the DICOM SR TID1500 objects
canceridc-data.idc_views.quantitative_measurements: quantitative evaluation results extracted from the DICOM SR TID1500 objects
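Because the columns of these views are not listed here, it can help to inspect a view's schema before querying it. The sketch below builds a query against BigQuery's INFORMATION_SCHEMA.COLUMNS for the idc_views dataset. The helper name build_columns_query is hypothetical, and the sketch assumes your account is allowed to read the dataset's metadata.

```python
# View names taken from the list above.
IDC_VIEWS = (
    "dicom_all",
    "segmentations",
    "measurement_groups",
    "qualitative_measurements",
    "quantitative_measurements",
)


def build_columns_query(view_name: str) -> str:
    """Build a Standard SQL query listing the columns of one IDC view."""
    if view_name not in IDC_VIEWS:
        raise ValueError(f"unknown IDC view: {view_name}")
    return (
        "SELECT column_name, data_type\n"
        "FROM `canceridc-data`.idc_views.INFORMATION_SCHEMA.COLUMNS\n"
        f"WHERE table_name = '{view_name}'\n"
        "ORDER BY ordinal_position"
    )
```

The resulting string, for example build_columns_query("segmentations"), can be pasted into the BigQuery console or submitted through a BigQuery client of your choice.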
The IDC MVP uses a single Google Healthcare DICOM store to host all of the collections. That store, however, is primarily intended to support visualization of the data using OHIF Viewer. At this time, IDC users cannot access the hosted data via the DICOMweb interface. See more details in the discussion here, and please comment about your use case if you need to access data via the DICOMweb interface.
In addition to the DICOM data, some of the image-related data hosted by IDC is stored in additional tables. These include the following: