Organization of data
IDC provides a variety of interfaces to access both the data (as files) and the metadata (to subset files and build cohorts). The flow of data and the relationships between the various components IDC uses are summarized in the following figure.
We maintain the following resources to enable access to IDC data:
Cloud storage buckets: files maintained by IDC are mirrored between Google and AWS public storage buckets that provide fee-free egress without requiring login. The buckets organize files by DICOM series, each series stored in a separate folder. Given the large overall size of data in IDC, you will likely need to use one of the search interfaces to identify relevant series first.
BigQuery tables: collection-level metadata, DICOM metadata, and clinical data tables are available via a SQL query interface.
Python API: the pip-installable idc-index package provides a programmatic interface and command-line tools to search IDC data using the most important metadata attributes, and to download the files corresponding to the selected cohorts from the cloud buckets.
REST API: an alternative, language-independent API for selecting subsets of data.
DICOMweb: DICOM files and metadata queries are available from Google Healthcare DICOM stores.
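As an illustration of the search-then-download flow described above, here is a minimal sketch using the idc-index Python package. Treat it as an example under stated assumptions, not a definitive recipe: the collection name "nlst", the column names (collection_id, Modality, SeriesInstanceUID, series_size_MB), the table name index, and the output folder idc_data are assumptions that may differ between idc-index versions, and build_series_query and run_example are hypothetical helper names introduced only for this example.

```python
import os
import re


def build_series_query(collection_id: str, modality: str, limit: int = 5) -> str:
    """Return a SQL string selecting series for one collection and modality.

    Column and table names are assumptions about the idc-index schema.
    """
    # Accept only simple identifiers so the values are safe to embed in SQL.
    for value in (collection_id, modality):
        if not re.fullmatch(r"[A-Za-z0-9_]+", value):
            raise ValueError(f"unexpected characters in {value!r}")
    return (
        "SELECT SeriesInstanceUID, series_size_MB "
        "FROM index "
        f"WHERE collection_id = '{collection_id}' AND Modality = '{modality}' "
        f"LIMIT {int(limit)}"
    )


def run_example(download_dir: str = "idc_data") -> None:
    """Search the index, then download the matching series.

    Requires `pip install idc-index` and network access for the download.
    """
    from idc_index import IDCClient  # imported here so the helper above stays standalone

    client = IDCClient()
    # "nlst" is an assumed example collection; substitute one relevant to you.
    series = client.sql_query(build_series_query("nlst", "CT", limit=1))
    print(series)

    # Create the target folder up front rather than relying on the client to do it.
    os.makedirs(download_dir, exist_ok=True)
    client.download_from_selection(
        seriesInstanceUID=series["SeriesInstanceUID"].tolist(),
        downloadDir=download_dir,
    )
```

Calling run_example() performs the search and downloads the selected series from the public buckets. Keeping the query small (here, a single series) is a sensible first step given the overall size of the data in IDC.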