Files and metadata
Let's start with the overall principles of how we organize data in IDC.
IDC brings you (as of v21) over 85 TB of publicly available DICOM images and image-derived content. We share this content as DICOM files, and those files are available in cloud-based storage buckets in both Google Cloud and AWS.
Sharing just the files, however, is not particularly helpful. With that much data, it is no longer practical to simply download all of the files and then sort through them to select the ones you need.
Think of IDC as a library, where each file is a book. With that many books, it is not feasible to read them all, or even open each one to understand what is inside. Libraries are of little use without a catalog!
To provide you with a catalog of our data, alongside the files we maintain metadata that makes it possible to understand what is contained within each file and to select the files that are of interest for your project, so that you can download just the files you need.
In the following, we describe the organization of both the storage buckets containing the files and the metadata catalog that you can use to select files that meet your needs. As you go over this documentation, please consider completing our "Getting started" tutorial - it will give you the opportunity to apply the knowledge you gain from this article while interacting with the data, and should help you better understand the content.
Storage Buckets
All IDC DICOM file data for all IDC data versions across all of the collections hosted by IDC are mirrored between Google Cloud Storage (GCS) and AWS S3 buckets.
Currently, all DICOM files are maintained in buckets that allow for free egress within or out of the cloud. This is enabled through the partnership of IDC with the Google Public Data Program and the AWS Open Data Sponsorship Program.

Data covered by a non-restrictive license (CC-BY or similar) and not belonging to collections labeled as potentially containing head scans. This category contains >90% of the data in IDC.
AWS: idc-open-data
GCS: idc-open-data
(until IDC v19, we utilized the GCS bucket public-datasets-idc before it was superseded by idc-open-data)
Collections that may contain head scans. These are kept separate for the collections that were labeled as such by TCIA, in case there is a change in policy and we need to treat such images in a special way in the future.
AWS: idc-open-data-two
GCS: idc-open-idc1
Data that is covered by a license restricting commercial use (CC-NC). Note that the license information is available programmatically at the granularity of individual files, as explained in this tutorial - you do not need to check the bucket name to get the license information!
AWS: idc-open-data-cr
GCS: idc-open-cr
Within each bucket, files are organized in folders, with each folder containing the files of a single DICOM series. On ingestion, we assign each DICOM series and each DICOM instance a UUID, in order to be able to support data versioning (when needed). These UUIDs are available in our metadata indices, and are used to organize the content of the buckets: for each version of a DICOM instance having instance UUID instance_uuid in a version of a series having series UUID series_uuid, the file name is:
<series_uuid>/<instance_uuid>.dcm
Corresponding files have the same object name in GCS and S3, though the name of the containing buckets will be different.
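As a minimal sketch of what this naming scheme enables (mirroring the s5cmd usage shown later in this article; <series_uuid> is a hypothetical placeholder, not a real value), you could retrieve all instances of a single series from the AWS idc-open-data bucket as follows.
# <series_uuid> is a placeholder; substitute a series UUID obtained from the metadata indices
$ s5cmd --no-sign-request cp 's3://idc-open-data/<series_uuid>/*' ./series_download/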
Metadata
IDC metadata tables are provided to help you navigate IDC content and narrow down to the specific files that meet your research interests.
As a step in the data ingestion process (summarized earlier), IDC extracts all of the DICOM metadata, merges it with collection-level and other metadata attributes not available from DICOM, ingests collection-level clinical tables, and stores the result in Google BigQuery tables. Google BigQuery (BQ) is a massively parallel analytics engine that is ideal for working with tabular data. Data stored in BQ can be accessed using standard SQL queries. We talk more about those in the subsequent sections of the documentation!
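As a minimal sketch of such a query (this assumes the bigquery-public-data.idc_current.dicom_all table, which holds the metadata for the current IDC release, and its license_short_name column - verify the exact names in the BigQuery console), selecting a few MR series along with their per-file license information could look like this.
-- table and column names are assumptions based on the current IDC release
SELECT SeriesInstanceUID, Modality, license_short_name
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality = 'MR'
LIMIT 10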
Searching BigQuery tables requires you to sign in with a Google Account! If this poses a problem for you, there are several alternatives.
idc-index provides access to the metadata aggregated at the DICOM series level. BigQuery and Parquet files provide metadata at the granularity of individual DICOM instances (files).
Python idc-index package
A small subset of the most critical metadata attributes available in IDC BigQuery tables is extracted and made available via the idc-index Python package.
If you are just starting with IDC, you can skip the details covering the content of the BigQuery tables and proceed to this tutorial, which will help you learn the basics of searching IDC metadata using idc-index. But for the sake of example, this is how you would select and download an MR DICOM series available in IDC.
pip install --upgrade idc-index
from idc_index import IDCClient
# instantiate the client
client = IDCClient()
# define and execute the query
selection_query = """
SELECT SeriesInstanceUID
FROM index
WHERE Modality = 'MR'
"""
selection_result = client.sql_query(selection_query)
# download the first series from the list
client.download_dicom_series(
    seriesInstanceUID=selection_result["SeriesInstanceUID"].values[0],
    downloadDir=".",
)
Parquet files available via a cloud bucket
We export all the content available via BigQuery into Parquet (https://parquet.apache.org/) files available from our public AWS bucket! Using open-source tools such as DuckDB (https://duckdb.org/), you can query those files using standard SQL, without relying on BigQuery (although running complex queries may require significant resources from your runtime environment!).
The exported Parquet files are located in the IDC-maintained AWS idc-open-metadata bucket, which is updated every time IDC has a new data release. The exported tables are organized under the bigquery_export folder in that bucket, with each sub-folder corresponding to a BigQuery dataset.
Assuming you have s5cmd installed, you can list the exported datasets as follows.
$ s5cmd --no-sign-request ls s3://idc-open-metadata/bigquery_export/
DIR idc_current/
DIR idc_current_clinical/
DIR idc_v1/
DIR idc_v10/
DIR idc_v11/
DIR idc_v11_clinical/
DIR idc_v12/
DIR idc_v12_clinical/
DIR idc_v13/
DIR idc_v13_clinical/
DIR idc_v14/
DIR idc_v14_clinical/
DIR idc_v15/
DIR idc_v15_clinical/
DIR idc_v16/
DIR idc_v16_clinical/
DIR idc_v17/
DIR idc_v17_clinical/
DIR idc_v18/
DIR idc_v18_clinical/
DIR idc_v19/
DIR idc_v19_clinical/
DIR idc_v2/
DIR idc_v20/
DIR idc_v20_clinical/
DIR idc_v21/
DIR idc_v21_clinical/
DIR idc_v3/
DIR idc_v4/
DIR idc_v5/
DIR idc_v6/
DIR idc_v7/
DIR idc_v8/
DIR idc_v9/
As an example, the dicom_all table for the latest (current) IDC release will be in s3://idc-open-metadata/bigquery_export/idc_current/dicom_all (since the table is quite large, the export result is not a single file, but a folder containing thousands of Parquet files).
$ s5cmd --no-sign-request ls s3://idc-open-metadata/bigquery_export/idc_current/dicom_all/
2024/11/23 18:01:07 7545045 000000000000.parquet
2024/11/23 18:01:07 7687834 000000000001.parquet
2024/11/23 18:01:07 7409070 000000000002.parquet
2024/11/23 18:01:07 7527558 000000000003.parquet
...
...
2024/11/23 18:00:14 7501451 000000004997.parquet
2024/11/23 18:00:14 7521972 000000004998.parquet
2024/11/23 18:00:14 7575037 000000004999.parquet
2024/09/12 18:20:05 588723 000000005000.parquet
You can query those tables/Parquet files without downloading them, as shown in the following snippet. Depending on the query you are trying to execute, you may need a lot of patience!
import duckdb
# Connect to DuckDB (in-memory)
con = duckdb.connect()
# Install and load the httpfs extension for S3 access
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
# No credentials needed for public buckets
# Query all Parquet files in the public S3 folder
selection_query = """
SELECT SeriesInstanceUID
FROM read_parquet('s3://idc-open-metadata/bigquery_export/idc_current/dicom_all/*.parquet') AS dicom_all
WHERE Modality = 'MR'
LIMIT 1
"""
selection_result = con.execute(selection_query).fetchdf()
print(selection_result['SeriesInstanceUID'].values[0])
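If you expect to run many queries, or particularly heavy ones, one alternative (a sketch, assuming you have first copied the export locally, e.g. with s5cmd --no-sign-request cp 's3://idc-open-metadata/bigquery_export/idc_current/dicom_all/*' dicom_all/) is to point DuckDB at the local copy, avoiding repeated network transfer.
import duckdb
con = duckdb.connect()
# 'dicom_all/' is the hypothetical local folder created by the copy step above
selection_query = """
SELECT Modality, COUNT(DISTINCT SeriesInstanceUID) AS series_count
FROM read_parquet('dicom_all/*.parquet')
GROUP BY Modality
ORDER BY series_count DESC
"""
print(con.execute(selection_query).fetchdf())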