Clinical data
Check out our IDC clinical data exploration Colab notebook tutorial for a brief hands-on introduction into IDC clinical data! You can also see the high-level summary of the clinical data attributes accompanying IDC data in this dashboard.
Background
By clinical data we refer to the broad spectrum of image-related data that may accompany images. Such data may include demographics of the patients, observations related to their clinical history (therapies, diagnoses, findings), lab tests, surgeries.
Clinical data is often critical in understanding imaging data, and is essential for the development and validation of imaging biomarkers. However, such data is most often stored in spreadsheets that follow conventions specific to the site that collected the data, may not be accompanied by the dictionary defining the terms used in describing clinical data, and is rarely harmonized. This can be observed on the example of various collections ingested into IDC from The Cancer Imaging Archive (TCIA), such as the ACRIN 6698 collection.
Not only are the terms used in the clinical data accompanying individual collection not harmonized, but the format of the spreadsheets is also collection-specific. In order to search and navigate clinical data, one has to parse those collection specific tables, and there is no interface to support searching across collections.
With the release v11 of IDC, we make the attempt to lower the barriers for accessing clinical data accompanying IDC imaging collections. We parse collection-specific tables, and organize the underlying data into BigQuery tables that can be accessed using standard SQL queries. You can also see the summary of clinical data available for IDC collections in this dashboard.
Clinical data BigQuery tables
As of Version 11 IDC provides a public BigQuery dataset with clinical data associated with several of its imaging collections. The clinical data tables associated with a particular version are in the dataset bigquery-public-data.idc_<idc_version_number>_clinical
. In addition the dataset bigquery-public-data.idc_current_clinical
has an identically named view for each table in the BQ clinical dataset corresponding to the current IDC release.
There are currently 130 tables with clinical data representing 70 different collections. Most of this data was curated from Excel and CSV files downloaded from The Cancer Imaging Archive (TCIA) wiki. For most collections, data is placed in a single table named <collection_id>_clinical
, where <collection_id>
is the name of the collection in a standardized format (i.e. the idc_webapp_collection_id
column in the dicom_all
view in the idc_current dataset).
Collections from the ACRIN project have different types of clinical data spread across CSV files, and so this data is represented by several BigQuery tables. The clinical data for collections in the CPTAC program program is not curated from TCIA but instead is copied from a BigQuery table in the ISB-CGC project, which in turn was sourced from the Genomics Data Commons (GDC) api. Similarly clinical data for collections in the TCGA program is copied from the table tcga_clinical_rel9
in the idc_current
dataset, which was also created using the Genomics Data Commons (GDC) api. Every clinical data table contains two fields we have introduced, dicom_patient_id
and source_batch
. dicom_patient_id
is identical to the PatientID
field in the DICOM files that correspond to the given patient. The dicom_patient_id
value is determined by inspecting the patient column in the clinical data file. In some of the collections' clinical data, the patients are separated into different 'batches' i.e. different source files, or different sheets in the same Excel file. The source_batch
field is an integer indicating the 'batch' for the given patient. For most collections, in which all patients data is found in the same location, the source_batch
value is zero.
Most of the clinical tables are legible by themselves. Tables from the ACRIN collection are an exception as the column names and some of the column values are coded. To provide for clarity and ease of use of all clinical data, we have created two metadata tables, table_metadata
and column_metadata
that provide information about the structure and provenance of all data in this dataset. table_metadata
has table-level metadata about each clinical collection, while column_metadata
has column-level metadata.
Structure of thetable_metadata
table:
collection_id
(STRING, NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to theidc_webapp_id
field in thedicom_all
view in theidc_current
dataset.table_name
(STRING,NULLABLE) - name of the tabletable_description
(STRING,NULLABLE) - description of the type of data found in the table. Usually this is set to 'clinical data', unless a description is provided in the source filesidc_version_table_added
(STRING, NULLABLE) - the IDC data version for which this table was first addedidc_table_added_datetime
(STRING,NULLABLE) - the date/time this particular table was first generatedpost_process_src
(STRING, NULLABLE) - except for the CPTAC and TCGA collections the tables are curated from ZIP, Excel, and CSV files downloaded from the TCIA wiki. These files do not have a consistent structure and were not meant to be machine readable or to translate directly into BigQuery. A semi-manual curation process results in either a CSV of JSON file that can be directly written into a BigQuery table. post_process_src is the name of the JSON or CSV file that results from this process and is used to create the BigQuery table. This field is not used for the CPTAC- and TCGA-related tablespost_process_src_add_md5
(STRING, NULLABLE) - the md5 hash of post_process_src when the table was first addedidc_version_table_prior
(STRING, NULLABLE) - the idc version the second most recent time the table was updatedpost_process_src_prior_md5
(STRING, NULLABLE) - the md5 hash of post_process_src the second most recent time the table was updatedidc_version_table_updated
(STRING, NULLABLE) - the idc version when the table was last updatedtable_update_datetime
(STRING, NULLABLE) - date and time an update of the table was last recordedpost_process_src_updated_md5
(STRING, NULLABLE) - the md5 hash of post_process_source when the table was last updatednumber_batches
(INTEGER, NULLABLE) - records the number of batches. Within the source data patients are sometimes grouped into different 'batches' (i.e. training vs test, responder vs non-responder etc.) and the batches are placed in different locations (i.e. different files or different sheets in the same Excel file)source_info
(RECORD, REPEATED) - an array of records with information about the table sources. These sources are either files downloaded from the TCIA wiki or another BigQuery table (as is the case for CPTAC and TCGA collections). There is a source_info record for each source 'batch' described abovesource_info.srcs
(STRING, REPEATED) - a source file downloaded from the TCIA wiki may be a ZIP file, and CSV file, or an Excel file. Sometimes the ZIP files contain other ZIP files that must be opened to extract the clinical data. In thesource_info.src
array the first string is the file that is downloaded from TCIA for this particular source batch. The final string is the CSV or Excel file that contains the clinical data. Any intermediate strings are the names of ZIP files 'in between' the downloaded file and the clinical file. For CPTAC and TCGA collections this field contains the source BigQuery tablesource_info.md5
(STRING, NULLABLE) - md5 hash of the downloaded file from TCIA the most recent time the table was updatedsource_info.table_last_modified
(STRING, NULLABLE) - CPTAC and TCGA collections only. The date and time the source BigQuery table was most recently modified, as recorded when last copiedsource_info.table_size
(STRING, NULLABLE) - CPTAC and TCGA collections only. The size of the source BigQuery table as recorded when last copied
Structure of column_metadata
table:
collection_id
(STRING,NULLABLE) - the collection_id of the collection in the given table. The collection id is in a format used internally by the IDC Web App (with only lowercase letters, numbers and '_' allowed). It is equivalent to theidc_webapp_id
field in thedicom_all
view in theidc_current
dataset.case_col
(BOOLEAN, NULLABLE) - true if the BigQuery column contains the patient or case id, i.e. if this column is used to determine the value of thedicom_patient_id
columntable_name
(STRING, NULLABLE) - table namecolumn
(STRING, NULLABLE) - the actual column name in the table. For ACRIN collections thecolumn_name
is thevariable_name
from the provided data dictionary. For other collections it is a name constructed by 'normalizing' thecolumn_label
(see next) in a format that can be used as a BigQuery field namecolumn_label
(STRING, NULLABLE) - a 'free form' label for the column that does not need to conform to the BigQuery column format requirements. For ACRIN collections this is thevariable_label
given by a data dictionary that accompanies the collection. For other collections it is the name or label of the clinical attribute as inferred from the source document during the curation processdata_type
(STRING, NULLABLE) - the type of data in this column. Again for ACRIN collections this is provided in the data dictionary. For other collections it is inferred by analyzing the data during curationoriginal_column_headers
(STRING, REPEATED) - the name(s) or label(s) in the source document that were used to construct thecolumn_label
field. In most cases there is one column label in the source document that perscribes thecolumn_label
. In some cases, multiple columns are concantenated and reformated to form thecolumn_label
values
(RECORD, REPEATED) - a structure that is borrowed from the ACRIN data model. This is an array that contains observerd attribute values for this given column. For ACRIN collections these values are reported in the data dictionary. For most other collections these values are determined by analyzing the source data. For simplicity this field is left blank when the number of unique values is greater than 20values.option_code
(STRING, NULLABLE) - a unique attribute value found in this columnvalues.option_description
(STRING, NULLABLE) - a description of theoption_code
as provided by a data dictionary. For collections that do not have a data dictionary this is null.values_source
(STRING, NULLABLE) - indicates the source of thevalues
records. The text 'provided dictionary' indicates that the records were obtained from a provided data dictionary. The text 'derived from inspection of values' indicates that the records were determined by automated analysis of the source materials during the ETL process that generated the BigQuery tables.files
(STRING, REPEATED) - names of the files that contain the source data for each batch. These are the Excel or CSV files directly downloaded from TCIA, or the files extracted from downloaded ZIP filessheet_names
(STRING, REPEATED) - for Excel-sourced files, the sheet names containing this column's values for each batchbatch
(INTEGER, REPEATED) - source batches that contain this particular column. Some columns or attributes may be missing from some batchescolumn_numbers
(STRING, REPEATED) - for each source batch, the column in the original source corresponding to this column in the BigQuery table
Last updated