IDC API Concepts

The IDC API is based on IDC Data Model concepts. Several of these concepts have been previously introduced in the context of the IDC Portal. We discuss these concepts here with respect to the IDC API.

IDC Versions

As described previously, IDC data is versioned such that searching an IDC version according to some criteria (some filter set as described below) will always identify exactly the same set of DICOM objects.

The GET /versions API endpoint returns a list of the current and previous IDC data versions.

Original Collections

An original collection is a set of DICOM data provided by a single source. (We usually just use collection to mean original collection.) Such collections consist primarily of DICOM image data obtained from some set of patients. However, some original collections also include annotations, segmentations or other analyses of the image data in the collection. Typically, the patients in a collection are related by a common cancer type, though this is not always the case.

The GET /collections endpoint returns a list of the original collections in the current IDC version, along with some metadata about each collection.

Analysis Results

Analysis results consist of DICOM data generated by analyzing data in one or more original collections. Typically, such analysis is performed by a different entity than the one that provided the original collection(s) on which the analysis is based. Examples of data in analysis results include segmentations, annotations and further processing of original images.

Because a DICOM instance in an analysis result is "in" the same series and study as the DICOM instance data of which it is an analysis result, it is also "in" the same patient, and therefore is considered to be "in" the same collection.

Specifically, each instance in IDC data has an associated collection_id. An analysis result will have the same collection_id as the original collection of which it is an analysis result.

The GET /analysis_results endpoint returns a list of the analysis results, with some metadata, in the current IDC version.

Filter Sets

A filter set is a set of conditions that selects some set of DICOM objects in IDC hosted data. Each condition is defined by an attribute and an array of values. An attribute identifies a field (column) in some data source (BQ table). Each filter set also includes the IDC data version upon which it operates.

Filter sets are JSON encoded. Here is an example filter set:

{
  "filters": {
    "collection_id": [
      "TCGA-LUAD",
      "TCGA-KIRC"
    ],
    "Modality": [
      "CT",
      "MR"
    ],
    "race": [
      "WHITE"
    ],
    "age_at_diagnosis_btw": [
      65,
      75
    ]
  }
}

A filter set selects a DICOM instance if, for every attribute in the filter set, the instance's corresponding value satisfies one or more of the values in the associated array of values. This is explained further below.

For example, the (attribute, [values]) pair ("Modality", ["MR", "CT"]) is satisfied if an instance "has" a Modality of MR or CT.

Note that if a filter set includes more than one (attribute, [values]) pair having the same attribute, then only the last such (attribute, [values]) pair is used. Thus if a filter set includes the (attribute, [values]) pairs ("Modality", ["MR"]) and ("Modality", ["CT"]), in that order, only ("Modality", ["CT"]) is used.
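
Python's standard json module happens to behave the same way when it decodes an object with a repeated key, which makes the rule easy to observe locally. The sketch below only illustrates the "last pair wins" behavior; it says nothing about how the API itself is implemented. The reject_duplicates helper is a hypothetical name, added to show how a client could detect the repetition before sending a filter set.

```python
import json

# A filter set in which "Modality" appears twice (hypothetical input).
raw = '{"filters": {"Modality": ["MR"], "Modality": ["CT"]}}'

# json.loads keeps the last value seen for a repeated object key.
filters = json.loads(raw)["filters"]
print(filters)  # {'Modality': ['CT']}


def reject_duplicates(pairs):
    """object_pairs_hook that raises if an attribute name is repeated."""
    keys = [key for key, _ in pairs]
    duplicates = {key for key in keys if keys.count(key) > 1}
    if duplicates:
        raise ValueError(f"duplicate attributes: {sorted(duplicates)}")
    return dict(pairs)
```

Calling json.loads(raw, object_pairs_hook=reject_duplicates) on the input above raises ValueError instead of silently discarding the first Modality pair.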

The filter set above will select any instance in the current IDC version that is in the TCGA-KIRC or TCGA-LUAD collection. To be selected, an instance must also have a Modality of CT or MR, a race of WHITE, and an age_at_diagnosis value greater than 65 and less than 75.
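
The selection rule can be sketched locally as follows. This is a simplified illustration, not the API's implementation: the function names and the instance records are hypothetical, plain attributes are compared by exact equality (the case-insensitive and pattern matching described under Attributes is left out), and only the btw suffix is handled.

```python
def condition_met(attribute, values, instance):
    """Return True if the instance satisfies one (attribute, [values]) pair."""
    if attribute.endswith("_btw"):
        # Exclusive range: first value < attribute value < second value.
        low, high = values
        actual = instance.get(attribute[: -len("_btw")])
        return actual is not None and low < actual < high
    # Otherwise the instance's value must equal one of the listed values.
    return instance.get(attribute) in values


def selects(filters, instance):
    """A filter set selects an instance only if every condition is met."""
    return all(condition_met(a, v, instance) for a, v in filters.items())


filters = {
    "collection_id": ["TCGA-LUAD", "TCGA-KIRC"],
    "Modality": ["CT", "MR"],
    "race": ["WHITE"],
    "age_at_diagnosis_btw": [65, 75],
}

# Hypothetical instance metadata, invented for this example.
instance_a = {
    "collection_id": "TCGA-LUAD",
    "Modality": "CT",
    "race": "WHITE",
    "age_at_diagnosis": 70,
}
instance_b = dict(instance_a, Modality="PT")        # Modality not in the list
instance_c = dict(instance_a, age_at_diagnosis=75)  # 75 is not less than 75

print(selects(filters, instance_a))  # True
print(selects(filters, instance_b))  # False
print(selects(filters, instance_c))  # False
```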

Because of the hierarchical nature of DICOM, if a filter set selects an instance, it implicitly selects the series, study, patient and collection which contain that instance. A manifest can be configured to return data about some or all of these entities.
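
As a rough illustration of this implicit selection, the sketch below derives the distinct series, studies and patients from a handful of selected instances. The records and the implied helper are hypothetical and the identifier values are placeholders; the key names are the standard DICOM attribute keywords plus the collection_id described above.

```python
# Hypothetical metadata for three selected instances (placeholder values).
selected = [
    {"collection_id": "c1", "PatientID": "p1", "StudyInstanceUID": "st1",
     "SeriesInstanceUID": "se1", "SOPInstanceUID": "i1"},
    {"collection_id": "c1", "PatientID": "p1", "StudyInstanceUID": "st1",
     "SeriesInstanceUID": "se1", "SOPInstanceUID": "i2"},
    {"collection_id": "c1", "PatientID": "p1", "StudyInstanceUID": "st1",
     "SeriesInstanceUID": "se2", "SOPInstanceUID": "i3"},
]


def implied(instances, level):
    """Return the distinct values of `level` among the selected instances."""
    return sorted({instance[level] for instance in instances})


print(implied(selected, "SeriesInstanceUID"))  # ['se1', 'se2']
print(implied(selected, "StudyInstanceUID"))   # ['st1']
print(implied(selected, "PatientID"))          # ['p1']
```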

Note that when defining a cohort through the API, the IDC version is always the current IDC version.

Data Sources

IDC maintains a set of GCP BigQuery (BQ) tables containing various types of metadata that together describe IDC data.

In the context of the API, a data source (or just source) is a BQ table that contains some portion of the metadata against which a filter set is applied. An API query to construct a manifest is performed against one or more such tables as needed.

Attributes

Both the IDC Web App and API expose selected fields against which queries can be performed. The /filters endpoint returns the available filter attributes. The /filters/values/{filter} endpoint returns a list of the values that a specified Categorical String or Categorical Numeric filter attribute will match. Each attribute has a data type, one of:

String: An attribute with data type String may have an arbitrary string value. For example, the possible values of a StudyDescription attribute are arbitrary. An object is selected if its String attribute matches any of the values in the values array. Matching is insensitive to the case (upper case, lower case) of the characters in the strings. Thus ("StudyDescription", ["PETCT Skull-Thigh"]) will match a StudyDescription containing the substring "PETCT SKULL-THIGH", or "petct skull-thigh" etc. Pattern matching in String attributes is also supported. The ("StudyDescription", ["%SKULL%", "ABDOMEN%", "%Pelvis"]) filter will match any StudyDescription that contains "SKULL", "skull", "Skull", etc., starts with "ABDOMEN", "abdomen", etc., or ends with "Pelvis", "PELVIS", etc.
Categorical String: An attribute with data type Categorical String will have one of a defined set of string values. For example, Modality is a Categorical String attribute that has possible values 'CT', 'MR', 'PT', etc. Categorical String attributes have the same matching semantics as Strings. The /filters/values/{filter} endpoint returns a list of the values accepted for a specified Categorical String attribute (filter).
Categorical Numeric: An attribute with data type Categorical Numeric has one of a defined set of numeric values. The corresponding value array must have a single numeric value. The (attribute, value array) pair for a Categorical Numeric is satisfied if the attribute is equal to the value in the value array. The /filters/values/{filter} endpoint returns a list of the values accepted for a Categorical Numeric attribute (filter).
Ranged Integer: An attribute with data type Ranged Integer will have an integer value. For example, age_at_diagnosis is an attribute of data type Ranged Integer. In order to enable relative numeric queries, the API exposes nine variations of each Ranged Integer attribute as filter attribute names. Each variation is the base attribute name followed by an underscore and one of the suffixes eq, gt, gte, btw, btwe, ebtw, ebtwe, lte, or lt, e.g. age_at_diagnosis_eq. The value array of the btw, btwe, ebtw, and ebtwe variations must contain exactly two integer values, in numeric order (least value first). The value array of the eq, gt, gte, lte, and lt variations must contain exactly one integer value. The (attribute, value array) pair for a Ranged Integer attribute is satisfied according to the suffix as follows:
- eq: If an attribute is equal to the value in the value array
- gt: If an attribute is greater than the value in the value array
- gte: If an attribute is greater than or equal to the value in the value array
- btw: If an attribute is greater than the first value and less than the second value in the value array
- ebtw: If an attribute is greater than or equal to the first value and less than the second value in the value array
- btwe: If an attribute is greater than the first value and less than or equal to the second value in the value array
- ebtwe: If an attribute is greater than or equal to the first value and less than or equal to the second value in the value array
- lte: If an attribute is less than or equal to the value in the value array
- lt: If an attribute is less than the value in the value array
Ranged Number: An attribute with data type Ranged Number will have a numeric (integer or float) value. For example, diameter is an attribute of data type Ranged Number. In order to enable relative numeric queries, the API exposes nine variations of each Ranged Number attribute as filter attribute names. Each variation is the base attribute name followed by an underscore and one of the suffixes eq, gt, gte, btw, btwe, ebtw, ebtwe, lte, or lt, e.g. diameter_eq. The value array of the btw, btwe, ebtw, and ebtwe variations must contain exactly two numeric values, in numeric order (least value first). The value array of the eq, gt, gte, lte, and lt variations must contain exactly one numeric value. The (attribute, value array) pair for a Ranged Number attribute is satisfied according to the suffix as follows:
- eq: If an attribute is equal to the value in the value array
- gt: If an attribute is greater than the value in the value array
- gte: If an attribute is greater than or equal to the value in the value array
- btw: If an attribute is greater than the first value and less than the second value in the value array
- ebtw: If an attribute is greater than or equal to the first value and less than the second value in the value array
- btwe: If an attribute is greater than the first value and less than or equal to the second value in the value array
- ebtwe: If an attribute is greater than or equal to the first value and less than or equal to the second value in the value array
- lte: If an attribute is less than or equal to the value in the value array
- lt: If an attribute is less than the value in the value array
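
The matching rules for ranged and String attributes can be sketched in a few lines of Python. This is an illustrative reading of the rules above, not the API's own code; ranged_match and string_match are hypothetical names. Two assumptions are made for String values: % is the only wildcard handled, and a value without % must match the whole string.

```python
import re

# One predicate per suffix; `v` is the attribute value, `b` the value array.
RANGED = {
    "eq": lambda v, b: v == b[0],
    "gt": lambda v, b: v > b[0],
    "gte": lambda v, b: v >= b[0],
    "btw": lambda v, b: b[0] < v < b[1],
    "ebtw": lambda v, b: b[0] <= v < b[1],
    "btwe": lambda v, b: b[0] < v <= b[1],
    "ebtwe": lambda v, b: b[0] <= v <= b[1],
    "lte": lambda v, b: v <= b[0],
    "lt": lambda v, b: v < b[0],
}


def ranged_match(filter_name, bounds, value):
    """Evaluate e.g. ("age_at_diagnosis_btw", [65, 75]) against a value."""
    _, _, suffix = filter_name.rpartition("_")
    expected = 2 if "btw" in suffix else 1  # range suffixes take two values
    if suffix not in RANGED or len(bounds) != expected:
        raise ValueError(f"bad ranged filter: {filter_name} {bounds}")
    return RANGED[suffix](value, bounds)


def string_match(patterns, value):
    """Case-insensitive match where % stands for any run of characters."""
    for pattern in patterns:
        regex = ".*".join(re.escape(part) for part in pattern.split("%"))
        if re.fullmatch(regex, value, flags=re.IGNORECASE | re.DOTALL):
            return True
    return False


print(ranged_match("age_at_diagnosis_btw", [65, 75], 65))   # False
print(ranged_match("age_at_diagnosis_ebtw", [65, 75], 65))  # True
print(string_match(["%SKULL%", "ABDOMEN%", "%Pelvis"], "PETCT Skull-Thigh"))  # True
```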

Cohorts

A cohort is the set of DICOM objects in IDC hosted data selected by a filter set.

The API no longer supports user-defined cohorts. However, the POST /cohorts/manifest/preview endpoint effectively creates a cohort, queries the cohort to obtain a manifest of metadata of the objects in the cohort, and then deletes the cohort. The data in the manifest is highly configurable and can be used, with suitable tools, to obtain DICOM files from cloud storage. A manifest returned by the API can include values from a large set of fields.

Manifests are discussed in the next section.

IDC API UI

The IDC API UI can be used to see details about the syntax of each call, and also provides an interface to test requests. Each endpoint is also documented in the Endpoint Details section.

Make a Request

For a quick demonstration of the syntax of an API call, test the GET /collections request. You can experiment with this endpoint by clicking the 'Try it out' button, and then the 'Execute' button.

The API will return collection metadata for the current IDC data version.
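
The same request can also be made from a script. The following is a minimal sketch using only the Python standard library. The base URL is an assumption and should be checked against the server shown in the IDC API UI; endpoint_url and get_json are hypothetical helper names.

```python
import json
import urllib.request

# Assumed base URL; confirm the actual value in the IDC API UI.
BASE_URL = "https://api.imaging.datacommons.cancer.gov/v2"


def endpoint_url(path, base_url=BASE_URL):
    """Join the base URL and an endpoint path such as "/collections"."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(path, timeout=30):
    """Send a GET request to an endpoint and decode the JSON response body."""
    request = urllib.request.Request(
        endpoint_url(path), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires network access):
#   collections = get_json("/collections")["collections"]
#   print(len(collections))
```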

Request Response

The Swagger UI submits the request and shows the curl command that was submitted. The Response body section will display the response to the request. The expected JSON format of the response to this API request is shown below:

{
  "collections": [
    {
      "cancer_type": "string",
      "collection_id": "string",
      "date_updated": "string",
      "description": "string",
      "doi": "string",
      "image_types": "string",
      "location": "string",
      "species": "string",
      "subject_count": 0,
      "supporting_data": "string"
    }
  ],
  "code": 200
}
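
Once decoded, a response of this shape is an ordinary dictionary. The sketch below builds a small lookup of subject counts by collection. The response body is a hypothetical stand-in with invented placeholder values and only a subset of the fields, and subjects_by_collection is a hypothetical helper name.

```python
import json

# Hypothetical response body following the schema above; the values are
# invented placeholders, not real IDC metadata.
body = """
{
  "collections": [
    {"collection_id": "collection_a", "cancer_type": "type_a", "subject_count": 12},
    {"collection_id": "collection_b", "cancer_type": "type_b", "subject_count": 7}
  ],
  "code": 200
}
"""


def subjects_by_collection(response):
    """Map each collection_id to its subject_count."""
    if response.get("code") != 200:
        raise ValueError(f"unexpected response code: {response.get('code')}")
    return {
        item["collection_id"]: item["subject_count"]
        for item in response["collections"]
    }


counts = subjects_by_collection(json.loads(body))
print(counts)  # {'collection_a': 12, 'collection_b': 7}
```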

The actual JSON formatted response can be downloaded to your local file system by clicking the 'Download' button.
