Downloading data with s5cmd

Make sure you first review the Downloading data section to learn about the simpler interfaces that provide access to IDC data.

SlicerIDCBrowser and idc-index, discussed in the previous section, aim to provide simple interfaces for data access. In some situations, however, you may want to build cohorts using metadata attributes that are not exposed in those tools. In such cases you will need to use the BigQuery interface to define your cohort and build a file manifest that you can then use with s5cmd to download the files.

With this approach you will follow a two-step process covered on this page:

  • Step 1: create a manifest - a list of the storage bucket URLs of the files to be downloaded. If you want to download the content of a cohort defined in the IDC Portal, export the s5cmd manifest first and proceed to Step 2. Alternatively, you can use BigQuery SQL as discussed below to generate the manifest;

  • Step 2: given the manifest, download files to your computer or to a cloud VM using the s5cmd command line tool.

To learn more about using Google BigQuery SQL with IDC, check out part 3 of our "Getting started" tutorial series, which demonstrates how to query and download IDC data!

Step 1: Create the manifest

You will need to complete prerequisites described in Getting started with GCP in order to be able to execute the manifest generation queries below!

A download manifest can be created using either the IDC Portal, or by executing a BQ query. If you have generated a manifest using the IDC Portal, as discussed here, proceed to Step 2! In the remainder of this section we describe creating a manifest from a BigQuery query.

The dicom_all BigQuery table discussed in this documentation article can be used to subset the files you need based on DICOM metadata attributes, using the SQL query interface. The gcs_url and aws_url columns contain Google Cloud Storage and AWS S3 URLs, respectively, that can be used to retrieve the files.
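
To get a feel for these columns, you can preview a few values in the BigQuery console (a minimal example; the LIMIT is arbitrary):

# preview the storage URL columns
SELECT gcs_url, aws_url
FROM `bigquery-public-data.idc_current.dicom_all`
LIMIT 5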

Start with the query templates provided below, modify them based on your needs, and save the resulting query in a file named query.txt. The specific values for PatientID, SeriesInstanceUID, StudyInstanceUID are chosen to serve as examples.

You can use IDC Portal to identify items of interest, or you can use SQL queries to subset your data using any of the DICOM attributes. You are encouraged to use the BigQuery console to test your queries and explore the data first!

The queries below demonstrate how to build the list of s5cmd commands to download the cohort files from AWS S3. Note the "cp " prefix: the manifest consumed by s5cmd run in Step 2 must contain complete s5cmd commands.

# Select all files for a given PatientID
SELECT DISTINCT(CONCAT("cp ", series_aws_url, "* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE PatientID = "LUNG1-001"
# Select all files for a given collection
SELECT DISTINCT(CONCAT("cp ", series_aws_url, "* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = "nsclc_radiomics"
# Select all files for a given DICOM series
SELECT DISTINCT(CONCAT("cp ", series_aws_url, "* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE SeriesInstanceUID = "1.3.6.1.4.1.32722.99.99.298991776521342375010861296712563382046"
# Select all files for a given DICOM study
SELECT DISTINCT(CONCAT("cp ", series_aws_url, "* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE StudyInstanceUID = "1.3.6.1.4.1.32722.99.99.239341353911714368772597187099978969331"
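
For reference, each line of the resulting manifest should be a complete s5cmd cp command along the lines of the following (the series UUID here is illustrative):

cp s3://idc-open-data/0000a0bd-628f-4a36-9db5-d1ad51dd6bbb/* .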

If you want to download the files corresponding to the cohort from GCP instead of AWS, replace series_aws_url with series_gcs_url in the SELECT statement of the query, as in the following SELECT clause:

SELECT DISTINCT(CONCAT("cp ", series_gcs_url, "* ."))
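
Putting this together, a complete GCS variant of the collection query above would look as follows (a sketch, assuming the series_gcs_url column mirrors series_aws_url, including the trailing slash):

# Select all files for a given collection, with GCS URLs
SELECT DISTINCT(CONCAT("cp ", series_gcs_url, "* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = "nsclc_radiomics"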

Next, use the Google Cloud SDK bq query command (from the command line) to run the query and save the result into a manifest file, which will contain the list of s5cmd commands that can be used to download the data.

bq query --use_legacy_sql=false --format=csv --max_rows=20000000 < query.txt > manifest.txt
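
Note that with --format=csv the first line of the output is a column header (for the queries above, an auto-generated name such as f0_) rather than an s5cmd command. A quick way to inspect the manifest and, if needed, drop that header row (assuming a POSIX shell; file names are illustrative):

# inspect the first few manifest lines
head -n 3 manifest.txt
# drop the CSV header row, keeping everything from line 2 onward
tail -n +2 manifest.txt > manifest_clean.txt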

Make sure you adjust the --max_rows parameter in the command above to equal or exceed the number of rows in the result of the query, otherwise your list will be truncated!

For any of the queries, you can get the count of rows to confirm that the --max_rows parameter is sufficiently large (use the BigQuery console to run these queries):

# count the number of rows
SELECT COUNT(DISTINCT(crdc_series_uuid)) 
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = "nsclc_radiomics"
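
If you prefer the command line, the same count can be run with bq directly (a sketch; note the single quotes around the query in a POSIX shell):

bq query --use_legacy_sql=false \
  'SELECT COUNT(DISTINCT(crdc_series_uuid)) FROM `bigquery-public-data.idc_current.dicom_all` WHERE collection_id = "nsclc_radiomics"'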

You can also get the total disk space that will be needed for the files that you will be downloading:

# calculate the disk size in GB needed for the files to be downloaded
SELECT ROUND(SUM(instance_size)/POW(1024,3),2) as size_GB 
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = "nsclc_radiomics"

Step 2: Download the files defined by the manifest

s5cmd is a very fast S3 and local filesystem execution tool that can be used to access IDC buckets and download files from both GCS and AWS.

Install s5cmd following the instructions in https://github.com/peak/s5cmd#installation, or, if you have Python pip on your system, you can simply run pip install s5cmd --upgrade.

You can verify that your setup was successful by running the following command; it should download a single file from IDC.

s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com cp s3://public-datasets-idc/cdac3f73-4fc9-4e0d-913b-b64aa3100977/902b4588-6f10-4342-9c80-f1054e67ee83.dcm .

Once s5cmd is installed, you can use the s5cmd run command to download the files corresponding to the manifest.

If your manifest references AWS buckets:

s5cmd --no-sign-request --endpoint-url=https://s3.amazonaws.com run manifest_file_name

If your manifest references GCP buckets, you will need to specify the GCS endpoint:

s5cmd --no-sign-request --endpoint-url=https://storage.googleapis.com run manifest_file_name
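
Since each manifest line copies files to ".", s5cmd will download into the current working directory; run it from the directory where you want the data to land. For example, for an AWS manifest (directory and file names are illustrative):

mkdir -p idc_download && cd idc_download
s5cmd --no-sign-request --endpoint-url=https://s3.amazonaws.com run ../manifest.txt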
