Downloading data with s5cmd
Make sure you first review the Downloading data section to learn about the simpler interfaces that provide access to IDC data.
SlicerIDCBrowser and idc-index, discussed in the previous section, aim to provide simple interfaces for data access. In some situations, however, you may want to build cohorts using metadata attributes that are not exposed in those tools. In such cases you will need to use the BigQuery interface to form your cohort and build a file manifest that you can then use with s5cmd to download the files.
With this approach you will follow a 2-step process covered on this page:

Step 1: create a manifest - a list of the storage bucket URLs of the files to be downloaded. If you want to download the content of a cohort defined in the IDC Portal, export the s5cmd manifest first, and proceed to Step 2. Alternatively, you can use BigQuery SQL as discussed below to generate the manifest;

Step 2: given the manifest, download the files to your computer or to a cloud VM using the s5cmd command line tool.
To learn more about using Google BigQuery SQL with IDC, check out part 3 of our "Getting started" tutorial series, which demonstrates how to query and download IDC data!
Step 1: Create the manifest
You will need to complete the prerequisites described in Getting started with GCP in order to be able to execute the manifest generation queries below!
A download manifest can be created using either the IDC Portal, or by executing a BQ query. If you have generated a manifest using the IDC Portal, as discussed here, proceed to Step 2! In the remainder of this section we describe creating a manifest from a BigQuery query.
The dicom_all BigQuery table discussed in this documentation article can be used to subset the files you need based on DICOM metadata attributes, using the SQL query interface. The gcs_url and aws_url columns contain Google Cloud Storage and AWS S3 URLs, respectively, that can be used to retrieve the files.
Start with the query templates provided below, modify them based on your needs, and save the result in a file query.txt. The specific values for PatientID, SeriesInstanceUID, StudyInstanceUID are chosen to serve as examples.
You can use IDC Portal to identify items of interest, or you can use SQL queries to subset your data using any of the DICOM attributes. You are encouraged to use the BigQuery console to test your queries and explore the data first!
The queries below demonstrate how to get the AWS S3 URLs of the cohort files to download.
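For instance, the following is a minimal sketch of such a template, assuming the series_aws_url column holds the series-level S3 folder URL; the PatientID value is only an example, and you may need to adjust the wildcard if the column value already ends with "/" or "/*":

```sql
-- A sketch of a manifest query: one "cp <series folder>/* ." line per series.
-- Any DICOM attribute in dicom_all (e.g., StudyInstanceUID, SeriesInstanceUID)
-- can be used in the WHERE clause instead of PatientID.
SELECT
  DISTINCT(CONCAT("cp ", series_aws_url, "/* .")) AS s5cmd_command
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  PatientID = "LIDC-IDRI-0001"
```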
If you want to download the files corresponding to the cohort from GCP instead of AWS, substitute series_gcp_url for series_aws_url in the SELECT statement of the query, such as in the following SELECT clause (a sketch mirroring the example above, with series_gcp_url assumed to follow the same format):
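```sql
-- Assumes series_gcp_url is the series-level folder URL in the GCS bucket
SELECT
  DISTINCT(CONCAT("cp ", series_gcp_url, "/* .")) AS s5cmd_command
```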
Next, use the Google Cloud SDK bq query command (from the command line) to run the query and save the result into a manifest file, which will contain the list of storage bucket URLs that can be used to download the data.
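A minimal sketch of such a command, assuming the query from the previous step is saved in query.txt and the manifest is written to manifest.txt (both file names are examples):

```bash
# Run the SQL saved in query.txt and write the result to manifest.txt.
# --format=csv emits a header row, which tail strips off.
bq query --use_legacy_sql=false --format=csv --max_rows=20000000 \
  < query.txt | tail -n +2 > manifest.txt
```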
Make sure you adjust the --max_rows parameter in the command above to equal or exceed the number of rows in the result of the query, otherwise your list will be truncated!
For any of the queries, you can get the count of rows to confirm that the --max_rows parameter is sufficiently large (use the BigQuery console to run these queries):
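A sketch of such a count, reusing the example WHERE clause from above:

```sql
-- Count the rows the manifest query would return
SELECT
  COUNT(DISTINCT(series_aws_url)) AS row_count
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  PatientID = "LIDC-IDRI-0001"
```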
You can also get the total disk space that will be needed for the files that you will be downloading:
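A sketch, assuming the instance_size column of dicom_all holds the per-file size in bytes:

```sql
-- Total download size in GiB for the example cohort
SELECT
  ROUND(SUM(instance_size) / POW(2, 30), 2) AS total_GiB
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  PatientID = "LIDC-IDRI-0001"
```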
Step 2: Download the files defined by the manifest
s5cmd is a very fast S3 and local filesystem execution tool that can be used for accessing IDC buckets and downloading files from both GCS and AWS.

Install s5cmd following the instructions in https://github.com/peak/s5cmd#installation, or if you have Python pip on your system you can just do pip install s5cmd --upgrade.
You can verify that your setup was successful by running the following command; it should successfully download one file from IDC.
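A minimal sketch of such a check, assuming the public idc-open-data bucket on AWS: the ls line confirms anonymous access works, and the commented cp line shows how to fetch a single file once a real URL from your manifest is substituted for the placeholder path:

```bash
# Anonymous access to the public IDC bucket; no AWS credentials needed
s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com ls s3://idc-open-data/

# To download one file, substitute a real URL from your manifest:
# s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com cp s3://idc-open-data/<series-uuid>/<instance-uuid>.dcm .
```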
Once s5cmd is installed, you can use the s5cmd run command to download the files corresponding to the manifest.
If you defined a manifest that references AWS buckets:
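For example, assuming the manifest from Step 1 is saved as manifest.txt:

```bash
# --no-sign-request enables anonymous access to the public IDC buckets
s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com run manifest.txt
```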
If you defined a manifest that references GCP buckets, you will need to specify the GCS endpoint:
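Again assuming a manifest named manifest.txt; the endpoint URL points s5cmd at the S3-compatible GCS API:

```bash
s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run manifest.txt
```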