[FEATURE REQUEST] MD5 checksum verification for sra-download workflow
The need
While using the sra-download workflow (short description and code) to download some publicly available metagenomes, I realized that the current workflow does not verify MD5 checksums to check for file corruption.
To ensure data integrity, I manually checked the MD5 checksums of the files against the ENA metadata checksums (see Additional Info [1]). The checksums did not match, raising concerns about potential file corruption. While investigating, I reached out to @mschecht, who developed the workflow.
The issue is that the sra-download workflow downloads an uncompressed .sra archive (which holds the FASTQ files), extracts the reads, and gzips them after the download. The MD5 checksums from ENA, however, are for the .fastq.gz files as ENA serves them. Decompressing and recompressing changes the checksum (see Additional Info [2]), making the checksums provided by ENA useless here.
Further investigation led me to this issue on obtaining MD5 checksums from NCBI: https://github.com/ncbi/sra-tools/issues/531. These checksums are for the .sra files and should suffice to ensure the files were not corrupted during download.
After discussing with @mschecht, we plan to include an MD5 checksum check in the workflow for the .sra files stored in the temporary 01_NCBI_SRA directory before extracting the .fastq files.
The solution
Integrate a new rule with an MD5 checksum verification step in the sra-download workflow to check the .sra files in the temp directory before fasterq-dump extracts the FASTQ files.
For this, we could do something along the following lines (a combined end-to-end sketch appears after the list):
- Use curl to retrieve the MD5 checksums and accession numbers from the NCBI API:

  #!/bin/bash
  # Read each line from the accession_numbers.txt file
  while IFS= read -r accession; do
      # Construct the curl command using the accession number
      curl "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?filetype=run&location-type=forced&location=s3.us-east-1&accept-charges=aws&acc=${accession}" -o "${accession}.json"
  done < accession_numbers.txt

  The .json files look like this:

  {
    "version": "2",
    "result": [
      {
        "bundle": "ERR6450080",
        "status": 200,
        "msg": "ok",
        "files": [
          {
            "object": "srapub|ERR6450080.lite",
            "accession": "ERR6450080",
            "type": "sra",
            "name": "ERR6450080.lite",
            "size": 18201770,
            "md5": "af47d427645978867adef6a13a439840",
            "modificationDate": "2022-10-03T20:52:37Z",
            "noqual": true,
            "locations": [
              {
                "service": "gs",
                "region": "us-east1",
                "payRequired": true,
                "bucket": "sra-pub-zq-106",
                "key": "ERR6450080/ERR6450080.lite.1",
                "link": "https://storage.googleapis.com/sra-pub-zq-106/ERR6450080/ERR6450080.lite.1"
              }
            ]
          }
        ]
      }
    ]
  }

- Extract the MD5 checksums and accession numbers from the .json files:

  #!/bin/bash
  # Output file for checksums
  output_file="checksums.txt"
  # Clear the output file if it exists
  > $output_file
  # Loop through each JSON file
  for json_file in *.json; do
      # Extract accession number and md5 checksum from the JSON file
      accession=$(jq -r '.result[0].bundle' "$json_file")
      md5=$(jq -r '.result[0].files[0].md5' "$json_file")
      # Append the extracted information to the output file
      echo "$accession $md5" >> $output_file
  done

- Compare the MD5 checksums with the checksums of the downloaded .sra files:

  #!/bin/bash
  # Read the checksums file and compare md5 checksums of .sra files
  while IFS=' ' read -r accession md5_expected; do
      # Compute the md5 checksum of the .sra file
      md5_actual=$(md5sum "$accession.sra" | awk '{ print $1 }')
      # Compare the checksums
      if [ "$md5_expected" == "$md5_actual" ]; then
          echo "$accession: Checksums match"
      else
          echo "$accession: Checksums do not match"
      fi
  done < checksums.txt
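Put together, the three steps could run as a single script over the temporary directory before fasterq-dump is called. Below is a minimal sketch, not the final rule: the verify_sra_md5 function name is made up, and it assumes the .sra files are stored as 01_NCBI_SRA/<accession>.sra:

#!/bin/bash
set -euo pipefail

# Fetch the expected MD5 from the NCBI locate API, hash the local .sra file,
# and fail loudly on any mismatch, so that a workflow rule wrapping this
# would abort before fasterq-dump runs on a corrupt file.
verify_sra_md5 () {
    local accession=$1
    local sra_file="01_NCBI_SRA/${accession}.sra"   # assumed layout, see lead-in

    local md5_expected
    md5_expected=$(curl -s "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?filetype=run&location-type=forced&location=s3.us-east-1&accept-charges=aws&acc=${accession}" \
        | jq -r '.result[0].files[0].md5')

    local md5_actual
    md5_actual=$(md5sum "$sra_file" | awk '{ print $1 }')

    if [ "$md5_expected" != "$md5_actual" ]; then
        echo "ERROR: MD5 mismatch for ${accession} (expected ${md5_expected}, got ${md5_actual})" >&2
        return 1
    fi
    echo "${accession}: OK"
}

while IFS= read -r accession; do
    verify_sra_md5 "$accession"
done < accession_numbers.txt

Because the function returns non-zero on a mismatch (and the script runs with set -e), the whole step fails as soon as a corrupt file is found.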
Beneficiaries
Anyone using this workflow to download nucleotide sequence data from NCBI SRA who wants to ensure that their files have not been corrupted during downloading.
Additional info
[1] How to get ENA metadata with MD5 checksums
Use the following script to download the metadata (including MD5 checksums) for projects from ENA:
import requests
import os
# Define the list of project accessions
project_accessions = ['PRJEB1787', 'PRJEB8682', 'PRJNA656268', 'PRJNA385854', 'PRJEB52452', 'PRJNA385855', 'PRJNA16339', 'PRJNA352737', 'PRJEB2064']
# Define the API endpoint for the project metadata
url = "https://www.ebi.ac.uk/ena/portal/api/search"
# Define the common parameters for the API request
fields = 'accession,altitude,assembly_quality,assembly_software,base_count,binning_software,bio_material,broker_name,cell_line,cell_type,center_name,checklist,collected_by,collection_date,completeness_score,contamination_score,country,cultivar,culture_collection,depth,description,dev_stage,ecotype,elevation,environment_biome,environment_feature,environment_material,environmental_sample,experiment_accession,experiment_alias,experiment_title,experimental_factor,fastq_aspera,fastq_bytes,fastq_ftp,fastq_galaxy,fastq_md5,first_created,first_public,germline,host,host_body_site,host_genotype,host_gravidity,host_growth_conditions,host_phenotype,host_sex,host_status,host_tax_id,identified_by,instrument_model,instrument_platform,investigation_type,isolate,isolation_source,last_updated,lat,library_layout,library_name,library_selection,library_source,library_strategy,location,lon,mating_type,nominal_length,nominal_sdev,ph,project_name,protocol_label,read_count,run_accession,run_alias,salinity,sample_accession,sample_alias,sample_collection,sample_description,sample_material,sample_title,sampling_campaign,sampling_platform,sampling_site,scientific_name,secondary_sample_accession,secondary_study_accession,sequencing_method,serotype,serovar,sex,specimen_voucher,sra_aspera,sra_bytes,sra_ftp,sra_galaxy,sra_md5,strain,study_accession,study_alias,study_title,sub_species,sub_strain,submission_accession,submitted_aspera,submitted_bytes,submitted_format,submitted_ftp,submitted_galaxy,submitted_host_sex,submitted_md5,target_gene,tax_id,taxonomic_classification,taxonomic_identity_marker,temperature,tissue_lib,tissue_type,variety'
common_params = {
'result': 'read_run',
'fields': fields,
'format': 'tsv',
'limit': '0'
}
# Loop through each project accession
for project_accession in project_accessions:
# Update the query parameter for each project
params = common_params.copy()
params['query'] = f'study_accession={project_accession}'
# Send the API request
response = requests.get(url, params=params)
# Check if the request was successful
if response.status_code == 200:
# Define the path to the output file
        # The API returns TSV (per the 'format' parameter), so use a .tsv extension
        output_file_path = os.path.join('..', 'data', f'{project_accession}_metadata.tsv')
# Save the metadata to the specified file
with open(output_file_path, 'w') as file:
file.write(response.text)
print(f"Metadata for {project_accession} has been successfully downloaded and saved to '{output_file_path}'.")
else:
print(f"Failed to retrieve metadata for {project_accession}: {response.status_code}")
print(response.text)
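Once the tables are downloaded, the handful of columns that matter for checksum verification can be pulled out of the (very wide) TSV by header name; a small sketch, using one of the files written by the script above:

#!/bin/bash
# Print run accessions with their FASTQ MD5 checksums from the ENA metadata table.
# Note: fastq_md5 holds semicolon-separated values, one per file listed in fastq_ftp.
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    { print $col["run_accession"], $col["fastq_md5"] }
' PRJEB1787_metadata.tsv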
[2] Investigation on unzipping and re-zipping changing checksums
The sra-download workflow downloads the unzipped files and then zips them. The MD5 checksums in the metadata file are for the zipped files. The question is whether unzipping and re-zipping changes the checksum. Let's test it:
ls
# SRR5787991_1.fastq.gz SRR5787991_2.fastq.gz ftp-links-for-raw-data-files.txt
md5sum *.gz
# b47f9d6458b84edacc9b1608eb020732 SRR5787991_1.fastq.gz
# d8bb8f1726bb7173a359be928c38bdba SRR5787991_2.fastq.gz
gunzip *_1.fastq.gz
ls
# SRR5787991_1.fastq SRR5787991_2.fastq.gz ftp-links-for-raw-data-files.txt
gzip *fastq
ls
# SRR5787991_1.fastq.gz SRR5787991_2.fastq.gz ftp-links-for-raw-data-files.txt
md5sum *_1.fastq.gz
# e623cf7b937d449a834364194cf4e795 SRR5787991_1.fastq.gz
MD5 checksums change due to unzipping and re-zipping: gzip records metadata such as the original file name and modification time in its header, so recompressing identical data yields a different file. Hence the checksums provided by ENA are not applicable to the re-zipped files, while using the MD5 checksums of the .sra files ensures integrity.
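Since only the gzip container changes, a quick way to confirm that the underlying reads are untouched is to hash the decompressed stream, where the gzip header metadata plays no role (using the same file as above):

# Hash the decompressed stream instead of the .gz container;
# this value is stable across gunzip/gzip round trips.
zcat SRR5787991_1.fastq.gz | md5sum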
Alternative to using SRA
As a sanity check (and to exclude the possibility that the MD5 checksums I got from ENA are incorrect), I tried a different method for downloading some of the data, using wget with the FTP links to each run. For these downloads, the MD5 checksums matched.
This is how one could use the ENA API and wget to avoid relying on the SRA toolkit.
One needs the FTP links (which you get in [1]), added to a file such as ftp-links-for-raw-data-files.txt, which could look like:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315858/ERR315858_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315858/ERR315858_2.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315861/ERR315861_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315861/ERR315861_2.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315862/ERR315862_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315862/ERR315862_2.fastq.gz
and then you could simply iterate through it:
while IFS= read -r url
do
    wget "$url"
done < ftp-links-for-raw-data-files.txt
Then compute the MD5 checksums of all downloaded files with md5sum *.fastq.gz and compare them against the MD5 checksums given in the ENA metadata (note that ENA's fastq_md5 values are for the gzipped files as served).
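The comparison step can also lean on md5sum's built-in check mode. Below is a minimal sketch, assuming the metadata from [1]; the checksum file name ena-md5s.md5 is made up, and the reshaping relies on fastq_md5 listing its semicolon-separated values in the same order as the URLs in fastq_ftp:

#!/bin/bash
# Reshape the ENA metadata into a coreutils-style checksum file: "<md5>  <filename>"
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $col["fastq_ftp"] != "" {
        n = split($col["fastq_ftp"], urls, ";")
        split($col["fastq_md5"], md5s, ";")
        for (j = 1; j <= n; j++) {
            k = split(urls[j], parts, "/")  # keep only the file name from the URL
            print md5s[j] "  " parts[k]
        }
    }
' PRJEB1787_metadata.tsv > ena-md5s.md5

# Verify all downloaded files in one pass; exits non-zero on any mismatch
md5sum -c ena-md5s.md5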
And while this is a very viable option for which we could also write a workflow, it would be nice to have the SRA option as well.