[FEATURE REQUEST] MD5 checksum verification for sra-download workflow
The need
While using the sra-download workflow (short description and code) to download some publicly available metagenomes, I realized that the current workflow does not verify MD5 checksums to check for file corruption.
To ensure data integrity, I manually checked the MD5 checksums of the files against the ENA metadata checksums (see Additional Info [1]). The checksums did not match, raising concerns about potential file corruption. While investigating, I reached out to @mschecht, who developed the workflow.
The issue is that the sra-download workflow downloads an uncompressed .sra archive (which holds the FASTQ files), extracts the reads, and gzips them after the download. The MD5 checksums from ENA, however, are for the .fastq.gz files as ENA serves them. Decompressing and recompressing changes the checksum (see Additional Info [2]), making the checksums provided by ENA useless here.
Further investigation led me to this issue on obtaining MD5 checksums from NCBI: https://github.com/ncbi/sra-tools/issues/531. These checksums are for the .sra files and should suffice to ensure the files were not corrupted during download.
After discussing with @mschecht, we plan to include an MD5 checksum check in the workflow for the .sra files stored in the temporary 01_NCBI_SRA directory before extracting the .fastq files.
The solution
Integrate a new rule with an MD5 checksum verification step in the sra-download workflow to check the .sra files in the temp directory before fasterq-dump extracts the FASTQ files.
For this, we could do something along the following lines (a combined end-to-end sketch appears after the list):
- Use curl to retrieve the MD5 checksums and accession numbers from the NCBI API:

  #!/bin/bash
  # Read each line from the accession_numbers.txt file
  while IFS= read -r accession; do
      # Construct the curl command using the accession number
      curl "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?filetype=run&location-type=forced&location=s3.us-east-1&accept-charges=aws&acc=${accession}" -o "${accession}.json"
  done < accession_numbers.txt

  The .json files look like this:

  {
    "version": "2",
    "result": [
      {
        "bundle": "ERR6450080",
        "status": 200,
        "msg": "ok",
        "files": [
          {
            "object": "srapub|ERR6450080.lite",
            "accession": "ERR6450080",
            "type": "sra",
            "name": "ERR6450080.lite",
            "size": 18201770,
            "md5": "af47d427645978867adef6a13a439840",
            "modificationDate": "2022-10-03T20:52:37Z",
            "noqual": true,
            "locations": [
              {
                "service": "gs",
                "region": "us-east1",
                "payRequired": true,
                "bucket": "sra-pub-zq-106",
                "key": "ERR6450080/ERR6450080.lite.1",
                "link": "https://storage.googleapis.com/sra-pub-zq-106/ERR6450080/ERR6450080.lite.1"
              }
            ]
          }
        ]
      }
    ]
  }

- Extract the MD5 checksums and accession numbers from the .json files:

  #!/bin/bash
  # Output file for checksums
  output_file="checksums.txt"
  # Clear the output file if it exists
  > $output_file
  # Loop through each JSON file
  for json_file in *.json; do
      # Extract accession number and md5 checksum from the JSON file
      accession=$(jq -r '.result[0].bundle' "$json_file")
      md5=$(jq -r '.result[0].files[0].md5' "$json_file")
      # Append the extracted information to the output file
      echo "$accession $md5" >> $output_file
  done

- Compare the MD5 checksums with the checksums of the downloaded .sra files:

  #!/bin/bash
  # Read the checksums file and compare md5 checksums of .sra files
  while IFS=' ' read -r accession md5_expected; do
      # Compute the md5 checksum of the .sra file
      md5_actual=$(md5sum "$accession.sra" | awk '{ print $1 }')
      # Compare the checksums
      if [ "$md5_expected" == "$md5_actual" ]; then
          echo "$accession: Checksums match"
      else
          echo "$accession: Checksums do not match"
      fi
  done < checksums.txt
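Put together, the three steps could run as a single script over the temporary directory before fasterq-dump is called. Below is a minimal sketch, not the final rule: the verify_sra_md5 function name is made up, and it assumes the .sra files are stored as 01_NCBI_SRA/<accession>.sra:

#!/bin/bash
set -euo pipefail

# Fetch the expected MD5 from the NCBI locate API, hash the local .sra file,
# and fail loudly on any mismatch, so that a workflow rule wrapping this
# would abort before fasterq-dump runs on a corrupt file.
verify_sra_md5 () {
    local accession=$1
    local sra_file="01_NCBI_SRA/${accession}.sra"   # assumed layout, see lead-in

    local md5_expected
    md5_expected=$(curl -s "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?filetype=run&location-type=forced&location=s3.us-east-1&accept-charges=aws&acc=${accession}" \
        | jq -r '.result[0].files[0].md5')

    local md5_actual
    md5_actual=$(md5sum "$sra_file" | awk '{ print $1 }')

    if [ "$md5_expected" != "$md5_actual" ]; then
        echo "ERROR: MD5 mismatch for ${accession} (expected ${md5_expected}, got ${md5_actual})" >&2
        return 1
    fi
    echo "${accession}: OK"
}

while IFS= read -r accession; do
    verify_sra_md5 "$accession"
done < accession_numbers.txt

Because the function returns non-zero on a mismatch (and the script runs with set -e), the whole step fails as soon as a corrupt file is found.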
Beneficiaries
Anyone using this workflow to download nucleotide sequence data from NCBI SRA who wants to ensure that their files have not been corrupted during downloading.
Additional info
[1] How to get ENA metadata with MD5 checksums
Use the following script to download the metadata (including MD5 checksums) for projects from ENA:
import requests
import os
# Define the list of project accessions
project_accessions = ['PRJEB1787', 'PRJEB8682', 'PRJNA656268', 'PRJNA385854', 'PRJEB52452', 'PRJNA385855', 'PRJNA16339', 'PRJNA352737', 'PRJEB2064']
# Define the API endpoint for the project metadata
url = "https://www.ebi.ac.uk/ena/portal/api/search"
# Define the common parameters for the API request
fields = 'accession,altitude,assembly_quality,assembly_software,base_count,binning_software,bio_material,broker_name,cell_line,cell_type,center_name,checklist,collected_by,collection_date,completeness_score,contamination_score,country,cultivar,culture_collection,depth,description,dev_stage,ecotype,elevation,environment_biome,environment_feature,environment_material,environmental_sample,experiment_accession,experiment_alias,experiment_title,experimental_factor,fastq_aspera,fastq_bytes,fastq_ftp,fastq_galaxy,fastq_md5,first_created,first_public,germline,host,host_body_site,host_genotype,host_gravidity,host_growth_conditions,host_phenotype,host_sex,host_status,host_tax_id,identified_by,instrument_model,instrument_platform,investigation_type,isolate,isolation_source,last_updated,lat,library_layout,library_name,library_selection,library_source,library_strategy,location,lon,mating_type,nominal_length,nominal_sdev,ph,project_name,protocol_label,read_count,run_accession,run_alias,salinity,sample_accession,sample_alias,sample_collection,sample_description,sample_material,sample_title,sampling_campaign,sampling_platform,sampling_site,scientific_name,secondary_sample_accession,secondary_study_accession,sequencing_method,serotype,serovar,sex,specimen_voucher,sra_aspera,sra_bytes,sra_ftp,sra_galaxy,sra_md5,strain,study_accession,study_alias,study_title,sub_species,sub_strain,submission_accession,submitted_aspera,submitted_bytes,submitted_format,submitted_ftp,submitted_galaxy,submitted_host_sex,submitted_md5,target_gene,tax_id,taxonomic_classification,taxonomic_identity_marker,temperature,tissue_lib,tissue_type,variety'
common_params = {
'result': 'read_run',
'fields': fields,
'format': 'tsv',
'limit': '0'
}
# Loop through each project accession
for project_accession in project_accessions:
# Update the query parameter for each project
params = common_params.copy()
params['query'] = f'study_accession={project_accession}'
# Send the API request
response = requests.get(url, params=params)
# Check if the request was successful
if response.status_code == 200:
# Define the path to the output file
        # The API returns TSV (per the 'format' parameter), so use a .tsv extension
        output_file_path = os.path.join('..', 'data', f'{project_accession}_metadata.tsv')
# Save the metadata to the specified file
with open(output_file_path, 'w') as file:
file.write(response.text)
print(f"Metadata for {project_accession} has been successfully downloaded and saved to '{output_file_path}'.")
else:
print(f"Failed to retrieve metadata for {project_accession}: {response.status_code}")
print(response.text)
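Once the tables are downloaded, the handful of columns that matter for checksum verification can be pulled out of the (very wide) TSV by header name; a small sketch, using one of the files written by the script above:

#!/bin/bash
# Print run accessions with their FASTQ MD5 checksums from the ENA metadata table.
# Note: fastq_md5 holds semicolon-separated values, one per file listed in fastq_ftp.
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    { print $col["run_accession"], $col["fastq_md5"] }
' PRJEB1787_metadata.tsv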
[2] Investigation on unzipping and re-zipping changing checksums
The sra-download workflow downloads the unzipped files and then zips them. The MD5 checksums in the metadata file are for the zipped files. The question is whether unzipping and re-zipping changes the checksum. Let's test it:
ls
# SRR5787991_1.fastq.gz SRR5787991_2.fastq.gz ftp-links-for-raw-data-files.txt
md5sum *.gz
# b47f9d6458b84edacc9b1608eb020732 SRR5787991_1.fastq.gz
# d8bb8f1726bb7173a359be928c38bdba SRR5787991_2.fastq.gz
gunzip *_1.fastq.gz
ls
# SRR5787991_1.fastq SRR5787991_2.fastq.gz ftp-links-for-raw-data-files.txt
gzip *fastq
ls
# SRR5787991_1.fastq.gz SRR5787991_2.fastq.gz ftp-links-for-raw-data-files.txt
md5sum *_1.fastq.gz
# e623cf7b937d449a834364194cf4e795 SRR5787991_1.fastq.gz
MD5 checksums change due to unzipping and re-zipping: gzip records metadata such as the original file name and modification time in its header, so recompressing identical data yields a different file. Hence the checksums provided by ENA are not applicable to the re-zipped files, while using the MD5 checksums of the .sra files ensures integrity.
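Since only the gzip container changes, a quick way to confirm that the underlying reads are untouched is to hash the decompressed stream, where the gzip header metadata plays no role (using the same file as above):

# Hash the decompressed stream instead of the .gz container;
# this value is stable across gunzip/gzip round trips.
zcat SRR5787991_1.fastq.gz | md5sum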
Alternative to using SRA
As a sanity check (and to exclude the possibility that the MD5 checksums I got from ENA are incorrect), I tried a different method for downloading some of the data, using wget with the FTP links to each run. For these downloads, the MD5 checksums matched.
This is how one could use the ENA API and wget to avoid relying on the SRA toolkit.
One needs the FTP links (which you get in [1]), added to a file such as ftp-links-for-raw-data-files.txt, which could look like:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315858/ERR315858_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315858/ERR315858_2.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315861/ERR315861_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315861/ERR315861_2.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315862/ERR315862_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315862/ERR315862_2.fastq.gz
and then you could simply iterate through it:
while IFS= read -r url
do
    wget "$url"
done < ftp-links-for-raw-data-files.txt
Then compute the MD5 checksums of all downloaded files with md5sum *.fastq.gz and compare them against the MD5 checksums given in the ENA metadata (note that ENA's fastq_md5 values are for the gzipped files as served).
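The comparison step can also lean on md5sum's built-in check mode. Below is a minimal sketch, assuming the metadata from [1]; the checksum file name ena-md5s.md5 is made up, and the reshaping relies on fastq_md5 listing its semicolon-separated values in the same order as the URLs in fastq_ftp:

#!/bin/bash
# Reshape the ENA metadata into a coreutils-style checksum file: "<md5>  <filename>"
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $col["fastq_ftp"] != "" {
        n = split($col["fastq_ftp"], urls, ";")
        split($col["fastq_md5"], md5s, ";")
        for (j = 1; j <= n; j++) {
            k = split(urls[j], parts, "/")  # keep only the file name from the URL
            print md5s[j] "  " parts[k]
        }
    }
' PRJEB1787_metadata.tsv > ena-md5s.md5

# Verify all downloaded files in one pass; exits non-zero on any mismatch
md5sum -c ena-md5s.md5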
And while this is a very viable option for which we could also write a workflow, it would be nice to have the SRA option as well.