
`vdb-validate` does not detect file corruption

Open suhrig opened this issue 4 months ago • 5 comments

Description of the bug

As explained in https://github.com/ncbi/sra-tools/issues/896, vdb-validate does not detect file corruption if the prefetched files do not contain MD5 checksums. When I use the option force_sratools_download, downloaded files have turned out to be corrupt many times. Worse, extracting the files with fasterq-dump does not always produce an error even if the file is corrupt. It is even conceivable that the extracted FastQ file looks perfectly intact, with only some bases or quality values changed. As such, the error may go completely unnoticed.

I propose changing the validation procedure. Fetching the MD5 sum of the prefetched SRA file with the following curl command, and then confirming it against the downloaded file with the md5sum command-line utility, should be more reliable:

curl 'https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?filetype=run&acc=SRRxxxxxxx'

Admittedly, I don't know whether there are situations where the MD5 sum cannot be obtained via the above curl command. It may be best to first try to obtain the MD5 sum and, if that fails, to use the current vdb-validate command as a fallback.
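A minimal sketch of that MD5-first flow with a vdb-validate fallback, written in Python for illustration. The function names (`file_md5`, `find_md5_values`, `fetch_expected_md5s`, `validate`) are hypothetical and not part of fetchngs or sra-tools. The sketch assumes the endpoint returns JSON with the checksum stored under an `md5` key somewhere in the response, so it searches for that key instead of relying on a fixed layout. It also assumes that vdb-validate reports failure through a non-zero exit status.

```python
import hashlib
import json
import subprocess
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
SDL_URL = "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_md5_values(node):
    """Collect every string stored under an 'md5' key anywhere in parsed JSON.

    Assumption: the response layout is not hard-coded here; any 'md5' key counts.
    """
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "md5" and isinstance(value, str):
                found.append(value.lower())
            else:
                found.extend(find_md5_values(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_md5_values(item))
    return found


def fetch_expected_md5s(accession, timeout=30):
    """Query the locate service and return all MD5 sums found in the response."""
    query = urllib.parse.urlencode({"filetype": "run", "acc": accession})
    with urllib.request.urlopen(f"{SDL_URL}?{query}", timeout=timeout) as response:
        return find_md5_values(json.load(response))


def validate(path, accession, fetch=fetch_expected_md5s):
    """Check the MD5 sum if one can be obtained, else fall back to vdb-validate."""
    try:
        expected = fetch(accession)
    except (OSError, ValueError):  # network failure or unparsable response
        expected = []
    if expected:
        return file_md5(path) in expected
    # Fallback: current behaviour (assumes a non-zero exit status means failure).
    return subprocess.run(["vdb-validate", path]).returncode == 0
```

Accepting a match against any returned checksum is a simplification: if the response lists several files for one accession, a stricter version would select the entry that corresponds to the prefetched file.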

Possibly, I will find time to submit a PR. I'm reporting this here in case someone else is faster.

Command used and terminal output

No response

Relevant files

No response

System information

No response

suhrig, Feb 20 '24 19:02