myvariant.info
Add MD5 checksum utility
Datasources in myvariant.info may contain large files to download (e.g. dbsnp release 155 is 380GB). For various reasons (like FTP connection issues), a download may be incomplete, leading to errors during the upload process.
Some of the datasources have MD5 checksum files available. It would be nice to download those *.md5 files as well and validate the data files in the post_dump phase.
The validation in bash is quite straightforward. Each .md5 file is essentially a tuple of (checksum, filename). md5sum -c will read a .md5 file, re-calculate the checksum for that filename and match it against the original checksum. E.g.
(venv) myvariant@su09:/data/hub/myvariant_hub/dbsnp/155$ cat refsnp-chr16.json.bz2.md5
aa8f0ec9c4752ea34dff2ae309d2a239 refsnp-chr16.json.bz2
(venv) myvariant@su09:/data/hub/myvariant_hub/dbsnp/155$ md5sum -c refsnp-chr16.json.bz2.md5
refsnp-chr16.json.bz2: OK
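To make the (checksum, filename) structure concrete, here is a minimal Python sketch that reads a .md5 file into tuples. The function name parse_md5_file is hypothetical, and the sketch assumes the usual md5sum layout of one "checksum filename" pair per line, with an optional leading "*" on the filename for binary mode.

```python
from pathlib import Path
from typing import List, Tuple


def parse_md5_file(md5_path: str) -> List[Tuple[str, str]]:
    """Return (checksum, filename) tuples from an md5sum-style file."""
    entries = []
    for line in Path(md5_path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace: "<checksum> <filename>".
        checksum, filename = line.split(None, 1)
        # md5sum may mark binary mode with a leading "*" on the filename.
        if filename.startswith("*"):
            filename = filename[1:]
        entries.append((checksum.lower(), filename))
    return entries
```

Called on the refsnp-chr16.json.bz2.md5 file shown above, this would yield a single tuple holding the checksum and the data file name.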
It's also feasible in Python with the built-in hashlib.md5(). See Generating an MD5 checksum of a file. The performance of feeding file content to hashlib should be taken into account before developing an MD5 helper class/function.
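One common way to address the performance concern is to feed the file to hashlib in fixed-size chunks, so that a multi-gigabyte file is never loaded into memory at once. The sketch below is only an illustration: the function name md5_of_file and the 1 MiB default chunk size are assumptions, and the chunk size would need benchmarking on the actual hub storage.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned hex string can be compared directly with the checksum stored in the corresponding .md5 file.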
@erikyao Looks like post_download or post_dump might be a good place to add the MD5 check for a data source like dbsnp:
https://github.com/biothings/biothings.api/blob/181a36fc2d5f782bb3608ec032891b0eaa9e7e1d/biothings/hub/dataload/dumper.py#L162-L172
See an example here from mychem: https://github.com/biothings/mychem.info/commit/79f704b4424036d06014798a2f143bcfab4617e8 and a small follow-up change: https://github.com/biothings/mychem.info/commit/6fa89376efff4fb1886253f18f97d11d9e1b8889
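As a rough illustration of what such a check could do once the files are on disk, the standalone sketch below validates every *.md5 file found in a data folder and raises on the first mismatch. It is not taken from the mychem commits above: verify_md5_files and ChecksumMismatch are hypothetical names, and how a dumper would call it from post_dump (and which folder it would pass) is left as an assumption.

```python
import hashlib
from pathlib import Path
from typing import List


class ChecksumMismatch(Exception):
    """Hypothetical error type for a failed MD5 validation."""


def verify_md5_files(data_folder: str) -> List[str]:
    """Check every *.md5 file in data_folder; return the verified filenames."""
    verified = []
    for md5_file in sorted(Path(data_folder).glob("*.md5")):
        for line in md5_file.read_text().splitlines():
            if not line.strip():
                continue
            expected, filename = line.strip().split(None, 1)
            if filename.startswith("*"):  # md5sum binary-mode marker
                filename = filename[1:]
            digest = hashlib.md5()
            # The data file is assumed to sit next to its .md5 file.
            with open(md5_file.parent / filename, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                raise ChecksumMismatch(f"{filename}: MD5 mismatch")
            verified.append(filename)
    return verified
```

A missing data file surfaces as a FileNotFoundError from open(), which a hook could treat the same way as a mismatch, i.e. as a signal to re-download before uploading.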