Include md5sum in JSON or as other output

Open aboffin opened this issue 6 months ago • 1 comments

Hi,

Thank you for your team's commendable work on datasets, which finally provides a single, comprehensive way to download data from NCBI. Previously one had to resort to a multitude of EUtils/Perl/Python scripts that output something almost, but not quite, entirely unlike what we wanted. However, reliability seems to be an issue, as with other tools.

Is there a way to check the integrity of the downloads? In the typical example below, no checksum information is present:

./datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip
unzip human_GRCh38_dataset.zip -d GRCh38
./datasets rehydrate --directory GRCh38

cd GRCh38/ncbi_dataset/data
grep md5 *json
# outputs nothing

I am perplexed that a mechanism as simple as a checksum was not provided, considering that networks do fail and partial downloads may lead to confusion at best and incorrect results at worst when such genomes are used in further analyses.
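
To illustrate the kind of check I have in mind, here is a minimal sketch. It assumes the package shipped a manifest in the standard md5sum format (one `<hash>  <relative/path>` line per file). The file name `md5sum.txt` and the helper `verify_md5` are hypothetical and not something datasets provides as far as I can tell; the sketch also assumes GNU coreutils `md5sum` is installed.

```shell
#!/usr/bin/env bash
# Sketch only: the manifest name and layout are assumptions, not part of the
# datasets output. Each manifest line is expected to look like
#   <md5 hash>  <path relative to DATA_DIR>

# verify_md5 DATA_DIR MANIFEST
# Returns 0 if every file listed in MANIFEST matches its checksum,
# 2 if the manifest is missing, and non-zero on any mismatch.
verify_md5() {
    local data_dir="$1"
    local manifest="$2"

    if [ ! -f "$manifest" ]; then
        echo "manifest not found: $manifest" >&2
        return 2
    fi

    # Resolve the manifest to an absolute path before changing directory.
    local manifest_abs
    manifest_abs="$(cd "$(dirname "$manifest")" && pwd)/$(basename "$manifest")"

    # Run in a subshell so the caller's working directory is left unchanged.
    # --quiet suppresses the "OK" lines and reports only failures.
    ( cd "$data_dir" && md5sum --check --quiet "$manifest_abs" )
}
```

With such a manifest, a call like `verify_md5 GRCh38 GRCh38/md5sum.txt` after rehydration would exit non-zero if any file were truncated or corrupted.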

I see that issue #206 raised the same question, but it was closed without a definitive answer regarding md5sum.

aboffin, Jan 04 '24 20:01