verification/validation of downloaded data.

Open janpb opened this issue 2 years ago • 3 comments

I emailed this question to nlm-support a couple of days ago. Sorry for double-posting, but I think this is a better way to ask my question.

feature/enhancement: verification/validation of downloaded data.

For me it's important to know if the data has been transferred properly, and I would like to know whether there is a method in Datasets to verify downloaded data, e.g. to check via a checksum or similar that a genome has been downloaded correctly.

I also noticed that already-completed ncbi_datasets downloads will be downloaded again. If I run datasets in a simple pipeline, I need to work around this, e.g. by renaming directories so they won't be picked up when rerunning the script.

If a validation option were available, it could facilitate pipelines, and I could sleep better at night knowing I downloaded the proper data.

I did look into the Datasets code but couldn't find methods performing such a verification. However, I'm not familiar with Go and could have missed them.

Cheers

janpb avatar Jul 22 '21 19:07 janpb

Hi Dr. Buchmann,

Thanks again for your feedback.

Regarding your suggestion for some way to know the data has been transferred properly: I have discussed this with our development team, and we see several different possibilities. Could you please clarify your use case, i.e. which type of checksum would be most useful for you?

Please see the detailed list below of possible ways that data can be verified, with our current level of support (or lack thereof) for each:

  1. Verifying that a downloaded NCBI Data package was transferred completely and without error:
  • We are using industry-standard Zip archives, which include checksums as part of that file format. You may test integrity via Zip tools, e.g. `unzip -t my_data_package.zip` should return: `No errors detected in compressed data of my_data_package.zip`. (Scripted equivalents of the feasible checks in this list are sketched after it.)
  • Beware that on macOS we have identified and reported to Apple several bugs with the native decompression tools, affecting Safari, Archive Utility, and ditto. Please use unzip as above to test Zip archive integrity.
  2. Verifying that NCBI data downloaded from our REST API, in formats other than a Zip Data package, was transferred completely and without error:
  • Formats such as CSV, TSV, JSON, and JSON Lines do not have an industry-standard mechanism for including and testing integrity via checksums.
  • JSON does allow identifying a partial response, since any container structure must have a matching closing element (closing brace or bracket).
  • For the other formats, the only mechanism is trust in the transport layer of the HTTPS web protocol, which is limited. It is technically challenging to return two responses from a single HTTPS request in order to provide a separate checksum, unless using an archive format such as Zip Data packages (as above).
  • The HTTPS protocol is being enhanced to support Digest Headers (the older Content-MD5 checksums were deprecated), per the Digest Headers Working Group. However, this is not yet an approved RFC.
  3. Verifying that files on disk, originating from a (full) Data package that was extracted, are complete and without error:
  • We do not currently provide a mechanism to check the contents of an extracted Data package.
  • A possible method to check the contents of a previously extracted Data package is to unzip the original Data package into another directory and `diff -r` the two copies.
  4. Verifying that files on disk, from a Dehydrated Data package (i.e. one that was too large to download from the web UI) that was rehydrated (via the datasets tool), are complete and without error:
  • Currently, the datasets tool does not support checking the validity of a dehydrated Data package which was rehydrated.
  5. Verifying that specific elements of content in files, for example a single record within a FASTA file, are complete and without error:
  • Currently, our file formats such as FASTA do not include per-record checksums. While checksums could be included via the defline, as one possibility, there is no standard practice for this.
  6. Providing an API to request checksums or other update metadata for a record:
  • Currently, we do not provide an API for fetching only checksums or other minimal metadata for a given request. The HTTPS protocol supports notions such as HEAD requests and E-Tag headers for such purposes, but we have not implemented any specific APIs to provide this information for arbitrary queries. For Dehydrated Data packages which reference files on our FTP site, the HTTPS server should satisfy the HEAD and E-Tag protocol, but this only covers a portion of the content served by NCBI Datasets.

If any of these use cases matches your interest, please let us know. If you have other use cases, please feel free to describe them to us. Thanks.

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI

ericcox1 avatar Jul 26 '21 17:07 ericcox1

Dear Dr. Cox,

Thanks for the detailed answer. My use case is Number 4: verifying the completeness of a rehydrated dataset.

I was hoping for a checksum for each file, e.g. MD5 or some SHA variant, listed in either ncbi_dataset/fetch.txt or, even better, in ncbi_dataset/data/dataset_catalog.json. The latter already lists the file path and file type for each accession. While it would be a real luxury to have the verification done by datasets itself, just having the checksums would be a great advantage.

The testing can then be done by the user, e.g. compute an MD5 checksum and compare it to the one listed in ncbi_dataset/data/dataset_catalog.json, as sketched below.
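A minimal sketch of what I have in mind, assuming a hypothetical md5 field next to the existing filePath and fileType entries in dataset_catalog.json (no such field exists today):

```python
# Sketch of the proposed check. The "md5" field is hypothetical:
# today's dataset_catalog.json lists only filePath and fileType,
# so this would only work once checksums are added to the catalog.
import hashlib
import json
from pathlib import Path


def md5_of(path: Path) -> str:
    """Compute an MD5 digest in chunks (genome files can be large)."""
    h = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_package(package_root: str) -> list:
    """Return the files whose on-disk MD5 disagrees with the catalog."""
    root = Path(package_root)  # e.g. "ncbi_dataset"
    catalog = json.loads((root / "data" / "dataset_catalog.json").read_text())
    bad = []
    for assembly in catalog.get("assemblies", []):
        for entry in assembly.get("files", []):
            file_path = root / "data" / entry["filePath"]
            if md5_of(file_path) != entry["md5"]:  # hypothetical field
                bad.append(str(file_path))
    return bad
```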

Genome data on the FTP server, e.g. for Arabidopsis thaliana, contains a file md5checksums listing the checksum for each file in the corresponding directory.

It's possible to extract the accession from ncbi_dataset/fetch.txt or ncbi_dataset/data/dataset_catalog.json, adjust and assemble the FTP path, download the md5 file from that path, and check the checksums. However, this is rather convoluted and requires a new request for each file.
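Roughly, that workaround looks like the sketch below, assuming the assembly's directory URL on the FTP site has already been resolved (the directory name embeds the assembly name, which the accession alone does not give you) and that the checksum file is named md5checksums.txt:

```python
# Rough sketch of the FTP workaround. The directory URL is assumed to
# be already resolved; md5checksums.txt lines look like "<md5>  ./<file>".
import hashlib
import urllib.request
from pathlib import Path


def remote_md5s(dir_url: str) -> dict:
    """Fetch and parse the per-directory md5checksums.txt file."""
    with urllib.request.urlopen(f"{dir_url}/md5checksums.txt") as resp:
        text = resp.read().decode("utf-8")
    sums = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            md5, name = parts
            sums[name.removeprefix("./")] = md5
    return sums


def check_local_files(dir_url: str, local_dir: str) -> list:
    """Return local file names whose MD5 disagrees with the FTP listing."""
    expected = remote_md5s(dir_url)
    bad = []
    for path in Path(local_dir).iterdir():
        if path.name in expected:
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            if digest != expected[path.name]:
                bad.append(path.name)
    return bad
```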

janpb avatar Jul 27 '21 07:07 janpb

Dr. Buchmann,

Yes, I can see how having the checksum to verify the completeness of a rehydrated data package would be helpful. I have passed this on to our development team and we hope to get a cost estimate soon. I'll keep you posted. Thanks again for your feedback.

-Eric

ericcox1 avatar Jul 27 '21 13:07 ericcox1