prepared databases may be incorrect/corrupted?

Open aboffin opened this issue 1 year ago • 10 comments

Hi,

When I run sourmash gather as follows, I get the error shown below:

$ sourmash gather --ksize 51 --dna ../dat/query/query.sig ../dat/db/k51/genbank-2022.03-bacteria-k51.zip -o results_test.txt

== This is sourmash version 4.4.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=51
loaded query: query... (k=51, DNA)
Traceback (most recent call last):
  File "/cb/run/miniconda3/bin/sourmash", line 8, in <module>
    sys.exit(main())
  File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/__main__.py", line 13, in main
    return mainmethod(args)
  File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/cli/gather.py", line 147, in main
    return sourmash.commands.gather(args)
  File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/commands.py", line 695, in gather
    databases = sourmash_args.load_dbs_and_sigs(args.databases, query, False,
  File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/sourmash_args.py", line 322, in load_dbs_and_sigs
    if not db:
  File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/index/__init__.py", line 569, in __bool__
    next(iter(self.signatures()))
  File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/index/__init__.py", line 638, in signatures
    data = self.storage.load(filename)
  File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/sbt_storage.py", line 158, in load
    rawbuf = self._methodcall(lib.zipstorage_load, to_bytes(path), len(path), size)
  File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/utils.py", line 25, in _methodcall
    return rustcall(func, self._get_objptr(), *args)
  File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/utils.py", line 78, in rustcall
    raise exc
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'assertion failed: `(left == right)`
  left: `[0, 0, 0, 0]`,
 right: `[80, 75, 3, 4]`' at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/piz-0.4.0/src/spec.rs:651

I could not find any related errors in the issues, so any pointers to resolve this are appreciated. (The same query signature works against a GenBank fungal prepared database as expected.)
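A note on the assertion values: `[80, 75, 3, 4]` is the byte sequence `PK\x03\x04`, the signature that starts every local file header in a zip archive, so the panic suggests that the reader found four zero bytes where it expected a member's header. The sketch below is one way to look for that condition with Python's standard `zipfile` module. The function name `find_bad_local_headers` and the `limit` parameter are made up for illustration, and this is not how sourmash or piz perform the check.

```python
import zipfile

# The "right" value from the panic message: b"PK\x03\x04".
LOCAL_HEADER_MAGIC = bytes([80, 75, 3, 4])


def find_bad_local_headers(zip_path, limit=10):
    """Return names of members whose local file header lacks the zip signature.

    Hypothetical helper: the central directory is used to locate each
    member, then the first four bytes at that offset are compared with
    the expected signature. Stops after `limit` bad members.
    """
    bad = []
    with zipfile.ZipFile(zip_path) as zf, open(zip_path, "rb") as raw:
        for info in zf.infolist():
            raw.seek(info.header_offset)
            if raw.read(4) != LOCAL_HEADER_MAGIC:
                bad.append(info.filename)
                if len(bad) >= limit:
                    break
    return bad
```

Calling `find_bad_local_headers("genbank-2022.03-bacteria-k51.zip")` on a healthy archive should return an empty list. Only four bytes per member are read, so this stays fairly quick even on a large database.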

aboffin avatar Jul 22 '22 19:07 aboffin

well that is exciting! I haven't seen that error before.

Does the bacterial genbank zip file look ok when you do an unzip -v on it? The first thing I can think of is that the zip file is somehow corrupted.

ctb avatar Jul 23 '22 11:07 ctb

The bacterial Genbank zip file seems to be ok (first 11 lines of output shown).

$ unzip -v genbank-2022.03-bacteria-k51.zip | sed 11q
Archive:  genbank-2022.03-bacteria-k51.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
   21207  Stored    21207   0% 03-29-2022 08:19 15914876  signatures/46bd9dca15ad7b06b85e8c6aa1dfc5bf.sig.gz
   56312  Stored    56312   0% 03-29-2022 08:19 992a43ed  signatures/1eb5a47f92db27c870944837cfe56186.sig.gz
   15472  Stored    15472   0% 03-29-2022 08:19 95afa1fe  signatures/f0bef8a28010a54c50d02b884cab1e8d.sig.gz
   39033  Stored    39033   0% 03-29-2022 08:19 f1489e45  signatures/84aad26da19a645b20a8adfc31863e37.sig.gz
   40071  Stored    40071   0% 03-29-2022 08:19 a461bbe3  signatures/3df28795f09062da9901160bbc86cb06.sig.gz
   37123  Stored    37123   0% 03-29-2022 08:19 e321a654  signatures/c4f2d6791ab63a8a4a4991b6591e2f4f.sig.gz
   44645  Stored    44645   0% 03-29-2022 08:19 a35000d3  signatures/a34c7043ea61d9eb6c2891b3797c27e2.sig.gz
   42110  Stored    42110   0% 03-29-2022 08:19 68d6d0cd  signatures/9ab6351596cae07817a75eec14307d07.sig.gz

The zip archive comment also appears to be OK (-z displays only the archive comment; this also appears with -v above):

$ unzip -z  genbank-2022.03-bacteria-k51.zip
Archive:  genbank-2022.03-bacteria-k51.zip

As a control, I tested a partial fungal GenBank file, and it shows what seems to be the expected error message for a corrupted zip file (both unzip -v and unzip -z produce the same output):

$ unzip -z genbank-2022.03-fungi-k51.zip.partial
Archive:  genbank-2022.03-fungi-k51.zip.partial
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of genbank-2022.03-fungi-k51.zip.partial or
        genbank-2022.03-fungi-k51.zip.partial.zip, and cannot find genbank-2022.03-fungi-k51.zip.partial.ZIP, period.
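
One caveat on this control: as far as I know, unzip -v and unzip -z only read the central directory and archive comment at the end of the file, so they catch a truncated download but can still succeed when member data earlier in the archive is damaged. A fuller check reads every member. Below is a minimal sketch using `ZipFile.testzip()` from the Python standard library; the wrapper name `first_corrupt_member` is made up for illustration.

```python
import zipfile


def first_corrupt_member(zip_path):
    """Return the name of the first member that fails to read, or None.

    testzip() reads every member and checks its header and CRC, so on a
    multi-gigabyte database this reads the whole file and can be slow.
    """
    with zipfile.ZipFile(zip_path) as zf:
        return zf.testzip()
```

On the command line, `unzip -t` performs a comparable full test of the archive.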

aboffin avatar Jul 23 '22 13:07 aboffin

Hmm. I wonder if there was an issue when uploading the bacteria DB. The copy I have here (the one used for the upload) doesn't generate these errors and gets further along in gather. Will investigate.

luizirber avatar Jul 23 '22 17:07 luizirber

@luizirber, if it is helpful to troubleshoot, here is the output of du:

$ du genbank-2022.03-bacteria-k51.zip 
38585372	genbank-2022.03-bacteria-k51.zip

and the md5 checksum:

$ md5sum genbank-2022.03-bacteria-k51.zip >out;
$ cat out
867f2bab5481bb91fabd060e931a30c1  genbank-2022.03-bacteria-k51.zip

Thanks for looking into this!

aboffin avatar Jul 23 '22 23:07 aboffin

I reuploaded it and it was the same IPFS hash (so I don't think the upload failed), but the MD5 for my local file doesn't match yours!

$ md5sum genbank-2022.03-bacteria-k51.zip
c7ef7c815a00337a7252ab49ffab3e8b  genbank-2022.03-bacteria-k51.zip

I'll try to download it too and see if the MD5 matches yours or mine...

luizirber avatar Jul 24 '22 00:07 luizirber

Seems like https://dweb.link is quite slow, one alternative is to use another gateway (like https://cloudflare-ipfs.com) to download it (still slow, but faster...). Command for this option: wget -c https://cloudflare-ipfs.com/ipfs/bafybeie3eyyectnh5xqxz44oa3qj5vura3bffqdwfqk6jjuzzadkh7e2sq

If you really want speed, ipget will use IPFS directly, and seems to be running 6x faster than the cloudflare gateway.

luizirber avatar Jul 24 '22 00:07 luizirber

I got c7ef7c815a00337a7252ab49ffab3e8b as MD5 with ipget, I'll leave wget running and report back when it's done. But I think your download might be corrupted =(

luizirber avatar Jul 24 '22 01:07 luizirber

Thank you @luizirber! I downloaded the file over IPFS with ipget, and its md5sum matches the one above: c7ef7c815a00337a7252ab49ffab3e8b.

I also ran sourmash gather against this database with a query signature, as in my original question, and it runs without any errors.

I am curious to know what the md5sum of the dweb.link file is, to see if the file there is corrupted.

Also if possible, it might be useful to provide md5sum signatures for all download files.
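
Until checksums are published, a download can be compared against a known-good value such as the one reported in this thread. The sketch below uses only `hashlib` from the standard library; the function names `md5_of_file` and `matches_published_md5` are made up for illustration, and the file is read in chunks so a database of tens of gigabytes does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected_hex):
    """Compare a file's MD5 with an expected hex digest, ignoring case and whitespace."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For the bacterial k51 database discussed here, the expected value would be `c7ef7c815a00337a7252ab49ffab3e8b`, for example `matches_published_md5("genbank-2022.03-bacteria-k51.zip", "c7ef7c815a00337a7252ab49ffab3e8b")`.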

Thank you again for your help in resolving this issue!

aboffin avatar Jul 24 '22 18:07 aboffin

I think I see ~3 followup items from this -

  • provide md5sums for files
  • change links to something other than dweb
  • provide more details on getting the files via e.g. ipfs

am I missing any? :)

ctb avatar Jul 25 '22 10:07 ctb

> Seems like https://dweb.link is quite slow, one alternative is to use another gateway (like https://cloudflare-ipfs.com) to download it (still slow, but faster...). Command for this option: wget -c https://cloudflare-ipfs.com/ipfs/bafybeie3eyyectnh5xqxz44oa3qj5vura3bffqdwfqk6jjuzzadkh7e2sq
>
> If you really want speed, ipget will use IPFS directly, and seems to be running 6x faster than the cloudflare gateway.

Hi @luizirber, do you have the cloudflare link for genbank-2022.03-bacteria-k31.zip? ref #2179

bluegenes avatar Aug 08 '22 14:08 bluegenes

download problems hopefully fixed by #2255.

Docs updated: the prepared databases page now has robustified farm links.

ctb avatar Sep 03 '22 16:09 ctb