sourmash
prepared databases may be incorrect/corrupted?
Hi,
When I run sourmash gather as follows, I get the error message below:
$ sourmash gather --ksize 51 --dna ../dat/query/query.sig ../dat/db/k51/genbank-2022.03-bacteria-k51.zip -o results_test.txt
== This is sourmash version 4.4.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
selecting specified query k=51
loaded query: query... (k=51, DNA)
Traceback (most recent call last):-2022.03-bacteria-k51.zip...
File "/cb/run/miniconda3/bin/sourmash", line 8, in <module>
sys.exit(main())
File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/__main__.py", line 13, in main
return mainmethod(args)
File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/cli/gather.py", line 147, in main
return sourmash.commands.gather(args)
File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/commands.py", line 695, in gather
databases = sourmash_args.load_dbs_and_sigs(args.databases, query, False,
File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/sourmash_args.py", line 322, in load_dbs_and_sigs
if not db:
File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/index/__init__.py", line 569, in __bool__
next(iter(self.signatures()))
File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/index/__init__.py", line 638, in signatures
data = self.storage.load(filename)
File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/sbt_storage.py", line 158, in load
rawbuf = self._methodcall(lib.zipstorage_load, to_bytes(path), len(path), size)
File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/utils.py", line 25, in _methodcall
return rustcall(func, self._get_objptr(), *args)
File "/cb/run/miniconda3/lib/python3.9/site-packages/sourmash/utils.py", line 78, in rustcall
raise exc
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'assertion failed: `(left == right)`
left: `[0, 0, 0, 0]`,
right: `[80, 75, 3, 4]`' at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/piz-0.4.0/src/spec.rs:651
I could not find any related errors in the issues, so any pointers to resolve this are appreciated. (The same query signature works against a GenBank fungal prepared database as expected.)
Well, that is exciting! I haven't seen that error before.
Does the bacterial GenBank zip file look OK when you run unzip -v on it? The first thing I can think of is that the zip file is somehow corrupted.
The bacterial Genbank zip file seems to be ok (first 11 lines of output shown).
$ unzip -v genbank-2022.03-bacteria-k51.zip | sed 11q
Archive: genbank-2022.03-bacteria-k51.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
21207 Stored 21207 0% 03-29-2022 08:19 15914876 signatures/46bd9dca15ad7b06b85e8c6aa1dfc5bf.sig.gz
56312 Stored 56312 0% 03-29-2022 08:19 992a43ed signatures/1eb5a47f92db27c870944837cfe56186.sig.gz
15472 Stored 15472 0% 03-29-2022 08:19 95afa1fe signatures/f0bef8a28010a54c50d02b884cab1e8d.sig.gz
39033 Stored 39033 0% 03-29-2022 08:19 f1489e45 signatures/84aad26da19a645b20a8adfc31863e37.sig.gz
40071 Stored 40071 0% 03-29-2022 08:19 a461bbe3 signatures/3df28795f09062da9901160bbc86cb06.sig.gz
37123 Stored 37123 0% 03-29-2022 08:19 e321a654 signatures/c4f2d6791ab63a8a4a4991b6591e2f4f.sig.gz
44645 Stored 44645 0% 03-29-2022 08:19 a35000d3 signatures/a34c7043ea61d9eb6c2891b3797c27e2.sig.gz
42110 Stored 42110 0% 03-29-2022 08:19 68d6d0cd signatures/9ab6351596cae07817a75eec14307d07.sig.gz
The zip archive comment also appears to be OK (-z displays only the archive comment; it also appears with -v above):
$ unzip -z genbank-2022.03-bacteria-k51.zip
Archive: genbank-2022.03-bacteria-k51.zip
As a control, I tested a partially downloaded fungal GenBank file, and it shows what seems to be the expected error message for a corrupted zip file (both unzip -v and unzip -z produce the same output):
$ unzip -z genbank-2022.03-fungi-k51.zip.partial
Archive: genbank-2022.03-fungi-k51.zip.partial
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of genbank-2022.03-fungi-k51.zip.partial or
genbank-2022.03-fungi-k51.zip.partial.zip, and cannot find genbank-2022.03-fungi-k51.zip.partial.ZIP, period.
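For reference, the values [80, 75, 3, 4] in the panic message are the bytes "PK\x03\x04", the signature that starts every local file header in a zip archive. unzip -v builds its listing from the central directory at the end of the file, so the listing can look fine even when an earlier region of the archive is damaged. Below is a minimal Python sketch of a check that looks at the local headers directly, assuming the central directory is still readable. The function name find_bad_local_headers and its limit parameter are illustrative and are not part of sourmash or piz.

```python
import zipfile

# Signature that starts every local file header: bytes 80, 75, 3, 4.
LOCAL_HEADER_MAGIC = b"PK\x03\x04"


def find_bad_local_headers(zip_path, limit=10):
    """Return names of entries whose local file header signature is wrong.

    Hypothetical helper: reads the central directory with zipfile, then
    inspects the four bytes stored at each entry's recorded header offset.
    Stops after `limit` bad entries so a badly damaged archive fails fast.
    """
    bad = []
    with zipfile.ZipFile(zip_path) as zf, open(zip_path, "rb") as raw:
        for info in zf.infolist():
            raw.seek(info.header_offset)
            if raw.read(4) != LOCAL_HEADER_MAGIC:
                bad.append(info.filename)
                if len(bad) >= limit:
                    break
    return bad
```

An empty list means every local header starts with the expected signature; it does not prove the data is intact. unzip -t (or zipfile.ZipFile.testzip() in Python) goes further and checks each entry's CRC, at the cost of reading the whole archive.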
Hmm. I wonder if there was an issue when uploading the bacteria DB. The copy I have here (the one used for the upload) doesn't generate these errors and gets further along in gather. Will investigate.
@luizirber, in case it helps with troubleshooting, here is the output of du:
$ du genbank-2022.03-bacteria-k51.zip
38585372 genbank-2022.03-bacteria-k51.zip
and the md5 checksum:
$ md5sum genbank-2022.03-bacteria-k51.zip >out;
$ cat out
867f2bab5481bb91fabd060e931a30c1 genbank-2022.03-bacteria-k51.zip
Thanks for looking into this!
I reuploaded it and it was the same IPFS hash (so I don't think the upload failed), but the MD5 for my local file doesn't match yours!
$ md5sum genbank-2022.03-bacteria-k51.zip
c7ef7c815a00337a7252ab49ffab3e8b genbank-2022.03-bacteria-k51.zip
I'll try to download it too and see if the MD5 matches yours or mine...
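The same comparison can be done from Python, which may be handy where md5sum is not installed. This is a minimal sketch: the expected value is the MD5 reported above for the local copy used for the upload, and the names md5_of_file and matches_expected are illustrative, not part of sourmash.

```python
import hashlib

# MD5 reported in this thread for the copy of
# genbank-2022.03-bacteria-k51.zip that was used for the upload.
EXPECTED_MD5 = "c7ef7c815a00337a7252ab49ffab3e8b"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte database never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 equals the expected hex digest."""
    return md5_of_file(path) == expected.lower()
```

For example, matches_expected("genbank-2022.03-bacteria-k51.zip") returning False would point to a damaged download.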
Seems like https://dweb.link is quite slow. One alternative is to use another gateway (like https://cloudflare-ipfs.com) to download it (still slow, but faster...). Command for this option:
wget -c https://cloudflare-ipfs.com/ipfs/bafybeie3eyyectnh5xqxz44oa3qj5vura3bffqdwfqk6jjuzzadkh7e2sq
If you really want speed, ipget will use IPFS directly, and seems to be running 6x faster than the cloudflare gateway.
I got c7ef7c815a00337a7252ab49ffab3e8b as the MD5 with ipget. I'll leave wget running and report back when it's done. But I think your download might be corrupted =(
Thank you @luizirber! I downloaded the file from the IPFS gateway with ipget, and the md5sum matches the one above: c7ef7c815a00337a7252ab49ffab3e8b.
I also tested this database with a query signature using sourmash gather, as in my original question, and it runs without any errors.
I am curious to know what the md5sum of the dweb.link file is, to see whether the file there is corrupted.
Also, if possible, it might be useful to provide MD5 checksums for all downloadable files.
Thank you again for your help in resolving this issue!
I think I see ~3 followup items from this -
- provide md5sums for files
- change links to something other than dweb
- provide more details on getting the files via e.g. ipfs
am I missing any? :)
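For the first item, one possible shape is a manifest in the "hash, two spaces, filename" layout that md5sum prints, so users can check downloads with md5sum -c. This is only a sketch under assumptions: the function name md5_manifest, the "*.zip" pattern, and the idea of keeping all databases in one directory are illustrative and do not describe how the sourmash databases are actually published.

```python
import hashlib
from pathlib import Path


def md5_manifest(directory, pattern="*.zip", chunk_size=1 << 20):
    """Build an md5sum-style manifest for files matching `pattern`.

    Hypothetical helper: returns one "<md5>  <filename>" line per file,
    sorted by name, in the layout that `md5sum -c` accepts.
    """
    lines = []
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.md5()
        with path.open("rb") as fh:
            # Read in chunks so large databases are not loaded into memory.
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        lines.append(f"{digest.hexdigest()}  {path.name}\n")
    return "".join(lines)
```

Writing the returned text to a file placed next to the downloads would let users verify everything in one step.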
Hi @luizirber, do you have the cloudflare link for genbank-2022.03-bacteria-k31.zip? ref #2179
download problems hopefully fixed by #2255.
Docs updated: prepared databases page here now has robustified farm links.