sourmash
sourmash signature manifest hangs for some signature zip files
I was trying to compare two signature collections to find the genomes present in one but not the other, using sourmash signature manifest to list the genomes in each. This worked for some files, for example the protozoa and fungi files available on sourmash's prepared databases page; producing the list usually takes anywhere from a few minutes to an hour. But running the same command on the GenBank viral signatures simply hangs for more than 48 hours.
Command used:
sourmash signature manifest genbank-2022.03-viral-k21.zip -o viral_manifest.txt
Command output:
== This is sourmash version 4.8.2. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
It just hangs like this. No output file was produced. Version: 4.8.2
Also, am I doing this correctly? Is there a better way to compare two signature collections and find the genomes that differ between them?
I suspect that if you supply --no-rebuild-manifest, things will go much faster. See the sourmash sig manifest help output:
The manifest will be rebuilt by iterating over the signatures in the file unless --no-rebuild-manifest is specified; for large collections, rebuilding the manifest can take a long time!
(Yes, this is dumb default behavior; my apologies 😭. See https://github.com/sourmash-bio/sourmash/issues/2034)
sig manifest is probably the simplest way to do this. Ideally sig describe --csv would let you do this too, but it will be slow.
You can also use sig collect if you want to work with multiple files and produce standalone manifests, which are potentially useful; see the docs.
But sig manifest is fine ;)
And, again, sorry for the terrible legacy behavior of rebuilding the manifest...
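For the comparison itself, once a manifest CSV has been written for each collection, the difference can be taken with a few lines of Python. This is only a sketch under some assumptions: that each manifest CSV may start with a `#` comment line before the header row, and that it has a `name` column identifying each genome. The function names `read_manifest_names` and `names_only_in_first` are made up for illustration and are not part of sourmash.

```python
import csv


def read_manifest_names(path, column="name"):
    """Return the set of values found in `column` of a manifest CSV.

    Assumption: the file may begin with a single '#' comment line
    (e.g. a manifest version marker) before the CSV header row.
    """
    with open(path, newline="") as fp:
        first = fp.readline()
        if not first.startswith("#"):
            # No comment line: rewind so the header row is not lost.
            fp.seek(0)
        reader = csv.DictReader(fp)
        return {row[column] for row in reader}


def names_only_in_first(manifest_a, manifest_b, column="name"):
    """Return a sorted list of names present in manifest_a but not manifest_b."""
    in_a = read_manifest_names(manifest_a, column)
    in_b = read_manifest_names(manifest_b, column)
    return sorted(in_a - in_b)
```

Calling `names_only_in_first("viral_manifest.txt", "other_manifest.txt")` would then list the names found only in the first file. If the two collections name the same genome differently, comparing on another column (passed via `column`) may work better.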
I've rerun the command with --no-rebuild-manifest, and it has still been running for days. Am I doing anything wrong?
sourmash sig manifest --no-rebuild-manifest genbank-2022.03-viral-k21.zip -o viral_manifest.txt
Sorry for taking so long to get back to this -
/usr/bin/time -v sourmash sig manifest --no-rebuild-manifest genbank-2022.03-viral-k21.zip -o /tmp/viral_manifest.txt
yields:
....
manifest contains 47951 signatures total.
wrote manifest to '/tmp/viral_manifest.txt' (csv)
Command being timed: "sourmash sig manifest --no-rebuild-manifest genbank-2022.03-viral-k21.zip -o /tmp/viral_manifest.txt"
User time (seconds): 18.94
System time (seconds): 0.96
Percent of CPU this job got: 492%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.04
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 281684
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 113
Minor (reclaiming a frame) page faults: 81247
Voluntary context switches: 4477
Involuntary context switches: 2002
Swaps: 0
File system inputs: 23224
File system outputs: 17768
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Does unzip -v genbank-2022.03-viral-k21.zip work for you?
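If unzip is not available, roughly the same check can be done from Python with the standard zipfile module. The sketch below lists the archive members, checks whether a stored manifest is present, and runs a CRC check over every member. It assumes the manifest is stored inside the zip under the name `SOURMASH-MANIFEST.csv`; treat that name, and the helper `inspect_zip`, as illustrative assumptions rather than a documented interface.

```python
import zipfile


def inspect_zip(path, manifest_name="SOURMASH-MANIFEST.csv"):
    """Report basic facts about a zip archive.

    Assumption: a prebuilt manifest, if present, is stored as a
    member called `manifest_name`.
    """
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        return {
            "n_members": len(names),
            "has_manifest": manifest_name in names,
            # testzip() reads every member and returns the name of the
            # first one with a bad CRC or header, or None if all are OK.
            "first_bad_member": zf.testzip(),
        }
```

If opening the file raises `zipfile.BadZipFile`, or `first_bad_member` is not `None`, the download is probably truncated or corrupted. Note that the CRC check reads the whole archive, so it can take a while on large files.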