
sourmash signature manifest hangs for some signature zip file

Open zxl124 opened this issue 8 months ago • 4 comments

I was trying to compare two signature collections to find the genomes present in one but not the other, using sourmash signature manifest to list the genomes in each. This worked for some files, for example the protozoa and fungi files available on sourmash's prepared databases page, where it usually takes anywhere from a few minutes to an hour to get the list. But running the same command on the GenBank viral signatures file simply hangs for more than 48 hours.

Command used:

sourmash signature manifest genbank-2022.03-viral-k21.zip -o viral_manifest.txt

Command output:


== This is sourmash version 4.8.2. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==


It just hangs like this. No output file was produced. Version: 4.8.2

Also, am I doing this correctly? Is there a better way to compare two signature collections and find the genomes that are in one but not the other?

zxl124 avatar Mar 19 '25 23:03 zxl124

I suspect that if you supply --no-rebuild-manifest, things will go much faster. See the sourmash sig manifest help output:

The manifest will be rebuilt by iterating over the signatures in the file unless --no-rebuild-manifest is specified; for large collections, rebuilding the manifest can take a long time!

(Yes, this is dumb default behavior; my apologies 😭 - see https://github.com/sourmash-bio/sourmash/issues/2034)

ctb avatar Mar 19 '25 23:03 ctb

sig manifest is probably the simplest way to do this. Ideally sig describe --csv would let you do this too, but it will be slow.

You can also use sig collect if you want to work with multiple files and produce standalone manifests, which are potentially useful; see the docs.

But sig manifest is fine ;)

And, again, sorry for the terrible legacy behavior of rebuilding the manifest...
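For the original goal (genomes in one collection but not the other), here is a minimal sketch of comparing two manifests once sig manifest has written them. It assumes the manifest CSV starts with a `#` comment line followed by a header row that includes a `name` column; check the header of your own output first. The function names and the second filename in the usage note are hypothetical, not part of sourmash.

```python
import csv


def read_manifest_names(path, column="name"):
    """Return the set of values in `column` from a manifest CSV file.

    Assumption: lines starting with '#' (e.g. a version comment at the
    top of the file) are not data and can be skipped before the header.
    """
    with open(path, newline="") as fp:
        rows = (line for line in fp if not line.startswith("#"))
        return {row[column] for row in csv.DictReader(rows)}


def names_only_in_first(path_a, path_b, column="name"):
    """Return sorted names present in manifest A but absent from manifest B."""
    names_a = read_manifest_names(path_a, column)
    names_b = read_manifest_names(path_b, column)
    return sorted(names_a - names_b)


# Example usage (the second filename is a placeholder):
# for name in names_only_in_first("viral_manifest.txt", "other_manifest.txt"):
#     print(name)
```

Comparing on `name` is a choice made for this sketch; if the two collections label the same genome differently, a different column may be a better key.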

ctb avatar Mar 19 '25 23:03 ctb

I've rerun the command with --no-rebuild-manifest, and it has still been running for days. Is there anything I am doing wrong?

sourmash sig manifest --no-rebuild-manifest genbank-2022.03-viral-k21.zip -o viral_manifest.txt

zxl124 avatar Mar 25 '25 18:03 zxl124

Sorry for taking so long to get back to this -

/usr/bin/time -v sourmash sig manifest --no-rebuild-manifest genbank-2022.03-viral-k21.zip -o /tmp/viral_manifest.txt

yields:

....
manifest contains 47951 signatures total.
wrote manifest to '/tmp/viral_manifest.txt' (csv)
        Command being timed: "sourmash sig manifest --no-rebuild-manifest genbank-2022.03-viral-k21.zip -o /tmp/viral_manifest.txt"
        User time (seconds): 18.94
        System time (seconds): 0.96
        Percent of CPU this job got: 492%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.04
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 281684
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 113
        Minor (reclaiming a frame) page faults: 81247
        Voluntary context switches: 4477
        Involuntary context switches: 2002
        Swaps: 0
        File system inputs: 23224
        File system outputs: 17768
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Does unzip -v genbank-2022.03-viral-k21.zip work for you?
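If unzip is not available, a rough equivalent of that check can be sketched with Python's standard zipfile module: list the members, verify their checksums, and see whether a stored manifest is present. The member name `SOURMASH-MANIFEST.csv` is an assumption about how the zip is laid out, and `check_sourmash_zip` is a hypothetical helper, not a sourmash function.

```python
import zipfile


def check_sourmash_zip(path, manifest_name="SOURMASH-MANIFEST.csv"):
    """Run basic integrity checks on a zip file.

    Raises zipfile.BadZipFile if `path` is not a readable zip archive.
    Assumption: a stored manifest, if present, is a member called
    `manifest_name`.
    """
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        # testzip() reads every member and checks its CRC; it returns the
        # name of the first bad member, or None if all members are fine.
        first_bad = zf.testzip()
    return {
        "n_members": len(names),
        "has_manifest": manifest_name in names,
        "first_bad_member": first_bad,
    }


# Example usage:
# print(check_sourmash_zip("genbank-2022.03-viral-k21.zip"))
```

A truncated or corrupted download would typically show up here as a BadZipFile error or a non-None `first_bad_member`; a missing manifest member could explain why --no-rebuild-manifest does not help.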

ctb avatar Jun 22 '25 16:06 ctb