DNA sketching in singleton mode, but only with zip compression

Open • standage opened this issue 8 months ago • 1 comment

Greetings!

I think I read somewhere in the sourmash documentation that zip compression is preferred for sketches because in practical terms it results in the smallest sig files. After some recent testing, I wanted to report an observation: running sourmash sketch dna --singleton is orders of magnitude slower when writing to a .sig.zip file than to a .sig or .sig.gz file. I didn't observe this behavior in non-singleton mode, nor when I performed the file system equivalent of singleton mode, i.e., when I split a single FASTA file with hundreds of thousands of sequences into numerous FASTA files, each with a single sequence.
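For anyone who wants to reproduce the "file system equivalent of singleton mode" comparison, here is a minimal Python sketch of the splitting step. It is not part of sourmash; the function name split_fasta and the record_NNNNNNN.fasta naming scheme are hypothetical, and it assumes a plain-text (uncompressed) FASTA file in which every record starts with a ">" header line.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file.

    Returns the list of paths written, in input order.
    The output file names are an illustrative choice, not a sourmash convention.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    with open(fasta_path) as fin:
        for line in fin:
            if line.startswith(">"):
                # A header line starts a new record: close the previous file.
                if handle is not None:
                    handle.close()
                path = out_dir / f"record_{len(paths):07d}.fasta"
                handle = open(path, "w")
                paths.append(path)
            # Lines before the first header (if any) are skipped.
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return paths
```

Each resulting file holds one sequence, so it can be sketched without --singleton.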

In my most recent testing, sourmash sketch dna --singleton ran successfully in 50-51 seconds for both .sig and .sig.gz output. For .sig.zip output, it has been running for over 20 minutes now.

$ time sourmash sketch dna reference-plasmids.fasta --singleton -o reference-plasmids.sig

== This is sourmash version 4.8.14. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

computing signatures for files: reference-plasmids.fasta
Computing a total of 1 signature(s) for each input.
calculated 268043 signatures for 268043 sequences in reference-plasmids.fasta
saved 268043 signature(s) to 'reference-plasmids.sig'. Note: signature license is CC0.

real    0m50.423s
user    0m49.148s
sys     0m1.039s
$ time sourmash sketch dna reference-plasmids.fasta --singleton -o reference-plasmids.sig.gz

== This is sourmash version 4.8.14. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

computing signatures for files: reference-plasmids.fasta
Computing a total of 1 signature(s) for each input.
calculated 268043 signatures for 268043 sequences in reference-plasmids.fasta
saved 268043 signature(s) to 'reference-plasmids.sig.gz'. Note: signature license is CC0.

real    0m51.101s
user    0m49.958s
sys     0m0.882s
$ 
$ time sourmash sketch dna reference-plasmids.fasta --singleton -o reference-plasmids.sig.zip

== This is sourmash version 4.8.14. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

computing signatures for files: reference-plasmids.fasta
Computing a total of 1 signature(s) for each input.

Current relative file sizes don't match expectations either.

$ ls -lh reference-plasmids.sig*
-rw-r--r-- 1 daniel.standage staff 127M Mar 26 10:13 reference-plasmids.sig
-rw-r--r-- 1 daniel.standage staff  11M Mar 26 10:15 reference-plasmids.sig.gz
-rw-r--r-- 1 daniel.standage staff  37M Mar 26 10:36 reference-plasmids.sig.zip
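One possible contributor to the size difference, offered only as an illustration: a zip archive compresses every member independently and adds a header and a central directory entry per member, whereas a single gzip stream shares its compression context across all records. The toy below uses Python's standard library only, not sourmash. The record layout, the signatures/N.sig member names, and the assumption that a .sig.zip holds one member per signature are all hypothetical here.

```python
import gzip
import io
import json
import random
import zipfile


def make_records(n, seed=0):
    """Build n small JSON records that loosely resemble tiny sketches (toy data)."""
    rng = random.Random(seed)
    return [
        json.dumps(
            {"name": f"seq{i}", "ksize": 31,
             "mins": [rng.getrandbits(64) for _ in range(5)]}
        ).encode()
        for i in range(n)
    ]


def gzip_size(records):
    """Size in bytes of all records compressed as one gzip stream."""
    return len(gzip.compress(b"\n".join(records)))


def zip_size(records):
    """Size in bytes of a zip archive holding one deflated member per record."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i, rec in enumerate(records):
            # Hypothetical member naming, one archive entry per record.
            zf.writestr(f"signatures/{i}.sig", rec)
    return len(buf.getvalue())


if __name__ == "__main__":
    records = make_records(500)
    print("gzip bytes:", gzip_size(records))
    print("zip bytes: ", zip_size(records))
```

With many small records, the per-member overhead makes the zip archive larger than the single gzip stream. This says nothing about the slowdown itself, only about why the relative sizes could come out this way.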

This isn't an obstacle to me using sourmash—I'm very happy gzip compression is supported—but it's been a sufficiently consistent observation that I thought I should report it.

Running on macOS 15.3.2 in case that's important.

standage, Mar 26 '25 14:03

super strange! thanks for reporting! definitely seems like a bug 😭

Note that you might try the singlesketch command from the branchwater plugin.

ctb, Mar 26 '25 15:03