sourmash
Slow DNA sketching in singleton mode, but only with zip compression
Greetings!
I think I read somewhere in the sourmash documentation that zip compression is preferred for sketches because, in practical terms, it results in the smallest sig files. After some recent testing, I wanted to report an observation: running sourmash sketch dna --singleton is orders of magnitude slower when writing to a .sig.zip file than to a .sig or .sig.gz file. I didn't observe this behavior in non-singleton mode, nor when I performed the file system equivalent of singleton mode, i.e., splitting a single FASTA file with hundreds of thousands of sequences into numerous FASTA files, each with a single sequence.
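For anyone who wants to reproduce the "file system equivalent of singleton mode" comparison, here is a minimal sketch of how such a split could be done. The function name split_fasta and the seq_NNNNNNN.fasta output naming are hypothetical, chosen only for illustration; this is not necessarily how the split was done for the timings in this report.

```shell
# Hypothetical helper (not part of sourmash): write each FASTA record
# to its own file, as an approximation of what --singleton does per sequence.
# Usage: split_fasta INPUT.fasta OUTDIR
split_fasta() {
    mkdir -p "$2"
    awk -v outdir="$2" '
        /^>/ {
            # A header line starts a new record. Close the previous output
            # file first so only one file is ever open at a time, which
            # matters when the input holds hundreds of thousands of records.
            if (out != "") close(out)
            n++
            out = sprintf("%s/seq_%07d.fasta", outdir, n)
        }
        # Copy the header and its sequence lines to the current output file;
        # anything before the first header is skipped.
        out != "" { print > out }
    ' "$1"
}
```

Calling split_fasta reference-plasmids.fasta split/ would then leave one single-record FASTA file per sequence in split/, which can be sketched without --singleton.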
In my most recent testing, sourmash sketch dna --singleton ran successfully in 50-51 seconds for both .sig and .sig.gz output. For .sig.zip output, it has been running for over 20 minutes now.
$ time sourmash sketch dna reference-plasmids.fasta --singleton -o reference-plasmids.sig
== This is sourmash version 4.8.14. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
computing signatures for files: reference-plasmids.fasta
Computing a total of 1 signature(s) for each input.
calculated 268043 signatures for 268043 sequences in reference-plasmids.fasta
saved 268043 signature(s) to 'reference-plasmids.sig'. Note: signature license is CC0.
real 0m50.423s
user 0m49.148s
sys 0m1.039s
$ time sourmash sketch dna reference-plasmids.fasta --singleton -o reference-plasmids.sig.gz
== This is sourmash version 4.8.14. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
computing signatures for files: reference-plasmids.fasta
Computing a total of 1 signature(s) for each input.
calculated 268043 signatures for 268043 sequences in reference-plasmids.fasta
saved 268043 signature(s) to 'reference-plasmids.sig.gz'. Note: signature license is CC0.
real 0m51.101s
user 0m49.958s
sys 0m0.882s
$
$ time sourmash sketch dna reference-plasmids.fasta --singleton -o reference-plasmids.sig.zip
== This is sourmash version 4.8.14. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
computing signatures for files: reference-plasmids.fasta
Computing a total of 1 signature(s) for each input.
The current relative file sizes don't match expectations either: the .sig.zip file is already more than three times the size of the .sig.gz file.
$ ls -lh reference-plasmids.sig*
-rw-r--r-- 1 daniel.standage staff 127M Mar 26 10:13 reference-plasmids.sig
-rw-r--r-- 1 daniel.standage staff 11M Mar 26 10:15 reference-plasmids.sig.gz
-rw-r--r-- 1 daniel.standage staff 37M Mar 26 10:36 reference-plasmids.sig.zip
This isn't an obstacle to me using sourmash—I'm very happy gzip compression is supported—but it's been a sufficiently consistent observation that I thought I should report it.
Running on MacOS 15.3.2 in case that's important.
super strange! thanks for reporting! definitely seems like a bug 😭
Note that you might try the singlesketch command from the branchwater plugin.