sketching files containing many small sequences: `manysketch` is astonishingly fast

Open ctb opened this issue 7 months ago • 4 comments

I'm trying to sketch the RVDB, the Reference Viral Database. The clustered file is ~600 MB.

sourmash scripts manysketch C-RVDBvCurrent.manysketch.csv -o C-RVDBvCurrent.manysketch.zip -p dna,k=21,scaled=1000 --singleton

took about 5 minutes.

sourmash sketch dna -p k=21 C-RVDBvCurrent.fasta.gz -o C-RVDBvCurrent.sig.zip --singleton

didn't finish in 24 hours.

What's the reason!? As I understand it, manysketch isn't multithreaded when reading a single FASTA file, so multithreading can't explain the difference. Presumably it's just the Python for-loop penalty and/or the use of screed!? Wow.
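
A rough way to test the "Python loop penalty" guess would be to time a bare pure-Python pass over the records with no sketching at all. This is only a minimal sketch using the standard library: `iter_fasta` and `time_parse` are hypothetical helpers, not part of sourmash or screed, and they say nothing about how either tool actually parses files. If this pass finishes in minutes, the per-record cost is more likely in building or saving one signature per sequence than in the parsing itself.

```python
import gzip
import time
from typing import Iterator, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def time_parse(path: str) -> Tuple[int, int, float]:
    """Return (n_records, n_bases, seconds) for one pass over a FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    start = time.perf_counter()
    n_records = n_bases = 0
    with opener(path, "rt") as handle:
        for _name, seq in iter_fasta(handle):
            n_records += 1
            n_bases += len(seq)
    return n_records, n_bases, time.perf_counter() - start
```

For example, `time_parse("C-RVDBvCurrent.fasta.gz")` would give the record count, total bases, and wall-clock seconds for parsing alone.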

On a mostly unrelated note, the sig.zip file is larger than the FASTA file. So that sucks.

ctb, Jul 14 '24 15:07