sourmash
sketching files containing many small sequences: `manysketch` is astonishingly fast
I'm trying to sketch RVDB, the Reference Viral Database. The clustered file is ~600 MB.
sourmash scripts manysketch C-RVDBvCurrent.manysketch.csv -o C-RVDBvCurrent.manysketch.zip -p dna,k=21,scaled=1000 --singleton
took about 5 minutes.
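For anyone reproducing this: the `C-RVDBvCurrent.manysketch.csv` input isn't shown above. Here is a minimal sketch of how such a file could be generated, assuming the `name,genome_filename,protein_filename` header that I recall from the branchwater plugin docs (please check against your installed version). The helper name `write_manysketch_csv` and the `name` value are hypothetical, chosen only for illustration.

```python
import csv
from pathlib import Path


def write_manysketch_csv(fasta_path, csv_path, name):
    """Write a one-row manysketch input CSV pointing at a single DNA FASTA file.

    Hypothetical helper: the column names are an assumption based on the
    branchwater plugin documentation, not taken from the original post.
    """
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        # Assumed header expected by `sourmash scripts manysketch`.
        writer.writerow(["name", "genome_filename", "protein_filename"])
        # DNA input only, so the protein column is left empty.
        writer.writerow([name, str(fasta_path), ""])
    return Path(csv_path)
```

Calling `write_manysketch_csv("C-RVDBvCurrent.fasta.gz", "C-RVDBvCurrent.manysketch.csv", "C-RVDBvCurrent")` would produce a CSV with one data row referencing the clustered FASTA file.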
sourmash sketch dna -p k=21 C-RVDBvCurrent.fasta.gz -o C-RVDBvCurrent.sig.zip --singleton
didn't finish in 24 hours.
What's the reason!? As I understand it, `manysketch` isn't multithreaded when reading a single FASTA file, so the speedup can't be coming from multithreading. Presumably it's just the Python for-loop penalty and/or the use of screed!? Wow.
On a mostly unrelated note, the sig.zip file is larger than the FASTA file. So that sucks.