CMash
Gzipping all training files results in a nice reduction: add feature that allows scripts/modules to handle this
For example, using the Metalign default training database (199807 genomes) and running
```
python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k 60 -v
python MakeStreamingPrefilter.py ${outputDir}/${cmashDatabase} ${outputDir}/${prefilterName} 30-60-10
```
results in the following uncompressed files:

```
16G  Mar 22 03:39 cmash_db_n1000_k60.h5
9.3G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf
6.9G Mar 22 04:34 cmash_db_n1000_k60.tst
```

yet after gzipping:

```
4.6G Mar 22 03:39 cmash_db_n1000_k60.h5.gz
3.6G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf.gz
3.6G Mar 22 04:34 cmash_db_n1000_k60.tst.gz
```

so roughly 2-4x compression.
To support this, we would need to either:
- [ ] Enable `MakeStreamingDNADatabase.py` and `MakeStreamingPrefilter.py` to detect compressed training data and decompress it in the script, or (better yet)
- [ ] Enable decompression directly in the modules `MinHash.py` and `Query.py` themselves.
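A minimal sketch of how either option could detect and transparently handle gzipped input. The helper name `open_maybe_gzipped` is hypothetical (not part of the current CMash API); it sniffs the two-byte gzip magic number rather than trusting a `.gz` extension, so existing uncompressed training files keep working unchanged:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_maybe_gzipped(path, mode="rt"):
    """Open a training file, decompressing transparently if it is gzipped.

    Hypothetical helper illustrating the proposed feature; it could be
    called wherever MinHash.py / Query.py currently call open().
    """
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, mode)
    return open(path, mode)
```

Because the check is on file content, the scripts would not need any new command-line flags: gzipped and plain FASTA files could be mixed freely in `${trainingFiles}`.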