CMash
Gzipping all training files results in a nice reduction: add feature that allows scripts/modules to handle this
For example, using the Metalign default training database (199807 genomes) and running
```
python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k 60 -v
python MakeStreamingPrefilter.py ${outputDir}/${cmashDatabase} ${outputDir}/${prefilterName} 30-60-10
```
results in the following uncompressed files:

```
16G  Mar 22 03:39 cmash_db_n1000_k60.h5
9.3G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf
6.9G Mar 22 04:34 cmash_db_n1000_k60.tst
```

yet after gzipping:

```
4.6G Mar 22 03:39 cmash_db_n1000_k60.h5.gz
3.6G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf.gz
3.6G Mar 22 04:34 cmash_db_n1000_k60.tst.gz
```

so roughly 2-4x compression.
To support this, we would need to either:
- [ ] Enable `MakeStreamingDNADatabase.py` and `MakeStreamingPrefilter.py` to detect compressed training data and decompress it in the script, or (better yet)
- [ ] Enable decompression directly in the modules `MinHash.py` and `Query.py` themselves.
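A minimal sketch of how either option could detect and transparently handle gzipped input. The helper name `open_maybe_gzipped` is hypothetical (not part of the current CMash API); it sniffs the two-byte gzip magic number rather than trusting a `.gz` extension, so existing uncompressed training files keep working unchanged:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_maybe_gzipped(path, mode="rt"):
    """Open a training file, decompressing transparently if it is gzipped.

    Hypothetical helper illustrating the proposed feature; it could be
    called wherever MinHash.py / Query.py currently call open().
    """
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, mode)
    return open(path, mode)
```

Because the check is on file content, the scripts would not need any new command-line flags: gzipped and plain FASTA files could be mixed freely in `${trainingFiles}`.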