CMash
CMash copied to clipboard
Fast and accurate set similarity estimation via containment min hash
Dear CMash team, I am wondering the CMash is a metric, meaning, is CMash (A, B) the same with CMash (B, A), or CMash (A, B) + CMash (B, C)...
Definitions: "new method" = use a very large k-mer size, put in ternary search trie, use prefix matches to infer smaller k-mer size containment values "old method" = train and...
in yaml and docs
When running `StreamingQueryDNADatabase.py`, in reality, we need only the K-mers in the sample that exist in the training database sketches. As such, it's possible to: 1. Dump all the training...
Will need to: - [x] modularize the content of `MakeStreamingDNADatabase.py` - [x] create methods to make the TST the old way - [x] create method to make TST without reading...
Current tests are end-to-end integration tests that makes sure scripts execute successfully. There is much more testing that could be done including: - [x] adding unit tests to the `tests`...
For example, using the [Metalign](https://github.com/nlapier2/Metalign) default training database (199807 genomes) and running ```bash python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k 60 -v python MakeStreamingPrefilter.py ${outputDir}/${cmashDatabase} ${outputDir}/${prefilterName} 30-60-10 ``` results in...
Basically, show an end-to-end way to: - [x] re-train CMash - [ ] add to an existing CMash database - [ ] retrain in a fashion suitable for [Metalign](https://github.com/nlapier2/Metalign) Work...
eg. if ``` python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k ${maxKsize} -v ``` and `${outputDir}` doesn't exist, then `MakeStreamingDNADatabase.py` should create it!
Specifically: ``` # then normalize by the number of unique k-mers (to get the containment index) # In essence, this is the containment index, restricted to unique k-mers. This effectively...