CMash

Fast and accurate set similarity estimation via containment min hash

Results 14 CMash issues

Dear CMash team, I am wondering whether CMash is a metric; that is, is CMash(A, B) the same as CMash(B, A), or CMash(A, B) + CMash(B, C)...
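
The symmetry half of this question can be explored on toy data. The sketch below computes the exact containment index |A ∩ B| / |A| on small k-mer sets; it is not CMash's estimator or API, and the helper names `kmers` and `containment` and the two sequences are made up for illustration. Because the denominator is the size of the first argument only, swapping the arguments generally changes the value.

```python
def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Exact containment index |A & B| / |A| (illustrative helper, not CMash code)."""
    return len(a & b) / len(a) if a else 0.0


# Toy sequences (assumed for illustration): A is a prefix of B,
# so every k-mer of A also occurs in B, but not the other way around.
A = kmers("ACGTACGT", 4)
B = kmers("ACGTACGTTTGACCA", 4)

c_ab = containment(A, B)  # fraction of A's k-mers found in B
c_ba = containment(B, A)  # fraction of B's k-mers found in A
print(c_ab, c_ba)
```

Here A has 4 distinct 4-mers and B has 11, so the two directions give 1.0 and 4/11 respectively.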

Definitions:
- "new method" = use a very large k-mer size, put in ternary search trie, use prefix matches to infer smaller k-mer size containment values
- "old method" = train and...

priority
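
The prefix idea behind the "new method" defined above can be sketched with plain Python sets standing in for the ternary search trie. This is only an illustration under stated assumptions: the helpers `truncate` and `prefix_containment`, the toy sequences, and the choice of k values are all hypothetical, and the real scripts may treat edge cases (for example k-mers near the end of a sequence, or reverse complements) differently.

```python
def kmers(seq, k):
    """Return the set of all k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def truncate(kmer_set, k):
    """Length-k prefixes of every k-mer in the set."""
    return {km[:k] for km in kmer_set}


def prefix_containment(query_kmers, ref_kmers, k):
    """Containment of the query's length-k prefixes in the reference's."""
    q = truncate(query_kmers, k)
    r = truncate(ref_kmers, k)
    return len(q & r) / len(q) if q else 0.0


K = 8  # the single large k-mer size that is actually stored
ref = kmers("ACGTACGTTTGACCAGGATC", K)
query = kmers("ACGTACGTTTGACCTGGATC", K)  # one substitution relative to ref

# Smaller k-mer sizes are inferred from the stored K-mers by prefix matching.
for k in (4, 6, 8):
    print(k, prefix_containment(query, ref, k))
```

Only the size-K k-mers are built from the sequences; every smaller k is derived from them, which is what makes a prefix-searchable structure such as a trie attractive.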

When running `StreamingQueryDNADatabase.py`, we only need the k-mers in the sample that exist in the training database sketches. As such, it's possible to: 1. Dump all the training...

enhancement
good first issue
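
The observation in the issue above, that only sample k-mers present in the training sketches matter, amounts to a set-membership filter. The following is a minimal sketch of that filter alone, not of the truncated step list; `filter_sample_kmers`, the toy training set, and the toy reads are hypothetical, and canonical/reverse-complement handling is deliberately left out.

```python
def kmers(seq, k):
    """Yield every k-mer of seq in order (duplicates included)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def filter_sample_kmers(sample_seqs, training_kmers, k):
    """Keep only the sample k-mers that occur in the training k-mer set."""
    kept = []
    for seq in sample_seqs:
        kept.extend(km for km in kmers(seq, k) if km in training_kmers)
    return kept


# Hypothetical stand-ins for the training sketches and the sample reads.
training_kmers = {"ACGT", "CGTA", "TTGA"}
sample = ["ACGTAGG", "TTTTGAC"]

relevant = filter_sample_kmers(sample, training_kmers, 4)
print(relevant)
```

Holding the training k-mers in a `set` keeps each membership check at constant expected time, so the filter is linear in the size of the sample.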

Will need to:
- [x] modularize the content of `MakeStreamingDNADatabase.py`
- [x] create methods to make the TST the old way
- [x] create method to make TST without reading...

enhancement
priority

Current tests are end-to-end integration tests that make sure the scripts execute successfully. There is much more testing that could be done, including: - [x] adding unit tests to the `tests`...

good first issue

For example, using the [Metalign](https://github.com/nlapier2/Metalign) default training database (199807 genomes) and running
```bash
python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k 60 -v
python MakeStreamingPrefilter.py ${outputDir}/${cmashDatabase} ${outputDir}/${prefilterName} 30-60-10
```
results in...

enhancement
good first issue

Basically, show an end-to-end way to:
- [x] re-train CMash
- [ ] add to an existing CMash database
- [ ] retrain in a fashion suitable for [Metalign](https://github.com/nlapier2/Metalign)

Work...

E.g., if
```
python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k ${maxKsize} -v
```
is run and `${outputDir}` doesn't exist, then `MakeStreamingDNADatabase.py` should create it!

good first issue
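
One way the requested behaviour could be implemented is with the standard library's `os.makedirs(..., exist_ok=True)`, called before the output file is opened. This is a sketch only: `ensure_parent_dir` and the file name `cmash_db.h5` are hypothetical, and where such a call would belong inside `MakeStreamingDNADatabase.py` is not shown in the issue.

```python
import os
import tempfile


def ensure_parent_dir(output_path):
    """Create the directory that will hold output_path if it is missing."""
    parent = os.path.dirname(os.path.abspath(output_path))
    os.makedirs(parent, exist_ok=True)  # no error if it already exists
    return parent


# Demonstration in a throwaway temp directory (paths are illustrative).
base = tempfile.mkdtemp()
target = os.path.join(base, "new_output_dir", "cmash_db.h5")

created = ensure_parent_dir(target)
print(os.path.isdir(created))
```

Because `exist_ok=True` makes the call idempotent, it is safe to run unconditionally rather than checking for the directory first.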

Specifically: ``` # then normalize by the number of unique k-mers (to get the containment index) # In essence, this is the containment index, restricted to unique k-mers. This effectively...