CMash issues

Is CMash metric?

4

Dear CMash team, I am wondering whether CMash is a metric, that is, whether CMash(A, B) is the same as CMash(B, A), or CMash(A, B) + CMash(B, C)...

jianshu93
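The symmetry part of this question can be explored with a minimal sketch. It assumes the quantity of interest is the containment index |A ∩ B| / |A| mentioned elsewhere in this tracker, and it uses exact k-mer sets rather than CMash's sketches. The helper names `kmers` and `containment` are hypothetical and are not part of CMash.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Containment index of a in b: size of the intersection divided by size of a."""
    return len(a & b) / len(a)


# Toy sequences (assumed for illustration): the first is a prefix of the second.
A = kmers("ACGTACGT", 4)
B = kmers("ACGTACGTTTGACCA", 4)

c_ab = containment(A, B)  # every k-mer of A occurs in B
c_ba = containment(B, A)  # only some k-mers of B occur in A

print(c_ab, c_ba)
```

Because the denominator depends on which set comes first, the two values differ whenever one set is much larger than the other, so an exact containment index defined this way is not symmetric.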

Multiple k-mer sizes confirmation and testing

2

Definitions: "new method" = use a very large k-mer size, put the k-mers in a ternary search trie, and use prefix matches to infer containment values for smaller k-mer sizes; "old method" = train and...

dkoslicki

priority
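The prefix idea behind the "new method" can be sketched as follows. This is only an illustration under stated assumptions: a plain Python set of prefixes stands in for the ternary search trie, exact k-mer sets stand in for sketches, and the names `kmers`, `prefix_containment` and `K_MAX` are hypothetical rather than CMash functions.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefix_containment(reference_kmers, sample_seq, k):
    """Containment at size k, inferred from the k-length prefixes of larger reference k-mers."""
    # Truncate each large reference k-mer to its first k characters.
    ref_prefixes = {kmer[:k] for kmer in reference_kmers}
    sample = kmers(sample_seq, k)
    return len(ref_prefixes & sample) / len(ref_prefixes)


K_MAX = 6  # assumed maximum k-mer size for this toy example
reference = kmers("ACGTACGTGA", K_MAX)

# One set of large k-mers yields containment values for several smaller sizes.
same = {k: prefix_containment(reference, "ACGTACGTGA", k) for k in (4, 5, 6)}
partial = prefix_containment(reference, "ACGTAC", 4)
print(same, partial)
```

One limitation of this sketch: k-mers that occur only in the last `K_MAX - k` positions of the reference are never the prefix of a full-length k-mer, so prefix-derived values can differ from values computed by training at size k directly. Comparing the two approaches is the kind of confirmation this issue asks for.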

Improved classification time with KMC

16

When running `StreamingQueryDNADatabase.py`, we need only the k-mers in the sample that also occur in the training database sketches. As such, it's possible to: 1. Dump all the training...

dkoslicki

enhancement

good first issue
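The observation above, that only sample k-mers also present in the training sketches matter, can be illustrated with a small sketch. It does not use KMC and does not reproduce the steps elided in the issue text. In-memory Python sets stand in for the k-mer dumps, and `stream_kmers` and `relevant_kmers` are hypothetical names.

```python
def stream_kmers(reads, k):
    """Yield every k-mer from every read, one at a time."""
    for read in reads:
        for i in range(len(read) - k + 1):
            yield read[i:i + k]


def relevant_kmers(reads, training_kmers, k):
    """Keep only the sample k-mers that also occur in the training set."""
    return {kmer for kmer in stream_kmers(reads, k) if kmer in training_kmers}


# Toy data (assumed): two training k-mers and two short reads.
training = {"ACGT", "TTTT"}
reads = ["ACGTAC", "GGGGTTTTG"]

kept = relevant_kmers(reads, training, 4)
print(sorted(kept))
```

Any later query step then only has to handle `kept`, which is at most as large as the training set regardless of how large the sample is.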

For very large databases, creation of TST is slow and memory intensive

6

Will need to:
- [x] modularize the content of `MakeStreamingDNADatabase.py`
- [x] create methods to make the TST the old way
- [x] create method to make TST without reading...

dkoslicki

enhancement

priority

Testing environment

9

Current tests are end-to-end integration tests that make sure the scripts execute successfully. There is much more testing that could be done, including:
- [x] adding unit tests to the `tests`...

dkoslicki

good first issue

Gzipping all training files results in a nice reduction: add feature that allows scripts/modules to handle this

For example, using the [Metalign](https://github.com/nlapier2/Metalign) default training database (199807 genomes) and running
```bash
python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k 60 -v
python MakeStreamingPrefilter.py ${outputDir}/${cmashDatabase} ${outputDir}/${prefilterName} 30-60-10
```
results in...

dkoslicki

enhancement

good first issue
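One possible way for scripts to accept both plain and gzipped training files is to check the two-byte gzip magic number before opening. The sketch below uses only the Python standard library. The helper name `open_maybe_gzip` is hypothetical and is not an existing CMash function.

```python
import gzip


def open_maybe_gzip(path, mode="rt"):
    """Open path for reading, transparently decompressing it if it is gzipped."""
    # Gzip files start with the bytes 0x1f 0x8b, whatever the file extension is.
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, mode)
    return open(path, mode)
```

Checking the magic bytes rather than the `.gz` suffix means a compressed file with an unexpected name is still handled. A caller would use it like the built-in `open`, for example `with open_maybe_gzip(training_file) as fid:`.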

Create script to demonstrate how to re-train CMash

Basically, show an end-to-end way to:
- [x] re-train CMash
- [ ] add to an existing CMash database
- [ ] retrain in a fashion suitable for [Metalign](https://github.com/nlapier2/Metalign)

Work...

dkoslicki

In MakeStreamingDNADatabase.py, don't require output directory to exist

E.g., if
```
python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k ${maxKsize} -v
```
and `${outputDir}` doesn't exist, then `MakeStreamingDNADatabase.py` should create it!

dkoslicki

good first issue
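A minimal sketch of the requested behavior, assuming the script receives the output file path as a string: create the parent directory before writing. The helper name `ensure_parent_dir` is hypothetical. `os.makedirs` with `exist_ok=True` is standard-library behavior.

```python
import os


def ensure_parent_dir(output_path):
    """Create the directory that will contain output_path if it does not exist yet."""
    parent = os.path.dirname(os.path.abspath(output_path))
    # exist_ok=True makes this a no-op when the directory is already there.
    os.makedirs(parent, exist_ok=True)
    return parent
```

Calling this once, right after argument parsing and before the database file is opened for writing, would cover the case described above without changing behavior when the directory already exists.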

In post-processing, find correct denominator

Specifically: ``` # then normalize by the number of unique k-mers (to get the containment index) # In essence, this is the containment index, restricted to unique k-mers. This effectively...

dkoslicki


Metadata

Is CMash metric?

Multiple k-mer sizes confirmation and testing

Add KMC as requirement

Improved classification time with KMC

For very large databases, creation of TST is slow and memory intensive

Testing environment

Gzipping all training files results in a nice reduction: add feature that allows scripts/modules to handle this

Create script to demonstrate how to re-train CMash

In MakeStreamingDNADatabase.py, don't require output directory to exist

In post-processing, find correct denominator
