Fixing maximum kmer count kmc flag
Dear CoLoRd developers team!
Hope you are doing well.
Issue:
While working with CoLoRd in reference-based mode, we noticed strange behaviour when the reads are much shorter than the reference (read length << reference length). Our investigation traced it to huge nodes in the similarity graph, which occur far more often than expected.
The only reason it usually works is this line: https://github.com/refresh-bio/colord/blob/25b28600d0716805beffab6941eb2c6b5f77014a/src/colord/reads_sim_graph.cpp#L390 However, with proper filtering this condition should always be true.
We noticed this in reference-based mode because there is no such condition when a node is added to the graph from pseudo-reads.
Proposed Fix:
The fix is simply to switch to the correct flag of the kmc tool, `-cx` instead of `-cs`. Here is the relevant excerpt from the kmc help:
> -ci<value> - exclude k-mers occurring less than <value> times (default: 2)
> -cs<value> - maximal value of a counter (default: 255)
> -cx<value> - exclude k-mers occurring more of than <value> times (default: 1e9)
This should also fix the compression logic, since k-mers are supposed to be selected based on their count as well as their hash.
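
To illustrate the difference, here is a minimal Python sketch that models the two flags as the help text above describes them. It is not KMC's or CoLoRd's actual code: the function names, the toy counts, and the thresholds are hypothetical and chosen only for illustration.

```python
from collections import Counter

def filter_with_cs(counts, ci, cs):
    """Model of -ci/-cs: drop k-mers seen fewer than ci times and
    saturate every counter at cs. Frequent k-mers are NOT removed."""
    return {kmer: min(c, cs) for kmer, c in counts.items() if c >= ci}

def filter_with_cx(counts, ci, cx):
    """Model of -ci/-cx: drop k-mers seen fewer than ci times and
    drop k-mers seen more than cx times."""
    return {kmer: c for kmer, c in counts.items() if ci <= c <= cx}

# Hypothetical k-mer counts: "ACG" is highly repetitive.
counts = Counter({"ACG": 50, "CGT": 3, "GTA": 2, "TAC": 1})

with_cs = filter_with_cs(counts, ci=2, cs=10)
with_cx = filter_with_cx(counts, ci=2, cx=10)

print(with_cs)  # "ACG" survives, its counter is just capped at 10
print(with_cx)  # "ACG" is excluded entirely
```

Under this model, `-cs` only caps the reported count, so an over-frequent k-mer still reaches the similarity graph, whereas `-cx` removes it before graph construction.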
Testing:
We have performed thorough testing, including specific scenarios with short reads and long references. Feel free to test it yourself.
Acknowledgments:
Special thanks to @iam28th for their assistance in tracing the problem back to the kmc flag.
We hope this change will be helpful and improve results.
Best regards, Alexey