kevlar
Filtering reads with ambiguous content
Our current handling of reads with ambiguous content is as follows.
- For counting, kevlar uses khmer's default bulk loading behavior, which (I think) is to ignore all k-mers with ambiguous content. It's also possible that ambiguous characters aren't "handled" at all, since MurmurHash will happily take any arbitrary input.
- For finding novel k-mers, kevlar discards any reads with non-`[ACGT]` characters.
- Now that mate sequences are retained along with a novel read, no checks for ambiguous content are made on mate sequences at any step.
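To make the second point concrete, here is a minimal sketch of an "any non-ACGT character" check. The names `has_ambiguous` and `filter_reads` are hypothetical and this is not kevlar's actual code. It assumes reads arrive as `(name, sequence)` pairs with uppercase sequences.

```python
import re

# Matches any character outside the unambiguous DNA alphabet.
# Assumption: sequences are uppercase, so lowercase bases count as ambiguous.
NON_ACGT = re.compile(r'[^ACGT]')


def has_ambiguous(sequence):
    """Return True if the sequence contains any non-ACGT character."""
    return NON_ACGT.search(sequence) is not None


def filter_reads(reads):
    """Yield only the (name, sequence) pairs whose sequence is entirely ACGT."""
    for name, sequence in reads:
        if not has_ambiguous(sequence):
            yield name, sequence
```

A single `N` is enough to drop a read under this rule, which is the all-or-nothing behavior the suggestions below would relax.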
I'd suggest the following.
- [ ] Write some tests to verify how reads/k-mers are handled in bulk loading.
- [ ] Consider setting(s) that allow a user to specify a maximum number or proportion (or both) of ambiguous nucleotides in a read; then split the read on ambiguous nucleotides and look for interesting k-mers in the resulting fragments of length ≥ k.
- [ ] Apply a similar setting (could be the same setting) to mate sequences: only retain mates that satisfy some count/proportion criteria for ambiguous nucleotides. We don't want to try to map reads with tons of Ns.
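The last two items could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the `max_count` and `max_prop` parameters, and the choice to treat any non-ACGT character as ambiguous. None of it is an existing kevlar or khmer API.

```python
import re


def count_ambiguous(sequence):
    """Count characters outside the ACGT alphabet."""
    return len(re.findall(r'[^ACGT]', sequence))


def passes_ambiguity_filter(sequence, max_count=None, max_prop=None):
    """Check a read or mate against optional count and proportion limits.

    A limit of None means that criterion is not applied.
    """
    n = count_ambiguous(sequence)
    if max_count is not None and n > max_count:
        return False
    if max_prop is not None and sequence and n / len(sequence) > max_prop:
        return False
    return True


def unambiguous_fragments(sequence, ksize):
    """Split on ambiguous characters; yield (offset, fragment) pairs.

    Only fragments long enough to contain at least one k-mer are yielded.
    """
    for match in re.finditer(r'[ACGT]+', sequence):
        if len(match.group()) >= ksize:
            yield match.start(), match.group()
```

With this shape, the same `passes_ambiguity_filter` call could serve both reads and mates, and the offsets from `unambiguous_fragments` would let k-mer positions be reported relative to the original read.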