kevlar Filtering reads with ambiguous content

Open standage opened this issue 6 years ago • 0 comments

Our current handling of reads with ambiguous content is as follows.

For counting, kevlar uses khmer's default bulk loading behavior, which I think is to ignore all k-mers with ambiguous content. Or it might not actually "handle" ambiguous characters at all, since MurmurHash will happily hash any arbitrary input.
For finding novel k-mers, kevlar discards any read containing non-[ACGT] characters.
Now that mate sequences are retained along with a novel read, no checks for ambiguous content are made on mate sequences at any step.

I'd suggest the following.

[ ] Write some tests to verify how reads/k-mers are handled in bulk loading.
[ ] Consider setting(s) that allow a user to specify a maximum number or proportion (or both) of ambiguous nucleotides in a read. Reads within the limit would be split on ambiguous nucleotides, and we would then look for interesting k-mers in the resulting fragments with length ≥ k.
[ ] Apply a similar setting (could be the same setting) to mate sequences: only retain mates that satisfy some count/proportion criteria for ambiguous nucleotides. We don't want to try to map reads with tons of Ns.
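
To make the last two items concrete, here is a rough sketch of what the check and the split might look like. This is not kevlar code: the function names `ambiguity_ok` and `unambiguous_fragments` and the parameters `max_count` and `max_prop` are made up for illustration, and it assumes that anything other than A, C, G, or T (after uppercasing) counts as ambiguous.

```python
import re

# Assumption: any character other than A, C, G, or T is "ambiguous".
AMBIGUOUS = re.compile(r'[^ACGT]')


def ambiguity_ok(seq, max_count=None, max_prop=None):
    """Return True if seq satisfies the hypothetical count/proportion limits.

    A limit of None means that criterion is not applied.
    """
    seq = seq.upper()
    n_ambig = len(AMBIGUOUS.findall(seq))
    if max_count is not None and n_ambig > max_count:
        return False
    if max_prop is not None and seq and n_ambig / len(seq) > max_prop:
        return False
    return True


def unambiguous_fragments(seq, ksize):
    """Split seq on ambiguous characters, keeping fragments of length >= ksize."""
    return [frag for frag in AMBIGUOUS.split(seq.upper()) if len(frag) >= ksize]


read = 'ACGTACGTNACGTNNAC'
if ambiguity_ok(read, max_count=3, max_prop=0.2):
    # Only these fragments would be scanned for interesting k-mers.
    print(unambiguous_fragments(read, ksize=4))
```

The same `ambiguity_ok` check could be applied to a mate sequence before deciding whether to retain it, which is why a single setting might cover both cases.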

Feb 23 '18 17:02 standage