khmer
More thoughts on non-ACGT characters.
Spurred by #1540, I'm converging on two principles regarding non-ACGT:
- we shouldn't ever silently ignore an entire sequence record, because that leads to surprisingly inconsistent behavior between short reads and genomic sequences, as in the above where an entire genome vanishes from the k-mer counts. So we should either raise an exception OR automatically convert.
- automatic conversion to As or something is probably the right default behavior, unless --check-sequence or --pedantic is specified. In part this is because even NCBI records have non-ACGT IUPAC bases in them.
There is a fun cautionary tale in this twitter exchange that is relevant, I think; we want to get the k-mer breakdown at least approximately correct :). This is another reason I think automatic conversion is appropriate: if the sequence type is completely wrong (e.g. protein, or RNA, or text) then the k-mer counts will be visibly off, but this is rare compared to the situation where you are looking at mostly legit DNA sequence and there's an S in the middle of a 5 Mbp genome. (I used to think we should automatically convert until some threshold was reached, like "1% of your sequence is bad... STOP", but that's overly clever and will no doubt bite us.)
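A minimal sketch of the "automatically convert to As" default being discussed, assuming a plain regex-based approach (convert_to_a is an illustrative name, not khmer API):

```python
import re

# Replace every non-ACGT character with 'A', mirroring the proposed
# default "automatically convert" behavior.
NON_ACGT = re.compile(r'[^ACGT]')

def convert_to_a(sequence):
    """Return the sequence with all non-ACGT characters replaced by A."""
    return NON_ACGT.sub('A', sequence.upper())

print(convert_to_a('ACGTSNACGT'))  # -> ACGTAAACGT
```

This keeps the record length (and hence the number of k-mers) unchanged, which is the point: the record never silently vanishes.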
Last but not least, we should include a script (in khmer? screed?) that does a breakdown of the non-ACGT content of your data set so that people can debug more easily.
How do the non-ACGT letters get into the data in the first place?
Various intentional and unintentional bits of cleverness. Some software uses IUPAC codes to indicate observed ambiguity, for example.
So there is the case where the user is probably not doing what they think they are doing (feeding non-DNA sequence into khmer) and the cleverness case. For the latter, should we be clever and sample a replacement from the possibilities (e.g. replace S with random.choice(['G', 'C']))? Or even add one k-mer with each option? Or is that (trying to be clever)**2?
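The "sample a replacement from the possibilities" idea could be sketched as follows, assuming the standard IUPAC ambiguity codes (sample_bases and the fallback to 'A' for unknown characters are illustrative choices, not anything khmer does):

```python
import random

# Each IUPAC ambiguity code maps to the bases it can represent; pick one
# at random. Characters not in the map (stray text) fall back to 'A'.
IUPAC = {
    'R': 'AG', 'Y': 'CT', 'S': 'GC', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}

def sample_bases(sequence):
    """Resolve each ambiguity code to one of its possible bases."""
    return ''.join(
        base if base in 'ACGT' else random.choice(IUPAC.get(base, 'A'))
        for base in sequence.upper()
    )

resolved = sample_bases('ACGSNT')
assert all(b in 'ACGT' for b in resolved) and len(resolved) == 6
```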
I think that's trying to be too clever ;).
I wonder if we can look at the beginning of the file and do some simple statistics (on, say, the first 100kb of sequence) and then barf if it looks bad?
This might be a good use for issues with bad pairing as well - scripts that are told to expect pairing info and don't get any pairs could barf.
After the first 100kb we could then assume that all the data is "good format" and go from there. Will probably catch 90%+ of problems...
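The "look at the first 100kb and barf" idea could look something like this; looks_like_dna, the sample size, and the 5% threshold are all hypothetical choices for illustration:

```python
# Examine up to the first ~100 kb of sequence and report whether the
# non-ACGT fraction is low enough to plausibly be DNA.
def looks_like_dna(sequences, sample_size=100_000, max_bad_fraction=0.05):
    seen = bad = 0
    for seq in sequences:
        for base in seq.upper():
            seen += 1
            if base not in 'ACGT':
                bad += 1
            if seen >= sample_size:
                return bad / seen <= max_bad_fraction
    # Fewer than sample_size bases total: judge on what we saw.
    return seen == 0 or bad / seen <= max_bad_fraction

assert looks_like_dna(['ACGT' * 100])
assert not looks_like_dna(['XYZQ' * 100])
```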
Another option is to simply ignore any k-mers containing a non-ACGT character. I don't know how this will affect performance of processing short reads, but it's essentially what I do now for long genomic sequences (by splitting chromosome sequences by [^ACGT] in preprocessing).
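The splitting preprocessing described above can be sketched in a few lines; split_on_non_acgt is an illustrative name, and dropping chunks shorter than k is an assumption (they can't contain any k-mer anyway):

```python
import re

# Split long genomic sequence on runs of non-ACGT characters so that
# only clean stretches get k-merized.
def split_on_non_acgt(sequence, ksize):
    """Yield ACGT-only substrings long enough to hold at least one k-mer."""
    for chunk in re.split(r'[^ACGT]+', sequence.upper()):
        if len(chunk) >= ksize:
            yield chunk

chunks = list(split_on_non_acgt('ACGTACGTNNNNNTTTT', ksize=4))
# -> ['ACGTACGT', 'TTTT']
```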
@standage, that could solve counting/graph issues, but we should maybe still warn the user if they submit a grossly incorrect file.
more generally, we have a few different things we do with input data:
1. load it into count tables and graphs
2. trim reads
3. discard reads
ignoring non-ACGT works for 1, and maybe 3, but not necessarily for 2.
ugh, bioinformatics :)
Spurred by this comment and previous --
DNA validity checking is done in the following places:
- #1435 and #1491 collectively provided functionality to check ACTG when using screed/ReadParser bulk loading code. This is what all scripts should be using (the cleaned_seq attribute).
- is_valid_dna is called in check_and_* functions, used in consume_fasta*, abundance_distribution, trim_below_abundance, trim_on_abundance, find_spectral_error_positions, trim_on_stoptags, and HLLCounter::consume_string. I think most of this is now redundant with ReadParser. The trim* and find* calls should be able to be removed; and the HLLCounter stuff is a problem and the source of a bug (#1540).
- Following through on #1470 to remove ThreadedSequenceProcessor would simplify the overall discussion.
To my surprise, the CPython function Hashtable::consume does no checking...
Some comments:
- One of the goals of #1466 was to simplify a lot of this; where are we with that, @betatim?
- To reiterate, it seems likely that much of the check_and* code can be removed, which probably deals with @betatim's emotive complaint in #1511 ;).
- @standage in kevlar should do his own danged preprocessing if he doesn't want to use our bulk loading code ;).
My biggest concern is how to figure out (for ourselves), test, and explain (to others) what we're actually doing with sequence preprocessing. One thought is to provide a function purely for debugging that takes an input file, runs it through our sequence processing code, and then outputs it with those changes; then we can write tests against that behavior as well. May be pointless but may also be friendly.
If someone were to want to get started on removing the check_and* code, a good start would be to just remove it from the various functions and see if the md5sum tests pass - that would show that it's not actually needed any more (which is my suspicion).
Started removing some code in #1590; it seems likely that the check_and* code can be removed from consume_fasta too.
...as indeed it can. Anyway, #1590.
> @standage in kevlar should do his own danged preprocessing if he doesn't want to use our bulk loading code ;)

:)
I use the bulk loading code for...loading count tables/graphs, but I iterate over screed records and k-mers to do abundance checking (which is where I ran into issues). I added some temporary basic preprocessing to kevlar last night, but if khmer's get/get_count and hash/hash_dna functions are explicitly not going to support non-ACGT characters, there's no problem making these changes to kevlar permanent.
> I use the bulk loading code for...loading count tables/graphs, but I iterate over screed records and k-mers to do abundance checking (which is where I ran into issues). I added some temporary basic preprocessing to kevlar last night, but if khmer's get/get_count and hash/hash_dna functions are explicitly not going to support non-ACGT characters, there's no problem making these changes to kevlar permanent.
yep.
I'm adding some tests to kevlar to verify assumptions regarding non-ACGT are met. I'm getting the following error.
>>> import khmer
>>> ct = khmer.Counttable(15, 1e6, 3)
>>> ct.consume('TTGACTTGACTTCAG')
1
>>> ct.consume('TTGACTTGACTTCAG')
1
>>> assert ct.get('TTGACTTGACTTCAG') > 0
>>>
>>> ct.consume('TTCAGNNNNNTTCAG')
0
>>> try:
... _ = ct.get('TTCAGNNNNNTTCAG')
... except e:
... print(str(e))
...
libc++abi.dylib: terminating with uncaught exception of type khmer::khmer_exception: Invalid base in read
Abort trap: 6
It looks like we need to make sure C++ exceptions get handed off to Python so that they can be handled at the Python level.
More thoughts: I think one reason to settle on not doing anything to invalid DNA is that we are seeing a plethora of use cases for different approaches: ignoring non-ACTG, splitting on non-ACGT, replacing non-ACGT with N. So that's good not to dictate internally. But we can provide fast defaults with bulk loading, which is good, I think.
Yeah, if we want performance AND flexibility we need to implement the different use cases at the C++ level. Using a performant and reasonable default for bulk loading but providing different (maybe slower) options at the Python level is a fine compromise for now, I think.
For instance, I'm mostly interested in the skipping / splitting approach for long genomic sequences, which are typically much smaller in total volume than Fastq/BAM files full of millions of reads. Sequence handling in Python is fine for the former, but I want to avoid it when possible for the latter.
Why not provide some standard/frequently used strategies as options for methods that consume a sequence or a file? E.g. consume_fasta(fname, strategy=None), which continues to do what is currently the default, and then provide 'split', 'error', 'ignore', etc., a little bit like the unicode en/decoding in Python.
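The proposed strategy option could be sketched in pure Python; clean_sequence is a hypothetical helper (not khmer API), and only the strategy names from the comment are used:

```python
import re

# Strategy-based cleaning, modeled loosely on Python's unicode error
# handlers: 'error' raises, 'split' breaks on non-ACGT runs, 'ignore'
# drops the offending characters. Returns a list of clean sequences.
def clean_sequence(sequence, strategy='error'):
    sequence = sequence.upper()
    if strategy == 'error':
        if re.search(r'[^ACGT]', sequence):
            raise ValueError('non-ACGT character in sequence')
        return [sequence]
    if strategy == 'split':
        return [s for s in re.split(r'[^ACGT]+', sequence) if s]
    if strategy == 'ignore':
        return [re.sub(r'[^ACGT]', '', sequence)]
    raise ValueError('unknown strategy: %r' % strategy)

assert clean_sequence('ACGTNACGT', strategy='split') == ['ACGT', 'ACGT']
assert clean_sequence('ACGTNACGT', strategy='ignore') == ['ACGTACGT']
```

Note that 'ignore' changes which k-mers exist (it joins bases across the removed character), whereas 'split' does not; that difference matters for graph structure.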
yes, I think we're heading in this direction.
If methods that accept single kmers assume that you know what you are doing then you can implement any non-standard processing in python and pass in the kmers.
Does that work for you @standage ?
👍
Three things left before we can close this issue:
- #1590 should be merged;
- behavior laid out in #1590 should be documented;
- write up this comment as an issue
Seems that the way sourmash handled it is correct.
https://github.com/dib-lab/sourmash/issues/137