
More thoughts on non-ACGT characters.

Open ctb opened this issue 7 years ago • 21 comments

Spurred by #1540, I'm converging on two principles regarding non-ACGT:

  • we shouldn't ever silently ignore an entire sequence record, because that leads to surprisingly inconsistent behavior between short reads and genomic sequences, as in the above where an entire genome vanishes from the k-mer counts. So we should either raise an exception OR automatically convert.
  • automatic conversion to As or something is probably the right default behavior, unless --check-sequence or --pedantic is specified. In part this is because even NCBI records have non-ACGT IUPAC bases in them.

There is a fun cautionary tale in this Twitter exchange that is relevant, I think; we want to get the k-mer breakdown at least approximately correct :). This is another reason I think automatic conversion is appropriate - if the sequence type is completely wrong (e.g. protein, or RNA, or text) then the k-mer counts will be visibly off, but this is rare compared to the situation where you are looking at mostly legit DNA sequence and there's an S in the middle of a 5 Mbp genome. (I used to think we should automatically convert until some threshold was reached, like "1% of your sequence is bad... STOP", but that's overly clever and will no doubt bite us.)

Last but not least, we should include a script (in khmer? screed?) that does a breakdown of the non-ACGT content of your data set so that people can debug more easily.
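A minimal sketch of what such a breakdown script could look like (a hypothetical helper, not part of khmer or screed):

```python
import collections
import re

def non_acgt_breakdown(sequences):
    """Tally every non-ACGT character across an iterable of sequence strings."""
    counts = collections.Counter()
    for seq in sequences:
        counts.update(re.findall(r"[^ACGT]", seq.upper()))
    return counts
```

Run over a whole data set, this would make it obvious whether you're looking at a stray S in a genome or at a file full of protein sequence.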

ctb avatar Nov 24 '16 15:11 ctb

How do the non-ACGT letters get into the data in the first place?

betatim avatar Nov 25 '16 13:11 betatim

Various intentional and unintentional bits of cleverness. Some software uses IUPAC codes to indicate observed ambiguity, for example.

ctb avatar Nov 25 '16 18:11 ctb

So there is the case where the user is probably not doing what they think they are doing (feeding something other than DNA sequence into khmer) and the cleverness case. For the latter, should we be clever and sample a replacement from the possibilities (e.g. replace S with random.choice(['G', 'C']))? Or even add one k-mer with each option? Or is that (trying to be clever)**2?
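For concreteness, the sampling idea might look like this (a sketch; the ambiguity table is a subset of the IUPAC codes, and the helper name is made up):

```python
import random

# Subset of the IUPAC ambiguity codes and the bases each one stands for.
IUPAC = {"R": "AG", "Y": "CT", "S": "GC", "W": "AT",
         "K": "GT", "M": "AC", "N": "ACGT"}

def resolve_ambiguity(seq, rng=random):
    """Replace each ambiguity code with one randomly chosen possibility."""
    return "".join(rng.choice(IUPAC[c]) if c in IUPAC else c
                   for c in seq.upper())
```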

betatim avatar Nov 25 '16 19:11 betatim

I think that's trying to be too clever ;).

ctb avatar Nov 25 '16 19:11 ctb

I wonder if we can look at the beginning of the file and do some simple statistics (on, say, the first 100kb of sequence) and then barf if it looks bad?

This might be a good use for issues with bad pairing as well - scripts that are told to expect pairing info and don't get any pairs could barf.

After the first 100kb we could then assume that all the data is "good format" and go from there. Will probably catch 90%+ of problems...
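A rough version of that sniffing check (the thresholds here are illustrative, not anything khmer actually implements):

```python
def looks_like_dna(sequences, sample_size=100_000, max_bad_frac=0.01):
    """Inspect the first `sample_size` characters of sequence and report
    whether the non-ACGT fraction stays under `max_bad_frac`."""
    sampled = bad = 0
    for seq in sequences:
        for c in seq.upper():
            sampled += 1
            if c not in "ACGT":
                bad += 1
            if sampled >= sample_size:
                return bad / sampled <= max_bad_frac
    # Short input: judge whatever we saw.
    return sampled == 0 or bad / sampled <= max_bad_frac
```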

ctb avatar Dec 19 '16 14:12 ctb

Another option is to simply ignore any k-mers containing a non-ACGT character. I don't know how this will affect performance of processing short reads, but it's essentially what I do now for long genomic sequences (by splitting chromosome sequences by [^ACGT] in preprocessing).
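That splitting preprocessing boils down to something like the following (a sketch, not kevlar's actual code):

```python
import re

def split_on_non_acgt(seq, ksize):
    """Split a sequence at non-ACGT characters, keeping only the valid
    stretches long enough to contain at least one k-mer."""
    return [frag for frag in re.split(r"[^ACGT]+", seq.upper())
            if len(frag) >= ksize]
```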

standage avatar Dec 21 '16 05:12 standage

@standage, that could solve counting/graph issues, but we should maybe still warn the user if they submit a grossly incorrect file.

more generally, we have a few different things we do with input data:

  1. load it into count tables and graphs
  2. trim reads
  3. discard reads

ignoring non-ACGT works for 1, and maybe 3, but not necessarily for 2.

ugh, bioinformatics :)

ctb avatar Dec 22 '16 14:12 ctb

Spurred by this comment and previous --

DNA validity checking is done in the following places:

  • #1435 and #1491 collectively provided functionality to check ACTG when using screed/ReadParser bulk loading code. This is what all scripts should be using (the cleaned_seq attribute).
  • is_valid_dna is called in check_and_* functions, used in consume_fasta*, abundance_distribution, trim_below_abundance, trim_on_abundance, find_spectral_error_positions, trim_on_stoptags, and HLLCounter::consume_string. I think most of this is now redundant with ReadParser. The trim* and find* calls should be able to be removed; and the HLLCounter stuff is a problem and the source of a bug (#1540).
  • Following through on #1470 to remove ThreadedSequenceProcessor would simplify the overall discussion.

To my surprise, the CPython function Hashtable::consume does no checking...

Some comments:

  • One of the goals of #1466 was to simplify a lot of this; where are we with that, @betatim?
  • To reiterate, it seems likely that much of the check_and* code can be removed which probably deals with @betatim's emotive complaint in #1511 ;).
  • @standage in kevlar should do his own danged preprocessing if he doesn't want to use our bulk loading code ;).

My biggest concern is how to figure out (for ourselves), test, and explain (to others) what we're actually doing with sequence preprocessing. One thought is to provide a function purely for debugging that takes an input file, runs it through our sequence processing code, and then outputs it with those changes; then we can write tests against that behavior as well. May be pointless but may also be friendly.
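Such a debugging function could be as simple as echoing each record before and after a pluggable cleaning step (a sketch; the default cleaner here is an assumption for illustration, not khmer's actual behavior):

```python
def debug_preprocess(records, clean=lambda s: s.upper().replace("N", "A")):
    """Yield (name, original, cleaned) triples so tests can pin down
    exactly what the preprocessing step changed."""
    for name, seq in records:
        yield name, seq, clean(seq)
```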

If someone were to want to get started on removing the check_and* code a good start would be to just remove it from the various functions and see if the md5sum tests pass - that would show that it's not actually needed any more (which is my suspicion).

ctb avatar Jan 24 '17 15:01 ctb

Started removing some code in #1590; it seems likely that the check_and* code can be removed from consume_fasta too.

ctb avatar Jan 24 '17 15:01 ctb

...as indeed it can. Anyway, #1590.

ctb avatar Jan 24 '17 15:01 ctb

> @standage in kevlar should do his own danged preprocessing if he doesn't want to use our bulk loading code ;)

:)

I use the bulk loading code for...loading count tables/graphs, but I iterate over screed records and k-mers to do abundance checking (which is where I ran into issues). I added some temporary basic preprocessing to kevlar last night, but if khmer's get/get_count and hash/hash_dna functions are explicitly not going to support non-ACGT characters, there's no problem making these changes to kevlar permanent.
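That k-mer iteration, with invalid k-mers skipped, can be sketched in plain Python (a hypothetical helper, not kevlar's code):

```python
def valid_kmers(seq, ksize):
    """Yield only the k-mers made entirely of ACGT, silently skipping
    any window that contains another character."""
    seq = seq.upper()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize]
        if all(c in "ACGT" for c in kmer):
            yield kmer
```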

standage avatar Jan 24 '17 17:01 standage

> I use the bulk loading code for...loading count tables/graphs, but I iterate over screed records and k-mers to do abundance checking (which is where I ran into issues). I added some temporary basic preprocessing to kevlar last night, but if khmer's get/get_count and hash/hash_dna functions are explicitly not going to support non-ACGT characters, there's no problem making these changes to kevlar permanent.

yep.

ctb avatar Jan 24 '17 17:01 ctb

I'm adding some tests to kevlar to verify assumptions regarding non-ACGT are met. I'm getting the following error.

>>> import khmer
>>> ct = khmer.Counttable(15, 1e6, 3)
>>> ct.consume('TTGACTTGACTTCAG')
1
>>> ct.consume('TTGACTTGACTTCAG')
1
>>> assert ct.get('TTGACTTGACTTCAG') > 0
>>>
>>> ct.consume('TTCAGNNNNNTTCAG')
0
>>> try:
...     _ = ct.get('TTCAGNNNNNTTCAG')
... except Exception as e:
...     print(str(e))
...
libc++abi.dylib: terminating with uncaught exception of type khmer::khmer_exception: Invalid base in read
Abort trap: 6

It looks like we need to make sure C++ exceptions get handed off to Python so that they can be handled at the Python level.
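Until that happens, one workaround is to validate on the Python side before crossing into C++ (a sketch; `counttable` is assumed to be any object with a khmer-style .get() method):

```python
def safe_get(counttable, kmer):
    """Guard against the uncaught C++ 'Invalid base in read' abort by
    rejecting non-ACGT k-mers before they reach the C++ layer."""
    if any(c not in "ACGT" for c in kmer.upper()):
        raise ValueError("non-ACGT base in k-mer: %r" % kmer)
    return counttable.get(kmer)
```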

standage avatar Jan 24 '17 22:01 standage

More thoughts: I think one reason to settle on not doing anything special to invalid DNA is that we are seeing a plethora of use cases for different approaches: ignoring non-ACTG, splitting on non-ACGT, replacing non-ACGT with N. So it's good not to dictate a single approach internally. But we can provide fast defaults with bulk loading, which is good, I think.

ctb avatar Jan 26 '17 21:01 ctb

Yeah, if we want performance AND flexibility we need to implement the different use cases at the C++ level. Using a performant and reasonable default for bulk loading but providing different (maybe slower) options at the Python level is a fine compromise for now, I think.

For instance, I'm mostly interested in the skipping / splitting approach for long genomic sequences, which are typically much smaller in total volume than Fastq/BAM files full of millions of reads. Sequence handling in Python is fine for the former, but I want to avoid it when possible for the latter.

standage avatar Jan 27 '17 00:01 standage

Why not provide some standard/frequently used strategies as options for methods that consume a sequence or a file?

consume_fasta(fname, strategy=None), which continues to do what is currently the default, and then provide 'split', 'error', 'ignore', etc., a little like the unicode en/decoding error handlers in Python.
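Spelled out, the strategy dispatch might look like this (names and behaviors are illustrative, not an implemented khmer API):

```python
import re

def apply_strategy(seq, strategy=None):
    """Handle non-ACGT characters per strategy, loosely modelled on
    Python's unicode error handlers. Returns a list of fragments."""
    seq = seq.upper()
    if strategy is None:
        return [seq]                          # current default: pass through
    if strategy == "error":
        if re.search(r"[^ACGT]", seq):
            raise ValueError("non-ACGT character in sequence")
        return [seq]
    if strategy == "ignore":
        return [re.sub(r"[^ACGT]", "", seq)]  # drop offending characters
    if strategy == "split":
        return [frag for frag in re.split(r"[^ACGT]+", seq) if frag]
    raise ValueError("unknown strategy: %r" % strategy)
```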

betatim avatar Jan 27 '17 16:01 betatim

yes, I think we're heading in this direction.

ctb avatar Jan 27 '17 16:01 ctb

If methods that accept single k-mers assume that you know what you are doing, then you can implement any non-standard processing in Python and pass in the k-mers.

Does that work for you @standage ?

betatim avatar Jan 27 '17 16:01 betatim

👍

standage avatar Jan 27 '17 18:01 standage

Three things left before we can close this issue:

  • #1590 should be merged;
  • behavior laid out in #1590 should be documented;
  • write up this comment as an issue.

ctb avatar Mar 20 '17 14:03 ctb

Seems that the way sourmash handled it is correct.

https://github.com/dib-lab/sourmash/issues/137

swamidass avatar Nov 02 '20 04:11 swamidass