cutadapt
expect column in overview table does not take account of errors
Hi,
I'm currently working on visualising the output of Cutadapt in MultiQC, which basically entails using the summary table info to render a histogram. There is an existing module for this in MultiQC but I'm looking to improve it and thus trying to get a deep understanding of what the table tells me. (note - I've submitted a somewhat related PR #239 regarding the max.err/error counts columns.)
The documentation says that "The 'expect' column gives only a rough estimate of the number of sequences that is expected to match randomly". But for the default setting '-e 0.1' most of the numbers generated are an order of magnitude too low, because the estimate does not take allowed mismatches into account. This is easily verified by setting '-e 0.0' and confirming that the 'expect' numbers stay the same. If you supply a random sequence as your primer, you'll see that most of the hits involve one or more errors/mismatches. The 'expect' column roughly corresponds to the first value in the 'error counts' columns, not to the sum total.
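To make that concrete, here is a rough back-of-the-envelope calculation (my own sketch, assuming independent, uniformly distributed bases and counting substitutions only, ignoring indels):

```python
from math import comb

def expected_random_matches(n_reads, length, max_subs):
    """Expected number of reads (out of n_reads) in which a random stretch of
    the given length matches a fixed sequence with at most max_subs
    substitutions. Rough model: each base matches with probability 0.25."""
    p = sum(comb(length, e) * 0.75 ** e * 0.25 ** (length - e)
            for e in range(max_subs + 1))
    return n_reads * p

n = 1_000_000
# Exact matches vs. matches allowing one substitution, for a 13 bp sequence:
print(expected_random_matches(n, 13, 0))  # ~0.015
print(expected_random_matches(n, 13, 1))  # ~0.60, about 40x higher
```

So allowing just one mismatch already inflates the expectation by well over an order of magnitude for a 13 bp adapter.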
Relatedly, if you allow for one error then this can be an insertion or deletion, not just a substitution. If you supply a 13 bp adapter and the aligner matches it to the last 9 bases of the read with one deletion, the match gets recorded in the '9' row of the table, even though the substring of the adapter that gave rise to the match was its first ten bases. Likewise, the same substring might match 11 bases if there is an insertion near the end of the read.
So to calculate realistic 'expect' numbers you'd not only have to account for matches with substitutions, but also deal with the fact that the hit is recorded against the target (read) length rather than the query length or the alignment length, i.e. add the probability of a match with a deletion to the previous row and that of a match with an insertion to the next row of the table.
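Roughly, something like this (purely my own sketch with crude approximations for the indel terms, not anything cutadapt actually computes):

```python
def p_exact(l):
    # probability that a random l-base stretch matches an l-base adapter prefix
    return 0.25 ** l

def p_one_substitution(l):
    return l * 0.75 * 0.25 ** (l - 1)

def p_one_deletion(l):
    # l-base adapter prefix aligned to l-1 read bases: ~l choices of the
    # deleted adapter base, the remaining l-1 bases must match
    return l * 0.25 ** (l - 1)

def p_one_insertion(l):
    # l-base adapter prefix aligned to l+1 read bases: ~(l+1) positions for
    # the extra read base, all l adapter bases must match
    return (l + 1) * 0.25 ** l

def adjusted_expect(n_reads, row, adapter_length, max_err):
    """Rough 'expect' for a table row keyed by read-side match length."""
    l = min(row, adapter_length)
    p = p_exact(l)
    if max_err >= 1:
        p += p_one_substitution(l)
        if l + 1 <= adapter_length:
            # an (l+1)-base prefix match with a deletion lands in this row
            p += p_one_deletion(l + 1)
        if l - 1 >= 1:
            # an (l-1)-base prefix match with an insertion lands in this row
            p += p_one_insertion(l - 1)
    return n_reads * p
```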
The reason I care about the last bit is that I'm plotting count/expected in order to spot any unexpected peaks, and this produces a spurious peak wherever the max.err value changes between two rows. Now that I understand what's happening I'll just ignore the top of the summary table, but I wanted to report what I found. Even if this can't be fixed, I think it should be documented.
Cheers,
TIM
Your observation is correct: the "expect" column takes neither mismatches nor indels into account. The formula for the value in the row describing length l is just n * 0.25 ** l, where n is the number of reads. (To be precise, l is also clipped to the adapter length, so it is actually n * 0.25 ** min(l, adapterlength).) This is exactly why I chose the words "rough estimate" :-).
I have also not looked at this for very large datasets. When there are only a million reads, the numbers become zero very quickly anyway, even before you get into the part of the table where the number of allowed errors is 1 or more.
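In code, the whole column boils down to something like this (a sketch; the internal variable names differ):

```python
def expect(n_reads, length, adapter_length):
    # expected number of reads matching by chance at this length
    return n_reads * 0.25 ** min(length, adapter_length)

n = 1_000_000
for l in range(5, 14):
    print(l, expect(n, l, adapter_length=13))
# With a million reads the value falls below 1 at length 10 already
# (1e6 * 0.25 ** 10 is about 0.95).
```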
As you saw, the table right now combines two slightly different meanings of "length": adapter match length vs read match length (well, and even the read match length is a bit weird because it is actually the length of the removed sequence and thus includes trailing bases if there are any). I have been wondering for a while whether I should change it such that the table is split into two parts: One would give only the histogram of the length of removed sequences. The other would give the matched adapter lengths together with the expect, max.err and error counts values. That other table would just have as many rows as the adapter is long.
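Very roughly, the split I have in mind would look like this (just a sketch of the idea, names invented):

```python
from dataclasses import dataclass, field

@dataclass
class AdapterLengthRow:
    adapter_match_length: int          # 1 .. adapter length
    expect: float                      # n * 0.25 ** adapter_match_length
    max_err: int                       # allowed errors at this length
    error_counts: list[int] = field(default_factory=list)

@dataclass
class AdapterReport:
    # histogram: length of removed sequence -> number of reads
    removed_length_histogram: dict[int, int]
    # one row per possible adapter match length
    adapter_match_table: list[AdapterLengthRow]
```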
Taking mismatches and indels into account is certainly possible, but I think this would be a lower priority for me at the moment.