htslib
htslib copied to clipboard
Does probaln_glocal() use the wrong emission probability for N?
While looking into samtools/samtools#712 I've discovered an input to probal_glocal() that makes it return a negative value. Unfortunately, the return value from probaln_glocal doesn't seem to be documented anywhere, but I'm assuming it's supposed to be >= 0 as the comment here suggests that it's a likelihood, and the code that uses the value looks like it would break if given something that's negative.
The input that gives the negative result is:
_ref="GCTTTTCATAGTTCTGATTACATGCACATTCTAAAAAATAGGCTATTACAGGATCTCAACCAGTTCATCATTAACAAGTCCCAGTTGCTTTGCTTCGCTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN"
_query="GCTTTTCATAGTTCTGATTACATGCACATTCTAAAAAATAGGCTATTACAGGATCTCAACCAGTTCATCATTAACAAGTCCCAGTTGCTTTGCTTCGCTG"
c={ 1e-4, 1e-2, 10 }
bw=119
My suspicion is that the emission probability for ref N, set to 1.0 at lines 116 and 130, is too high. This causes the best alignment of the query to be against all the Ns and not the sequence at the beginning of the reference. As an N reference base could emit any of A,C,G or T would 0.25 be a better score to use for this case? It certainly seems to make the function better behaved when trying to vary the number of N bases present at the end.
cc @lh3