PairHMM
Possible misalignment in PairHMM evaluation
I was studying your implementation as a reference, sticking to the SCALAR version (I'm not confident with SSE and AVX), but I encountered a possible inconsistency in the evaluation of the results, which may be present in the other implementations as well. I could not verify it myself, since I don't have the reference results.
Before applying the PairHMM algorithm to a read/haplotype pair, the haplotype is padded and reversed. Let MRL denote the maximum (original) read length in our batch of data. The "padded haplotype" consists of:
- exactly MRL + 1 leading zeroes
- the reversed haplotype
- exactly MRL trailing zeroes.
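As a sketch, the padding above could be written as follows (the helper name and the char-vector representation are my own assumptions, not your code):

```cpp
#include <string>
#include <vector>

// Hypothetical helper mirroring the described padding:
// MRL + 1 leading zeroes, the reversed haplotype, then MRL trailing zeroes.
std::vector<char> pad_haplotype(const std::string& hap, std::size_t mrl) {
    std::vector<char> padded(mrl + 1, '\0');               // MRL + 1 leading zeroes
    padded.insert(padded.end(), hap.rbegin(), hap.rend()); // reversed haplotype
    padded.insert(padded.end(), mrl, '\0');                // MRL trailing zeroes
    return padded;
}
```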
When executing the algorithm on a read of length MRL, you start by comparing the first character of the haplotype with the first character of the read, which seems very reasonable. Another reasonable point is that exactly "read_length + haplotype_length - 1" diagonals have to be calculated FOR ANY READ.

Since the haplotype is padded with respect to MRL, when evaluating PairHMM on shorter reads you initialize the diagonal index "d" to "MRL - RL", where RL is the read's actual original length. Namely:

"for (auto d = fd; d != mrl + hl - 1; ++d) { ... }"

However, the effect should be the following. Given that you scan the input read in sequence, namely:

"for (auto r = 1u; r != rows; ++r) { ... }"

and that the haplotype is indexed also through d, namely:

"const auto hap_base = haplotype.bases[hap_offset+r-d];"

you end up comparing the first character of a shorter read with a character of the haplotype which is not the first (it should instead be the "MRL - RL"th). Since you certainly evaluate the right number of diagonals, you compute some of them outside the boundaries of the haplotype/scoring matrix. This (possibly) yields wrong results.
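To make the shift concrete, here is a toy calculation under assumed numbers (MRL = 10, RL = 6, so the loop starts at d = MRL - RL = 4; "hap_index" is a hypothetical helper mirroring the indexing expression quoted above, not your code):

```cpp
// Hypothetical helper mirroring haplotype.bases[hap_offset + r - d]:
int hap_index(int hap_offset, int r, int d) { return hap_offset + r - d; }
```

For the first read row r = 1, a full-length read (first diagonal d = 0) reads index hap_offset + 1, while a shorter read starting at d = 4 reads index hap_offset - 3: the pairing is shifted by exactly MRL - RL.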
I propose the following approach: "for (auto d = 0; d != rl + hl - 1; ++d) { ... }", in order to 1) align the read and haplotype at the first character and 2) evaluate the right number of diagonals. With this change I obviously get different results for some of the samples, while most of them coincide.
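As a sanity check on the proposed bounds, here is a small self-contained sketch (names and matrix layout are my own, not your code) verifying that sweeping d = 0 .. rl + hl - 2 with column c = d - (r - 1) on each diagonal visits every cell of an rl x hl scoring matrix exactly once:

```cpp
#include <vector>

// Toy coverage check for the proposed diagonal bounds (not the library's code):
// does the sweep touch every cell of an rl x hl matrix exactly once?
bool diagonals_cover_matrix(int rl, int hl) {
    std::vector<int> visits(rl * hl, 0);
    for (int d = 0; d != rl + hl - 1; ++d)  // proposed bounds: rl + hl - 1 diagonals
        for (int r = 1; r <= rl; ++r) {
            const int c = d - (r - 1);      // haplotype column on this diagonal
            if (c >= 0 && c < hl)           // skip the out-of-matrix corners
                ++visits[(r - 1) * hl + c];
        }
    for (int v : visits)
        if (v != 1) return false;           // a cell was missed or revisited
    return true;
}
```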
Since I don't know whether I'm missing something from the algorithm, I just thought I would ask you.