AlignQC
Homopolymer Errors
Hi Jason,
Would you mind clarifying how the program identifies homopolymer indels? For example, how many bases are required on either side of the indel for it to be classified as one? Many thanks!
Jean-Michel
Hi Jean-Michel,
AlignQC identifies homopolymer indels from an alignment string in a format such as
AAT-CCCGGTTC - Query
AATTC-CGG--C - Reference
It does this by decomposing the alignment into homopolymer blocks:
AA T- CCC GG TT C - Query
AA TT C-C GG -- C - Reference
Now, any homopolymer block where the base is called in both query and reference but the counts differ is considered a homopolymer error. We can see a single-base homopolymer deletion in the first T block (one T in the query, two in the reference), and a single-base homopolymer insertion in the first C block (three Cs in the query, two in the reference). Note, however, that the second T block in the query consists of two inserted Ts, but this would not be called a homopolymer insertion because it has no equivalent in the reference.
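To make the block rule concrete, here is a minimal Python sketch of the decomposition described above. It is an illustration only, not the actual AlignQC code: the function names `column_base`, `homopolymer_blocks` and `classify` are made up for this example, and treating mismatch columns as blocks that are never classified is an assumption.

```python
from itertools import groupby

GAP = "-"

def column_base(column):
    """Return the base an alignment column belongs to, or None for a mismatch."""
    q, r = column
    if q == GAP:
        return r          # only the reference has a base here
    if r == GAP or q == r:
        return q          # only the query has a base, or both agree
    return None           # mismatch column (assumption: never classified)

def homopolymer_blocks(query, reference):
    """Split a gapped pairwise alignment into (base, query_count, ref_count) blocks."""
    if len(query) != len(reference):
        raise ValueError("aligned strings must have the same length")
    blocks = []
    for base, cols in groupby(zip(query, reference), key=column_base):
        cols = list(cols)
        q_count = sum(q != GAP for q, _ in cols)
        r_count = sum(r != GAP for _, r in cols)
        blocks.append((base, q_count, r_count))
    return blocks

def classify(block):
    """Label a block as a homopolymer 'insertion' or 'deletion', or None."""
    base, q_count, r_count = block
    # The base must be present on both sides, and the counts must differ.
    if base is None or q_count == 0 or r_count == 0 or q_count == r_count:
        return None
    return "insertion" if q_count > r_count else "deletion"

query     = "AAT-CCCGGTTC"
reference = "AATTC-CGG--C"
for block in homopolymer_blocks(query, reference):
    print(block, classify(block))
```

On the example alignment, only the first T block (1 vs 2) and the first C block (3 vs 2) are labelled; the TT block with no reference counterpart is left unclassified.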
Another way to think about what I am calling a homopolymer indel is that if both query and reference were homopolymer compressed:
ATCGTC - Query
ATCG-C - Reference
any homopolymer indels would no longer show up in the alignment, since each of those blocks has a 1:1 correspondence between the query and the reference.
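Homopolymer compression itself can be sketched in a couple of lines of Python. Again, this is only an illustration under assumptions: `homopolymer_compress` is a hypothetical helper, and it simply strips the gaps and collapses each run of identical bases, so the compressed sequences would still have to be re-aligned to get the alignment shown above.

```python
from itertools import groupby

def homopolymer_compress(aligned, gap="-"):
    """Drop gap characters, then collapse each run of identical bases to one base."""
    ungapped = aligned.replace(gap, "")
    return "".join(base for base, _ in groupby(ungapped))

print(homopolymer_compress("AAT-CCCGGTTC"))  # query
print(homopolymer_compress("AATTC-CGG--C"))  # reference
```

The query compresses to ATCGTC and the reference to ATCGC, so the only remaining difference is the T that is present in the query but absent from the reference.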
I hope that's helpful. Thanks!
Jason
Excellent, that's very helpful & informative. Thanks Jason!