Homopolymer Errors

Open JeanMiCarter opened this issue 8 years ago • 2 comments

Hi Jason,

Would you mind clarifying how the program identifies homopolymer indels? For example, how many bases are required on either side of the indel for it to be classified as one? Many thanks!

Jean-Michel

JeanMiCarter, Mar 23 '17 16:03

Hi Jean-Michel,

AlignQC identifies homopolymer indels from an alignment string in a format such as:

AAT-CCCGGTTC - Query
AATTC-CGG--C - Reference

It does so by decomposing the alignment into homopolymer blocks:

AA T- CCC GG TT C - Query
AA TT C-C GG -- C - Reference

Now, a homopolymer block in which the base is called in both query and reference but the counts differ is considered a homopolymer error. Here we can see a single homopolymer deletion in the first T block (one T in the query, two in the reference) and a single homopolymer insertion in the first C block (three C in the query, two in the reference). Note, however, that the second T block in the query consists of two inserted T bases, but this would not be called a homopolymer insertion because it has no equivalent in the reference.
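To make the rule above concrete, here is a minimal Python sketch of the block decomposition and classification. It is not AlignQC's actual code: the function names (`column_base`, `homopolymer_blocks`, `classify`) and the category labels are hypothetical, and the handling of mismatch columns (each kept as its own block) is an assumption, since the example above contains no mismatches.

```python
from itertools import groupby

GAP = "-"

def column_base(q, r):
    """Return the single base present in an alignment column, or None on a mismatch."""
    bases = {c for c in (q, r) if c != GAP}
    return bases.pop() if len(bases) == 1 else None

def homopolymer_blocks(query, reference):
    """Split two equal-length aligned strings into runs of columns sharing one base."""
    if len(query) != len(reference):
        raise ValueError("aligned strings must have equal length")
    blocks = []
    for base, run in groupby(zip(query, reference), key=lambda col: column_base(*col)):
        run = list(run)
        if base is None:
            # Assumption: mismatch columns are never merged into a block.
            blocks.extend((None, q, r) for q, r in run)
        else:
            blocks.append((base,
                           "".join(q for q, _ in run),
                           "".join(r for _, r in run)))
    return blocks

def classify(block):
    """Label one block by comparing base counts in the query and the reference."""
    base, q, r = block
    if base is None:
        return "mismatch"
    q_count = len(q) - q.count(GAP)
    r_count = len(r) - r.count(GAP)
    if q_count == r_count:
        return "match"
    if q_count == 0 or r_count == 0:
        # The base is missing entirely on one side, so there is no counterpart block.
        return "non-homopolymer indel"
    return "homopolymer insertion" if q_count > r_count else "homopolymer deletion"

if __name__ == "__main__":
    for block in homopolymer_blocks("AAT-CCCGGTTC", "AATTC-CGG--C"):
        print(block, classify(block))
```

Run on the example alignment, this reports the first T block as a homopolymer deletion, the first C block as a homopolymer insertion, and the TT/-- block as an ordinary indel rather than a homopolymer error.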

Another way to think of what I am calling a homopolymer indel is that if both query and reference were homopolymer compressed:

ATCGTC - Query
ATCG-C - Reference

any homopolymer indels would no longer show up in the alignment, since those blocks have a 1:1 correspondence between the query and the reference.
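The compression itself can be sketched in a few lines of Python. This is only an illustration (the `compress` helper is a hypothetical name, not part of AlignQC): it drops gap characters and collapses each run of identical bases to a single base.

```python
from itertools import groupby

def compress(seq, gap="-"):
    """Remove gaps, then collapse every run of identical bases to one base."""
    ungapped = seq.replace(gap, "")
    return "".join(base for base, _ in groupby(ungapped))

if __name__ == "__main__":
    print(compress("AAT-CCCGGTTC"))  # ATCGTC
    print(compress("AATTC-CGG--C"))  # ATCGC
```

Aligning the two compressed strings leaves only the extra T in the query, which is the one indel in the example that is not a homopolymer indel.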

I hope that's helpful. Thanks!

Jason

jason-weirather, Mar 23 '17 16:03

Excellent, that's very helpful & informative. Thanks Jason!

JeanMiCarter, Mar 23 '17 17:03