cdhit
cdhit copied to clipboard
Alignment identity calculation
I have a question in regrad to how the squence indentity been calculated.
Here are the two sequences with identical amino acids except a 42 aa inseretion in the longer version. There two sequence identity are cacultaed as 95% with BLASTP
while using cdhit with this setup as "cd-hit-v4.8.1-2019-0228/cd-hit -i OS_test.txt -o /home/yuan/Desktop/out_OS_test -c 0.8 -n 5" the identity is 85.11%.
0 915aa, >XP_015639005.1... *
1 873aa, >XP_025881246.1... at 85.11%
Based on the descritpion, using the default global alignment, the identity is calculated as: "number of identical amino acids in alignment divided by the full length of the shorter sequence". In this case would be 873/873 = 100%, but obviously , I understood wrongly. Can anyone help to clartiy how did the alignment and identity been calculated? Thank you so much!
I'm also getting a really weird result for two proteins with >90% identity (see sequences.txt for fasta). Each protein has multiple copies of slightly varied chains and I think it trips up the alignment algorithm.
I played around with a debug build of cd-hit
in gdb and managed to print out some alignments (see alignments.txt). The first block is with the cd-hit
algorithm running normally; it shows an identity of ~47% and fails to print an alignment. The second block was generated using the full alphabet and with modifications to mat.gap = -10
and mat.gap_ext = -1
; it shows an identity of ~52% and does print an alignment. The alignment has the proteins way off center from each other even though they are very similar. I think this is due to the repeated nature of the protein and the way the alignment algorithm works, but I could be wrong.