cdhit icon indicating copy to clipboard operation
cdhit copied to clipboard

Alignment identity calculation

Open Yuanyuancricle opened this issue 4 years ago • 1 comments

I have a question in regrad to how the squence indentity been calculated. Here are the two sequences with identical amino acids except a 42 aa inseretion in the longer version. There two sequence identity are cacultaed as 95% with BLASTP image while using cdhit with this setup as "cd-hit-v4.8.1-2019-0228/cd-hit -i OS_test.txt -o /home/yuan/Desktop/out_OS_test -c 0.8 -n 5" the identity is 85.11%. 0 915aa, >XP_015639005.1... * 1 873aa, >XP_025881246.1... at 85.11%

Based on the descritpion, using the default global alignment, the identity is calculated as: "number of identical amino acids in alignment divided by the full length of the shorter sequence". In this case would be 873/873 = 100%, but obviously , I understood wrongly. Can anyone help to clartiy how did the alignment and identity been calculated? Thank you so much!

Yuanyuancricle avatar Aug 16 '20 12:08 Yuanyuancricle

I'm also getting a really weird result for two proteins with >90% identity (see sequences.txt for fasta). Each protein has multiple copies of slightly varied chains and I think it trips up the alignment algorithm.

I played around with a debug build of cd-hit in gdb and managed to print out some alignments (see alignments.txt). The first block is with the cd-hit algorithm running normally; it shows an identity of ~47% and fails to print an alignment. The second block was generated using the full alphabet and with modifications to mat.gap = -10 and mat.gap_ext = -1; it shows an identity of ~52% and does print an alignment. The alignment has the proteins way off center from each other even though they are very similar. I think this is due to the repeated nature of the protein and the way the alignment algorithm works, but I could be wrong.

sequences.txt alignments.txt

spenceforce avatar Sep 25 '20 23:09 spenceforce