Roary icon indicating copy to clipboard operation
Roary copied to clipboard

Considerations on the Blastp hits percentage identity figure in Rplots.pdf

Open Sebdu132 opened this issue 5 years ago • 3 comments

I find that the "Number of blastp hits with different percentage identity" presented in p.5 of the output file Rplots.pdf could be a simple useful tool to estimate the blastp threshold to use to consider genes as othologous, or just to check our dataset

For instance, I tried to first run roary with a subsample of 5-10 genomes and the option -i 60 to get all-to-all blastp hit range from 60% to 100%. We expect that we will have a peak at the similarity value corresponding to the average aa distance between genomes (100% for very closely related strains, more around 97-98% at the species level, and less than 95% at the genus level). It could also show whether the selected genomes belong to 2 or more species : several peak should appear corresponding to intra- and inter-species comparisons.

However, the signal is blurred by the comparisons of CDS against themselves, which leads to a huge number of 100% identity compressing the other values. I modified the Roary/lib/Bio/Roary/Output/BlastIdentityFrequency.pm script in order to remove these spurious 100% id values by replacing 'awk '{print $3}' by 'awk '{if($1 != $2){print $3}}' on line 40. I think this modification could be included in the next release, or at least an explanation for roary users that the 100% id matches actually include self-CDS comparisons.

Next, something else appear in this graph (with -i at 60%): the number of blastp hits remains very low until ~90-92% id, then it increases +/- exponentially until 97%, then it drops for 98 and 99%, then it increases again for 100%. This pattern happens for various sets of species from the same genus, and even between species.

I don't have a clear explanation for such drop after 97% id whatever the genomes compared, but I suspect some algorithm (blastp?) artifact. Does anyone have a clue about it ? And does this may impact roary clustering in some way ?

I'm sorry for this long post, but I thought it could be iinteresting to share.

Sebastien.

Sebdu132 avatar Mar 22 '19 16:03 Sebdu132

After going into the algorithm in more details, the drop of 98 and 99% blastp hit is completely normal, because the iterative cd-hit step combines proteins having these similarities into one representative sequence. I just don't understand in this case why the number increases again for 100% identity, since the first round of cd-hit should have clustered them too. Any explanation?

Sebdu132 avatar Mar 26 '19 17:03 Sebdu132

Thank you Sebdu132 for this very interesting explanation. I studied the same plot wondering about the high number of 100% matches. I will implement this in my code to compare the results. Unfortunately the suggestion to replace by 'awk '{print $3}' by 'awk '{if($1 != $2){print $3}}' does not work but produces an error of the roary run.

mweberr avatar Mar 23 '20 13:03 mweberr

Regarding this topic... From what I understood from the description of the method, it seems that roary does not uses the reciprocal best hit approach but instead uses a blastP identity cutoff of 95% default. I am wondering if this approach is accurate enough as no other orthologues clustering programs seem to group homologues genes based on blastP identity? I used roary to compare two closely related species and as expected, the clustering was highly dependent on this blastP cutoff. With 90% cut-off, I still had a couple of clusters with the same gene name, because the sequence divergence must have been to high. Something like that would not have happened with the reciprocal best hit? In addition, the authors do not advice to lower the blastP identity below 90% but perhaps this is still ok when you are working with two different (closely) related species?

mdiricks avatar Oct 27 '20 14:10 mdiricks