Roary grouping nearby dissimilar genes into the same group

Open cwbcm opened this issue 4 years ago • 1 comments

Hi, I am trying to build the pan-genome for ~200 environmental E. coli isolates. MG1655 was included in the data set as a reference. I found that Roary tends to cluster nearby genes into the same group. This is probably due to its use of synteny information. The problem is that these genes are often very dissimilar to each other.

You can see here in the gene_presence_absence.csv file that, for MG1655 (U00096.3), JKMANJED_00365 and JKMANJED_00366 are both placed in the tauC group. [Screenshot from 2019-11-22 15-02-20]
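To see how widespread this is, one option is to scan gene_presence_absence.csv for cells where a single isolate has more than one gene ID in the same group. The sketch below is only a rough diagnostic. It assumes that the first 14 columns are per-group metadata and that multiple gene IDs in one cell are separated by whitespace. Both are assumptions about the file layout, and the function name `multi_gene_cells` is hypothetical, so adjust `n_meta` if your file differs.

```python
import csv


def multi_gene_cells(handle, n_meta=14):
    """Return (group, isolate, [gene_ids]) for every cell holding >1 gene ID.

    Assumption: the first `n_meta` columns are group metadata and the
    remaining columns are one per isolate.
    """
    reader = csv.reader(handle)
    header = next(reader)
    isolates = header[n_meta:]
    hits = []
    for row in reader:
        group = row[0]
        for isolate, cell in zip(isolates, row[n_meta:]):
            # Assumption: several gene IDs in one cell are whitespace-separated.
            genes = cell.split()
            if len(genes) > 1:
                hits.append((group, isolate, genes))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python script.py gene_presence_absence.csv
    if len(sys.argv) > 1:
        with open(sys.argv[1], newline="") as fh:
            for group, isolate, genes in multi_gene_cells(fh):
                print(group, isolate, ",".join(genes), sep="\t")
```

Note that a group with two genes from one isolate can also be a genuine paralog pair, so each hit still needs to be checked against the annotation.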

But the GFF file from Prokka shows that JKMANJED_00366 is annotated as tauD. [Screenshot from 2019-11-22 15-22-43]

Their sequences also align poorly with each other. The minimum percentage identity for blastp was set to 90% for Roary, so it should have been able to split these two entries. [Screenshot from 2019-11-22 15-08-43]
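For anyone who wants to compare such a pair against the 90% threshold by hand, here is a minimal sketch that computes percent identity from two already-aligned sequences. It counts identical residues over all alignment columns (gap columns included), which is one common convention and may differ slightly from what blastp reports. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences have the same residue.

    Both inputs must be rows of the same pairwise alignment (equal length).
    Columns where both rows are gaps are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


# Toy example (not real tauC/tauD sequences): 4 matches over 5 columns.
print(percent_identity("MKT-A", "MKTLA"))  # 80.0
```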

In addition, the tauC/tauD pair is clearly separated in most other strains (only two of them are shown here). [Screenshot from 2019-11-22 15-57-39]

The tauD genes in MG1655 (JKMANJED_00366) and ERR3062275 (DIFMKJME_03605) align very well, with 96.8% identity. [Screenshot from 2019-11-22 15-56-45]

Interestingly, when I use only 10 samples + MG1655, the software does separate tauD and tauC. Is there any way to fix this issue, or any idea what might cause it? I would really appreciate the help.

cwbcm, Nov 22 '19 22:11

I have run into the same problem T_T. Is it possible to find out what causes this?

actledge, May 30 '22 02:05