carveme
carveme copied to clipboard
CarveMe ignores highly similar genes during the building of GPRs
Using carveme v1.5.2.
When the input proteome contains 2 or more very similar sequences, each of them well aligned (bitscore > 100) by Diamond on the same BiGG_gene (model.gene), only 1 of them is retained for subsequent processing. This in my opinion is due to the following line contained in carveme/reconstruction/scoring.py :
gene2gene = annotation.query('score > 100') \
.sort_values(by='score', ascending=False) \
.groupby('BiGG_gene', as_index=False).apply(lambda x: x.iloc[0])
The x.iloc[0] selects just 1 of the input proteins per each BiGG_gene. Suppose the Diamond output for 2 of my input proteins is the following:
| query | target | bitscore | evalue | identity | ppos | qcovhsp | scovhsp |
|---|---|---|---|---|---|---|---|
| gene_26222 | iCN900.CD630_31350 | 288.0 | 2.620000e-97 | 47.7 | 70.7 | 96.0 | 98.3 |
| gene_27280 | iCN900.CD630_31350 | 283.0 | 1.240000e-95 | 47.9 | 70.5 | 99.7 | 98.3 |
gene_26222 will be retained for subsequent processing, while gene_27280 will be discarded (bitscore 288.0 wins over 283.0). The consequence is that, when the GPR of the underlying reaction is constructed , only gene_26222 will appear. In my opinion, it would be best to replace ‘gene_26222' in the GPR with '(gene_26222 or gene_27280)' .