carveme
carveme copied to clipboard
CarveMe doesn’t satisfy the original protein complex definitions during the building of GPRs
Using carveme v1.5.2.
My opinion reading your code is that there is no real check whether a particular protein complex, as it is defined in a particular model for a particular reaction, is really satisfied or not. This in my opinion can lead to wrong GPRs.
I report here below your code contained in carveme/reconstruction/scoring.py , converting the table 'gene_scores' in 'protein_scores', and finally converting 'protein_scores' in 'reaction_scores'. Then I try to give a practical example.
From gene to protein scores:
def merge_subunits(genes):
genes = genes.dropna()
if len(genes) == 0:
return None
else:
protein = ' and '.join(sorted(genes))
if len(genes) > 1:
return '(' + protein + ')'
else:
return protein
def merge_subunit_scores(scores):
return scores.fillna(0).mean()
protein_scores = gene_scores.groupby(['protein', 'reaction', 'model'], as_index=False) \
.agg({'query_gene': merge_subunits, 'score': merge_subunit_scores})
protein_scores.rename(columns={'query_gene': 'GPR'}, inplace=True)
From protein to reaction scores:
def merge_proteins(proteins):
proteins = set(proteins.dropna())
if not proteins:
return None
gpr_str = ' or '.join(sorted(proteins))
if len(proteins) > 1:
return '(' + gpr_str + ')'
else:
return gpr_str
def merge_protein_scores(scores):
return scores.max(skipna=True)
reaction_scores = protein_scores.groupby(['reaction'], as_index=False) \
.agg({'GPR': merge_proteins, 'score': merge_protein_scores}).dropna()
avg_score = reaction_scores.query('score > 0')['score'].median()
reaction_scores['normalized_score'] = (reaction_scores['score'] / avg_score).apply(lambda x: round(x, 1))
Suppose now this is my 'gene_scores' table:
| query_gene | BiGG_gene | score | gene | protein | reaction | model |
|---|---|---|---|---|---|---|
| NaN | STM_v1_0.STM0970 | NaN | G_STM0970 | P_STM0970+STM0973 | R_PFL | STM_v1_0 |
| gene_7500 | STM_v1_0.STM0843 | 784.0 | G_STM0843 | P_STM0843+STM0844 | R_PFL | STM_v1_0 |
| gene_18818 | STM_v1_0.STM0973 | 864.0 | G_STM0973 | P_STM0970+STM0973 | R_PFL | STM_v1_0 |
| NaN | STM_v1_0.STM0970 | NaN | G_STM0970 | P_STM0970+STM3241 | R_PFL | STM_v1_0 |
| gene_18818 | STM_v1_0.STM3241 | 846.0 | G_STM3241 | P_STM0970+STM3241 | R_PFL | STM_v1_0 |
| NaN | STM_v1_0.STM4114 | NaN | G_STM4114 | P_STM4114+STM4115 | R_PFL | STM_v1_0 |
| NaN | STM_v1_0.STM4115 | NaN | G_STM4115 | P_STM4114+STM4115 | R_PFL | STM_v1_0 |
| gene_13306 | STM_v1_0.STM0844 | 156.0 | G_STM0844 | P_STM0843+STM0844 | R_PFL | STM_v1_0 |
Applying the above code would result in this 'protein_score' table:
| protein | reaction | model | GPR | score |
|---|---|---|---|---|
| P_STM0843+STM0844 | R_PFL | STM_v1_0 | (gene_13306 and gene_7500) | 470.0 |
| P_STM0970+STM0973 | R_PFL | STM_v1_0 | gene_18818 | 432.0 |
| P_STM0970+STM3241 | R_PFL | STM_v1_0 | gene_18818 | 423.0 |
| P_STM4114+STM4115 | R_PFL | STM_v1_0 | None | 0.0 |
And finally in this 'reaction_score' table:
| reaction | GPR | score | normalized_score |
|---|---|---|---|
| R_PFL | ((gene_13306 and gene_7500) or gene_18818) | 470.0 | 1.0 |
As you can read , the final GPR is ((gene_13306 and gene_7500) or gene_18818).
The member “(gene_13306 and gene_7500)” is correct, as the both the components of the complex P_STM0843+STM0844 were well matched by Diamond.
Instead, the member “gene_18818” should not appear in the final GPR in my opinion, because neither the complex P_STM0970+STM0973 nor the complex P_STM0970+STM3241 were satisfied, since Diamond didn’t really catch the gene STM_v1_0.STM0970 (score = Nan).
This simplified example pretend to work with a 'gene_scores' table having just 1 model with 1 reaction, anyway I think it is enough to highlight the issue. With a real-life 'gene_scores' table, there is the chance that other models containing the same reaction can make the final GPR actually correct, balancing the errors. But it is just a chance...
Maybe it could be useful to have a dedicated option, like carve --strictgpr, if the user wants the original protein complex definitions to be strictly satisfied.