assign and use EC codes from `model.eccodes`

Open edkerk opened this issue 1 year ago • 3 comments

Generally, EC codes are reaction-specific and can therefore be defined for each reaction. GECKO1/2 uses EC numbers to parse BRENDA, but these are currently extracted from a UniProt file.

Instead, it would be ideal to use an eccodes field that is part of the model, for the following reasons:

  • Transparency about which EC code is used to find kcat values from BRENDA.
  • Allow manual curation, in case the specified EC code is incorrect.

model.eccodes already contains EC numbers, but it might be preferable to have a separate model.ec.eccodes field, so as not to interfere too much with the original GEM (a reaction might, for instance, be annotated with multiple EC codes, which is not good for GECKO, see below), and to have model.ec contain all information that is essential for the enzyme-constraint extension. However, this also (potentially) duplicates information, so perhaps using model.eccodes is the best solution anyway.

Many models have no EC numbers annotated. There can be multiple ways to gather such information:

  • Based on reaction annotation. If a reaction has, e.g., a KEGG or MetaCyc identifier, these databases can be parsed to obtain the relevant EC code.
  • Based on protein annotations. In GECKO1/2 this is done using getEnzymeCodes. This could be somewhat modified to parse UniProt and/or KEGG to extract protein-specific EC codes.

~One consideration is single EC codes should be defined for each reaction. E.g. UniProt can include multiple EC codes per enzyme, which can be explained by:~

  • ~It is an isozyme, able to catalyze the conversion of multiple different substrates, having multiple different cofactors, or even catalyzing completely different reactions. But each reaction should only have one EC code.~
  • ~One of the EC codes is less specific than the others. E.g. 1.1.1.2 is an alcohol dehydrogenase, while 1.1.1.21 is an alditol dehydrogenase (and alditol is an alcohol), while 1.1.1.14 is an iditol dehydrogenase (and iditol is an alditol). If the specific reaction involves iditol it should only be assigned to 1.1.1.14, but if the enzyme can also catalyze the dehydrogenation of other alditols, they should be annotated with 1.1.1.21 instead.~

So, main points:

  • [ ] Use model.eccodes ~(and curate it to contain single EC numbers per reaction)~, or have a separate model.ec.eccodes field?
  • [ ] Have an addECcodes function that can parse EC numbers from different input (or have separate functions), including UniProt. Possibly repurposing getEnzymeCodes.
  • [ ] ~Have a check included to make sure that there is only a single EC code per reaction. This is only relevant for finding kcat values from BRENDA, it is not required for e.g. DLKcat, so it should not necessarily always be enforced.~

Edit: a single EC code per reaction is likely not preferred, as having multiple (with decreasing substrate specificity) can help to e.g. match alternative kcat values in BRENDA.
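The idea of trying EC codes in order of decreasing substrate specificity could be sketched roughly as follows. This is an illustrative Python sketch, not GECKO code: the function name `first_matching_kcat`, the dictionary standing in for BRENDA data, and the kcat values are all hypothetical, and the ordering of the codes is assumed to be supplied by the modeller.

```python
def first_matching_kcat(ec_codes, kcat_table):
    """Return (ec_code, kcat) for the first EC code that has an entry, else None.

    ec_codes is assumed to be ordered from most to least substrate-specific.
    """
    for code in ec_codes:
        if code in kcat_table:
            return code, kcat_table[code]
    return None


# Hypothetical stand-in for kcat values parsed from BRENDA (made-up numbers).
kcat_table = {"1.1.1.21": 30.5, "1.1.1.2": 8.0}

# Most specific code first; it has no entry, so the next one is used.
ordered_codes = ["1.1.1.14", "1.1.1.21", "1.1.1.2"]
match = first_matching_kcat(ordered_codes, kcat_table)
```

With a single EC code per reaction, a missing entry for 1.1.1.14 would leave the reaction without any kcat, which is the case the edit above is concerned with.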

edkerk, Sep 27 '22 13:09

In GECKO 1/2, I remember that the maximum kcat across all EC codes of a reaction is selected for that reaction. Is it possible to just follow this in GECKO 3?

Yu-sysbio, Sep 29 '22 10:09

In my view, the addition and curation of EC codes is under total control of the modellers. Therefore, I would keep GECKO clear of interfering with the EC codes, especially thinking of the cases when models already come with their own. It would be very confusing for the modeller to have one definition of the EC codes, and for GECKO to completely sidestep that. In the case of none or multiple EC codes, it might be simpler just to report these as problems to the user, something along the lines of "GECKO 3 reuses the EC code assigned to a reaction. The following reactions have been ignored, because they have either none or multiple EC codes:". And then, when going the DLKcat way, I guess this warning will not need to be shown.
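A rough sketch of such a report is given below, in Python for illustration only (it is not GECKO code). It assumes that EC codes are stored as one string per reaction, with multiple codes separated by `;` and an empty string for unannotated reactions; the function names and reaction identifiers are hypothetical.

```python
def split_by_ec_count(rxns, eccodes):
    """Separate reactions with exactly one EC code from all others."""
    usable, ignored = {}, []
    for rxn, raw in zip(rxns, eccodes):
        # Assumed format: codes separated by ';', empty string if unannotated.
        codes = [c.strip() for c in raw.split(";") if c.strip()]
        if len(codes) == 1:
            usable[rxn] = codes[0]
        else:
            ignored.append(rxn)
    return usable, ignored


def format_warning(ignored):
    """Build the warning text; return an empty string if nothing was ignored."""
    if not ignored:
        return ""
    header = ("GECKO 3 reuses the EC code assigned to a reaction. "
              "The following reactions have been ignored, because they "
              "have either none or multiple EC codes:")
    return "\n".join([header] + [f"  {rxn}" for rxn in ignored])


# Hypothetical example data.
rxns = ["r_0001", "r_0002", "r_0003"]
eccodes = ["1.1.1.14", "", "1.1.1.2;1.1.1.21"]
usable, ignored = split_by_ec_count(rxns, eccodes)
print(format_warning(ignored))
```

The model's own EC annotations are only read here, never modified, which keeps curation in the hands of the modeller.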

@Yu-sysbio I thought the decision was to support a single kcat per reaction, so the check for the maximum value would not be needed.

mihai-sysbio, Sep 29 '22 13:09

Selecting the maximum kcat from multiple EC codes is possible, as the kcatList that is used by selectKcatValue can contain kcat values from multiple EC codes (at the moment it does not have an EC-code field, perhaps it should).

Note that the single-kcat-per-reaction rule applies once the kcat is included in the model (model.ec.kcat). One can still suggest multiple kcats per reaction (in the kcatList that selectKcatValue uses to populate model.ec.kcat), as e.g. DLKcat also gives multiple kcat values per reaction (one for each substrate).
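As a minimal illustration of collapsing several candidate kcats into one value per reaction, the Python sketch below keeps the maximum, following the GECKO 1/2 behaviour recalled earlier in this thread. It is not the actual selectKcatValue implementation, whose selection criteria are not spelled out here; the record layout, the `eccode` key, the function name and all numbers are assumptions.

```python
def select_max_kcat(kcat_list):
    """Keep, for each reaction, the candidate entry with the highest kcat."""
    best = {}
    for entry in kcat_list:
        rxn = entry["rxn"]
        if rxn not in best or entry["kcat"] > best[rxn]["kcat"]:
            best[rxn] = entry
    return best


# Hypothetical candidate list: several kcats may be suggested per reaction.
candidates = [
    {"rxn": "r_0001", "eccode": "1.1.1.14", "kcat": 12.0},
    {"rxn": "r_0001", "eccode": "1.1.1.21", "kcat": 30.5},
    {"rxn": "r_0002", "eccode": "2.7.1.1", "kcat": 4.2},
]
selected = select_max_kcat(candidates)
```

Storing the EC code alongside each candidate, as in the hypothetical `eccode` key, would make it traceable which code the selected kcat came from.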

@mihai-sysbio Your suggestion seems to match having model.eccodes as the only source of EC codes (no duplication in model.ec.eccodes). These are not relevant for DLKcat, so we can have a quick check whether model.eccodes exists and contains any values when the modeller tries to query BRENDA for kcat values (= GECKO1/2 approach). We can have getEnzymeCodes as a suggested way to populate these EC numbers, but it should be up to the modeller to curate them.

edkerk, Sep 29 '22 13:09

This is only required for the GECKO1&2 legacy fuzzy kcat matching. We'll just continue using this approach, refactored for the GECKO3 model format, as implemented in #188.

edkerk, Dec 21 '22 21:12