makeEcModel: model-specific kcat data in dedicated structure

Open edkerk opened this issue 3 years ago • 5 comments

Description of the new feature:

Instead of having kcat values only directly integrated as stoichiometric coefficients, keep a separate data structure with model-specific kcat values.

Each field in the model.kcat structure has the same length. Each entry corresponds to one kcat value, which is assigned to a unique combination of reaction, metabolite and enzyme.

The purpose of this structure is that it can be populated from different sources (e.g. DLKcat, BRENDA/SABIO-RK, manual curation) and changed at any time, while these values can then be "applied" to the ec-model to update the kcat values that are used.

Take particular notice:

  1. If a reaction has multiple co-substrates, there can be multiple entries, each with a different .metNames-associated kcat value [a later function might then choose the lowest, highest or mean value to use as kcat for that reaction in the ec-model].

  2. If a reaction has multiple genes associated (whether via AND or OR relationships), they can all have a separate entry. Downstream functions will again choose which of the provided kcat values will be used to populate the model.
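
As a rough illustration of the "later function" mentioned in the two points above, the following sketch collapses several kcat entries per reaction into a single value. It is written in Python rather than GECKO's MATLAB, and the name `choose_kcat`, the reaction IDs and the numbers are all made up for the example; only the idea of picking the lowest, highest or mean value comes from the description.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flat structure: one entry per kcat value, all fields equal length.
kcat_entries = {
    "rxns": ["r_0001", "r_0001", "r_0002"],
    "metNames": ["ATP", "glucose", "pyruvate"],
    "kcat": [12.0, 48.0, 5.0],
}


def choose_kcat(entries, strategy=min):
    """Group kcat values by reaction ID and reduce each group with `strategy`.

    `strategy` is any function that maps a list of floats to one float,
    for example min, max or statistics.mean.
    """
    grouped = defaultdict(list)
    for rxn, kcat in zip(entries["rxns"], entries["kcat"]):
        grouped[rxn].append(kcat)
    return {rxn: strategy(values) for rxn, values in grouped.items()}


lowest = choose_kcat(kcat_entries, min)
highest = choose_kcat(kcat_entries, max)
average = choose_kcat(kcat_entries, mean)
```

Keeping the selection rule as a parameter means the stored entries never have to change when a different rule is preferred.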

| Field | Type | Description |
| --- | --- | --- |
| model.kcat.rxns | string | Reaction IDs gathered directly from model |
| model.kcat.eccodes | string | Gathered directly from model, can be further populated via e.g. Uniprot/KEGG? |
| model.kcat.metNames | string | Gathered directly from model, each substrate separately. If reaction is in reverse (.reverse=true), then product is shown |
| model.kcat.metSmiles | string | Parsed, DLKcat has code for this? |
| model.kcat.gene | string | Gathered directly from model, consider gene association (and/or rules)? |
| model.kcat.uniprot | string | Can be parsed via Uniprot dump with GECKO code? Perhaps not necessary |
| model.kcat.mw | numeric | Parsed from Uniprot/KEGG? Otherwise, predict from protein FASTA? |
| model.kcat.reverse | logical | Gathered directly from the model, reverse reactions will have double entries, of which half have reverse=true |
| model.kcat.kcat | float | Gathered from various sources, see later. One value for each rxns-substrate pair. |
| model.kcat.source | numeric | 0 = manual; 1 = brenda (or similar); 2 = kmax; 3 = DLKcat |
| model.kcat.notes | string | Some additional information? Could also be multiple fields. For instance, some information on how well the fuzzy-matching went (was it species specific, substrate specific, etc.?) |
| model.kcat.sequence | string | Amino acid sequence. To be used as input for DLKcat, probably good to have anyway. |

  • [ ] Write initiateKcatStructure that prepares the above structure from an existing (non-ec) model structure
  • [ ] Other parts need to be gathered from user-provided files (similar to the current GECKO requirement to gather Uniprot data). At the moment, do not attempt to automate this; clear instructions for manual download will suffice, as these data will only be gathered once for a model. Note that DLKcat can also gather some of this.
  • [ ] Move existing fields that GECKO currently adds to the model structure to the model.kcat structure (MW, uniprot, etc.)
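
To make the first task more concrete, here is a minimal sketch of what preparing such a structure could look like. It is not the proposed initiateKcatStructure itself: it is Python rather than MATLAB, the toy model layout (a list of reaction dicts) is an assumption made for the example, and only a handful of the fields listed above are filled. The kcat and source fields are left empty, since populating them is covered by a separate issue.

```python
def initiate_kcat_structure(reactions):
    """Build equal-length field lists, one entry per reaction/metabolite/gene.

    `reactions` is a hypothetical list of dicts with the keys
    'id', 'substrates', 'products', 'reversible' and 'genes'.
    """
    fields = {"rxns": [], "metNames": [], "gene": [],
              "reverse": [], "kcat": [], "source": []}
    for rxn in reactions:
        # The forward direction lists substrates; a reversible reaction gets
        # extra entries with reverse=True that list its products instead.
        directions = [(False, rxn["substrates"])]
        if rxn["reversible"]:
            directions.append((True, rxn["products"]))
        for reverse, metabolites in directions:
            for met in metabolites:
                for gene in rxn["genes"]:
                    fields["rxns"].append(rxn["id"])
                    fields["metNames"].append(met)
                    fields["gene"].append(gene)
                    fields["reverse"].append(reverse)
                    fields["kcat"].append(None)    # filled in later
                    fields["source"].append(None)  # filled in later
    return fields


# Toy model: one reversible reaction and one irreversible reaction.
toy_reactions = [
    {"id": "r_0001", "substrates": ["A", "B"], "products": ["C", "D"],
     "reversible": True, "genes": ["g1"]},
    {"id": "r_0002", "substrates": ["C"], "products": ["E"],
     "reversible": False, "genes": ["g2", "g3"]},
]
kcat_fields = initiate_kcat_structure(toy_reactions)
```

In this toy case the reversible reaction yields four entries (two of them with reverse set), and the irreversible reaction yields one entry per associated gene.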

This new structure should also provide all necessary information to be able to run DLKcat. Maybe more DLKcat code can be repurposed?

How to populate this structure with kcat values is described in a separate Issue, this Issue is only about constructing the fields in the model.

I hereby confirm that I have:

  • [ ] Done this analysis in the master branch of the repository
  • [ ] Checked that a similar issue does not exist already

edkerk avatar May 25 '22 20:05 edkerk

This is now generated by makeEcModel in branch feat/makeEcModel (feedback on that function itself is welcome here: #161).

The structure has changed a bit compared with what was proposed above. It is now called model.ec instead of model.kcat.

Some fields are still "one for each kcat value", so multiple substrates, isoenzymes and kcat sources mean multiple entries for the same reaction. However, to avoid unnecessary duplication, some other fields only have one entry each (protein and metabolite information). New scheme:

The following fields have one entry per kcat value. So if a reaction is reversible, or has multiple substrates, multiple enzymes, kcats from multiple sources etc., each of the fields will have another entry.

| Field | Type | Description |
| --- | --- | --- |
| model.ec.rxns | string | Reaction IDs gathered directly from model (after expansion and making irreversible). |
| model.ec.enzyme | string | Uniprot IDs by matching to genes in model.grRules |
| model.ec.subunits | float | Number of subunits that make up whole enzyme complex, 1 by default. |
| model.ec.substrate | string | Metabolite name of the substrate, as gathered from model.metNames. |
| model.ec.reverse | logical | Gathered directly from the model, reverse reactions will have double entries, of which half have reverse=true. Might be pointless, as the reaction IDs also indicate reversibility (_REV). |
| model.ec.kcat | float | Gathered from various sources, see later. One value for each rxn-substrate-enzyme (subunit) combination. |
| model.ec.source | string | Where kcat value is derived from, e.g. 'manual', 'dlkcat', 'kcatdb'. Previously suggested to be numeric, which might actually be a better solution. |
| model.ec.notes | string | Whatever notes the user wants to add |
| model.ec.eccodes | string | Gathered directly from model, can be further populated via e.g. Uniprot/KEGG? Only used for classical GECKO ec-matching. |

The following fields are non-redundant and have only unique entries.

| Field | Type | Description |
| --- | --- | --- |
| model.ec.gene | string | Directly from model.genes, to indicate which uniprot IDs belong to which gene (see next field). |
| model.ec.uniprot | string | Uniprot IDs for each gene. |
| model.ec.mw | string | Molecular weight for each protein. |
| model.ec.sequence | string | Amino acid sequence for each protein. |
| model.ec.concs | float | Measured concentration of each protein. |
| model.ec.metNames | string | Directly from model.metNames, to indicate which metSmiles belong to which metabolite (see next field). |
| model.ec.metSmiles | string | SMILES for each metabolite. |
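
Splitting the scheme into per-kcat fields and unique fields means that protein data has to be looked up by ID rather than read from the same row. A minimal Python sketch of that lookup, assuming the entries in model.ec.enzyme match entries in model.ec.uniprot exactly (the IDs and weights below are placeholders, and mw is treated as numeric for the example):

```python
# Hypothetical model.ec content: the first three fields have one entry per
# kcat value, the last two have one entry per protein.
ec = {
    "rxns": ["r_0001", "r_0001_REV", "r_0002"],
    "enzyme": ["ENZ_A", "ENZ_A", "ENZ_B"],
    "kcat": [10.0, 2.5, 100.0],
    "uniprot": ["ENZ_A", "ENZ_B"],
    "mw": [50000.0, 32000.0],
}


def mw_per_entry(ec):
    """Return the molecular weight that belongs to each per-kcat entry."""
    # Map each unique protein ID to its position in the non-redundant fields.
    position = {uid: i for i, uid in enumerate(ec["uniprot"])}
    return [ec["mw"][position[uid]] for uid in ec["enzyme"]]


entry_mw = mw_per_entry(ec)
```

The protein weight is stored once, however many kcat entries refer to that protein.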

edkerk avatar Jun 16 '22 10:06 edkerk

@edkerk fyi the ecModels container can run on any branch in this repository. Let me know if/when you'd like to run some tests (generate ecModels for the 5 organisms).

mihai-sysbio avatar Jun 16 '22 19:06 mihai-sysbio

This is now implemented with #166, but of course the scheme might be modified a bit during the ongoing refactoring.

edkerk avatar Jun 28 '22 09:06 edkerk

FYI, after our discussion, this scheme will be modified, to end up with one model.ec.kcat entry for each complex (which can have multiple subunits associated).

edkerk avatar Jul 01 '22 07:07 edkerk

Simplified scheme introduced with #167

  • One entry per reaction
    • Each entry contains 1 kcat value that is integrated in the model (when applyKcatConstraints is run)
    • No duplicate entries with alternative substrates, kcat sources or subunits of a complex.
  • rxnEnzMat sparse matrix indicates which enzymes are present for each reaction, with non-zero values indicating the copy number if it forms a complex.

| Field | Type | Description |
| --- | --- | --- |
| model.ec.rxns | string | Reaction IDs gathered directly from model (after expansion and making irreversible). |
| model.ec.rxnEnzMat | sparse matrix | Somewhat comparable to rxnGeneMat, but Enz refers to model.ec.enzymes entries, and this matrix can have any positive integer to indicate the number of subunits in a complex. |
| model.ec.kcat | float | Gathered from various sources, see later. One value for each rxn-substrate-enzyme (subunit) combination. |
| model.ec.source | string | Where kcat value is derived from, e.g. 'manual', 'dlkcat', 'kcatdb'. Previously suggested to be numeric, which might actually be a better solution. |
| model.ec.notes | string | Whatever notes the user wants to add |
| model.ec.eccodes | string | Gathered directly from model, can be further populated via e.g. Uniprot/KEGG? Only used for classical GECKO ec-matching. |

The following fields are non-redundant and have only unique entries. Metabolite information is left out in comparison with earlier suggestion.

| Field | Type | Description |
| --- | --- | --- |
| model.ec.genes | string | Directly from model.genes, to indicate which uniprot IDs belong to which gene (see next field). |
| model.ec.enzymes | string | Uniprot IDs for each gene. |
| model.ec.mw | string | Molecular weight for each protein. |
| model.ec.sequence | string | Amino acid sequence for each protein. |
| model.ec.concs | float | Measured concentration of each protein. |
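
To show how rxnEnzMat ties the per-reaction and per-enzyme fields together, the sketch below sums subunit weights into one mass per reaction. It is an illustration only: it uses Python with a dict of (reaction index, enzyme index) pairs as a stand-in for the sparse matrix, the IDs and numbers are invented, mw is treated as numeric, and the function `complex_mass` is hypothetical rather than part of GECKO.

```python
# Hypothetical model.ec content. rxnEnzMat maps (row, column) to the subunit
# copy number; absent pairs are zero, as in a sparse matrix.
ec = {
    "rxns": ["r_0001", "r_0002"],
    "enzymes": ["ENZ_A", "ENZ_B", "ENZ_C"],
    "mw": [40000.0, 25000.0, 60000.0],
    "rxnEnzMat": {(0, 0): 2, (0, 1): 1, (1, 2): 1},
}


def complex_mass(ec):
    """Sum copy number times molecular weight over each reaction's subunits."""
    mass = [0.0] * len(ec["rxns"])
    for (rxn_idx, enz_idx), copies in ec["rxnEnzMat"].items():
        mass[rxn_idx] += copies * ec["mw"][enz_idx]
    return dict(zip(ec["rxns"], mass))


masses = complex_mass(ec)
```

Here r_0001 is catalysed by a complex of two ENZ_A subunits and one ENZ_B subunit, while r_0002 uses a single ENZ_C, so each reaction still needs only one row and one kcat value.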

edkerk avatar Jul 04 '22 22:07 edkerk

This is now fully functional.

edkerk avatar Feb 14 '23 22:02 edkerk