GECKO makeEcModel: model-specific kcat data in dedicated structure

Description of the new feature:

Instead of having k_cat values only directly integrated as stoichiometric coefficients, keep a separate data structure with model-specific k_cat values.

All fields in the model.kcat structure have the same length. Each entry corresponds to one k_cat value, assigned to a unique combination of reaction, metabolite and enzyme.

The purpose of this structure is that it can be populated from different sources (e.g. DLKcat, BRENDA/SABIO-RK, manual curation) and changed at any time. These values can then be "applied" to the ec-model to update the k_cat values that it uses.

Take particular notice:

- If a reaction has multiple co-substrates, there can be multiple entries, each with a different .metNames-associated k_cat value (a later function might then choose the lowest, highest or mean value to use as k_cat for that reaction in the ec-model).
- If a reaction has multiple associated genes (whether via AND or OR relationships), each gene can have a separate entry. Downstream functions will again choose which of the provided k_cat values is used to populate the model.
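
As a rough illustration of the "later function" mentioned above, the following sketch collapses several entries per reaction into a single k_cat value. It is written in Python purely for illustration (GECKO itself is a MATLAB toolbox); the dictionary of parallel lists stands in for the proposed `model.kcat` structure, and the function name `select_kcat`, the reaction IDs and the numeric values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical parallel arrays mirroring a few of the proposed model.kcat fields.
kcat_struct = {
    "rxns":     ["r_0001", "r_0001", "r_0002"],
    "metNames": ["ATP", "D-glucose", "pyruvate"],
    "gene":     ["G1", "G1", "G2"],
    "kcat":     [12.0, 30.0, 5.0],
}

def select_kcat(struct, how="min"):
    """Collapse all entries of a reaction into one k_cat value (assumed helper)."""
    pickers = {"min": min, "max": max, "mean": mean}
    if how not in pickers:
        raise ValueError(f"unknown selection rule: {how}")
    grouped = defaultdict(list)
    for rxn, value in zip(struct["rxns"], struct["kcat"]):
        grouped[rxn].append(value)
    # One value per reaction, chosen by the requested rule.
    return {rxn: pickers[how](values) for rxn, values in grouped.items()}
```

Calling `select_kcat(kcat_struct, "max")` would, under these assumptions, pick the highest value among the co-substrate entries of each reaction.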

| Field | Type | Description |
| --- | --- | --- |
| `model.kcat.rxns` | string | Reaction IDs gathered directly from model |
| `model.kcat.eccodes` | string | Gathered directly from model, can be further populated via e.g. Uniprot/KEGG? |
| `model.kcat.metNames` | string | Gathered directly from model, each substrate separately. If the reaction is in reverse (.reverse=true), the product is shown instead |
| `model.kcat.metSmiles` | string | Parsed, DLKcat has code for this? |
| `model.kcat.gene` | string | Gathered directly from model, consider gene association (and/or rules)? |
| `model.kcat.uniprot` | string | Can be parsed via Uniprot dump with GECKO code? Perhaps not necessary |
| `model.kcat.mw` | numeric | Parsed from Uniprot/KEGG? Otherwise, predict from protein FASTA? |
| `model.kcat.reverse` | logical | Gathered directly from the model. Reversible reactions have double entries, half of which have reverse=true |
| `model.kcat.kcat` | float | Gathered from various sources, see later. One value for each reaction-substrate pair. |
| `model.kcat.source` | numeric | 0 = manual; 1 = BRENDA (or similar); 2 = kmax; 3 = DLKcat |
| `model.kcat.notes` | string | Some additional information? Could also be multiple fields. For instance, some information on how well the fuzzy matching went (was it species specific, substrate specific, etc.?) |
| `model.kcat.sequence` | string | Amino acid sequence. To be used as input for DLKcat, probably good to have anyway. |

- [ ] Write initiateKcatStructure, which prepares the above structure from an existing (non-ec) model structure
- [ ] Other parts need to be gathered from user-provided files (similar to the current GECKO requirement to gather Uniprot data). At the moment, do not attempt to automate this; clear instructions for manual download will suffice, as these data will only be gathered once for a model. Note that DLKcat can also gather some of this.
- [ ] Move the existing fields that GECKO currently adds to the model structure (MW, uniprot, etc.) to the model.kcat structure
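
To make the first task more concrete, here is a minimal sketch of what initiateKcatStructure might do, written in Python for illustration only (the real function would be MATLAB and operate on the actual model structure). The toy model layout, the snake_case function name and the placeholder values are assumptions, not part of GECKO. The sketch creates one entry per reaction-metabolite-gene combination and doubles the entries for reversible reactions, listing the product when `reverse` is true.

```python
# Hypothetical, simplified stand-in for a non-ec model; field names are illustrative.
toy_model = {
    "rxns": [
        {"id": "r_0001", "substrates": ["ATP", "D-glucose"],
         "products": ["ADP"], "genes": ["G1", "G2"], "reversible": True},
        {"id": "r_0002", "substrates": ["pyruvate"],
         "products": ["acetaldehyde"], "genes": ["G3"], "reversible": False},
    ]
}

def initiate_kcat_structure(model):
    """Build equal-length parallel lists, one entry per future k_cat value."""
    fields = ("rxns", "metNames", "gene", "reverse", "kcat", "source", "notes")
    kcat = {name: [] for name in fields}
    for rxn in model["rxns"]:
        directions = (False, True) if rxn["reversible"] else (False,)
        for reverse in directions:
            # In the reverse direction the product takes the substrate's place.
            mets = rxn["products"] if reverse else rxn["substrates"]
            for met in mets:
                for gene in rxn["genes"]:
                    kcat["rxns"].append(rxn["id"])
                    kcat["metNames"].append(met)
                    kcat["gene"].append(gene)
                    kcat["reverse"].append(reverse)
                    kcat["kcat"].append(None)    # not yet populated
                    kcat["source"].append(None)  # filled in when a k_cat is assigned
                    kcat["notes"].append("")
    return kcat

structure = initiate_kcat_structure(toy_model)
```

Fields such as `mw`, `uniprot` and `sequence` are left out here because, as noted in the second task, they would come from user-provided files.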

This new structure should also provide all necessary information to be able to run DLKcat. Maybe more DLKcat code can be repurposed?

How to populate this structure with k_cat values is described in a separate Issue; this Issue is only about constructing the fields in the model.

I hereby confirm that I have:

- [ ] Done this analysis in the master branch of the repository
- [ ] Checked that a similar issue does not exist already

May 25 '22 20:05 edkerk

This is now generated by makeEcModel in branch feat/makeEcModel (feedback on that function itself is welcome here: #161).

The structure changed a bit from what was proposed above. It is now called model.ec, instead of model.kcat.

Some fields are still "one for each kcat value", so multiple substrates, isoenzymes and kcat sources mean multiple entries for the same reaction. However, to avoid unnecessary duplication, some other fields only have one entry each (protein and metabolite information). New scheme:

The following fields have one entry per kcat value. So if a reaction is reversible, has multiple substrates, multiple enzymes, kcats from multiple sources etc., each of these fields gets an additional entry.

| Field | Type | Description |
| --- | --- | --- |
| `model.ec.rxns` | string | Reaction IDs gathered directly from model (after expansion and making irreversible). |
| `model.ec.enzyme` | string | Uniprot IDs, obtained by matching to genes in `model.grRules` |
| `model.ec.subunits` | float | Number of subunits that make up the whole enzyme complex, `1` by default. |
| `model.ec.substrate` | string | Metabolite name of the substrate, as gathered from `model.metNames`. |
| `model.ec.reverse` | logical | Gathered directly from the model. Reversible reactions have double entries, half of which have reverse=true. Might be pointless, as the reaction IDs also indicate reversibility (`_REV`). |
| `model.ec.kcat` | float | Gathered from various sources, see later. One value for each rxn-substrate-enzyme (subunit) combination. |
| `model.ec.source` | string | Where the kcat value is derived from, e.g. 'manual', 'dlkcat', 'kcatdb'. Previously suggested to be numeric; might actually be a better solution. |
| `model.ec.notes` | string | Whatever notes the user wants to add |
| `model.ec.eccodes` | string | Gathered directly from model, can be further populated via e.g. Uniprot/KEGG? Only used for classical GECKO ec-matching. |

The following fields are non-redundant and have only unique entries.

| Field | Type | Description |
| --- | --- | --- |
| `model.ec.gene` | string | Directly from `model.genes`, to indicate which Uniprot IDs belong to which gene (see next field). |
| `model.ec.uniprot` | string | Uniprot IDs for each gene. |
| `model.ec.mw` | numeric | Molecular weight for each protein. |
| `model.ec.sequence` | string | Amino acid sequence for each protein. |
| `model.ec.concs` | float | Measured concentration of each protein. |
| `model.ec.metNames` | string | Directly from `model.metNames`, to indicate which metSmiles belong to which metabolite (see next field). |
| `model.ec.metSmiles` | string | SMILES for each metabolite. |
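
The split between per-kcat fields and non-redundant fields means that protein properties are looked up rather than repeated. The sketch below shows that lookup in Python, for illustration only: the dictionary stands in for `model.ec`, and the helper name `mw_for_entries`, the Uniprot IDs and all values are hypothetical.

```python
# Hypothetical toy data: three kcat entries that share two proteins.
ec = {
    # One entry per kcat value.
    "rxns":    ["r_0001", "r_0001_REV", "r_0002"],
    "enzyme":  ["P00001", "P00001", "P00002"],
    "kcat":    [120.0, 45.0, 8.5],
    # Non-redundant protein information, one entry per protein.
    "uniprot": ["P00001", "P00002"],
    "mw":      [50000.0, 32000.0],
}

def mw_for_entries(ec):
    """Return the molecular weight matching each per-kcat entry (assumed helper)."""
    mw_by_id = dict(zip(ec["uniprot"], ec["mw"]))
    return [mw_by_id[enzyme] for enzyme in ec["enzyme"]]
```

Here the molecular weight of `P00001` is stored once, even though the protein appears in two kcat entries.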

Jun 16 '22 10:06 edkerk

@edkerk fyi the ecModels container can run on any branch in this repository. Let me know if/when you'd like to run some tests (generate ecModels for the 5 organisms).

Jun 16 '22 19:06 mihai-sysbio

This is now implemented with #166, but of course the scheme might be modified a bit during the ongoing refactoring.

Jun 28 '22 09:06 edkerk

FYI, after our discussion, this scheme will be modified, to end up with one model.ec.kcat entry for each complex (which can have multiple subunits associated).

Jul 01 '22 07:07 edkerk

Simplified scheme introduced with #167

One entry per reaction:
- Each entry contains one kcat value, which is integrated in the model when applyKcatConstraints is run.
- No duplicate entries with alternative substrates, kcat sources or subunits of a complex.
- The rxnEnzMat sparse matrix indicates which enzymes are present for each reaction, with non-zero values indicating the copy number of an enzyme that is part of a complex.

| Field | Type | Description |
| --- | --- | --- |
| `model.ec.rxns` | string | Reaction IDs gathered directly from model (after expansion and making irreversible). |
| `model.ec.rxnEnzMat` | sparse matrix | Somewhat comparable to rxnGeneMat, but Enz refers to `model.ec.enzymes` entries, and this matrix can hold any positive integer to indicate the number of subunits in a complex. |
| `model.ec.kcat` | float | Gathered from various sources, see later. One value per reaction entry. |
| `model.ec.source` | string | Where the kcat value is derived from, e.g. 'manual', 'dlkcat', 'kcatdb'. Previously suggested to be numeric; might actually be a better solution. |
| `model.ec.notes` | string | Whatever notes the user wants to add |
| `model.ec.eccodes` | string | Gathered directly from model, can be further populated via e.g. Uniprot/KEGG? Only used for classical GECKO ec-matching. |

The following fields are non-redundant and have only unique entries. Metabolite information is left out compared with the earlier suggestion.

| Field | Type | Description |
| --- | --- | --- |
| `model.ec.genes` | string | Directly from `model.genes`, to indicate which Uniprot IDs belong to which gene (see next field). |
| `model.ec.enzymes` | string | Uniprot IDs for each gene. |
| `model.ec.mw` | numeric | Molecular weight for each protein. |
| `model.ec.sequence` | string | Amino acid sequence for each protein. |
| `model.ec.concs` | float | Measured concentration of each protein. |
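
To illustrate how rxnEnzMat ties reactions to enzymes in this simplified scheme, the sketch below stores the sparse matrix as a Python dictionary keyed by (reaction index, enzyme index), with the subunit copy number as the value. It is an illustration only: the real structure is a MATLAB sparse matrix, and the helper names, IDs and numbers are hypothetical. Summing copy number times molecular weight is shown as one plausible use of the matrix, not as a description of what GECKO computes.

```python
# Hypothetical toy data mirroring the simplified model.ec fields.
ec = {
    "rxns":    ["r_0001", "r_0002"],
    "kcat":    [120.0, 8.5],  # one value per reaction entry
    "enzymes": ["P00001", "P00002", "P00003"],
    "mw":      [50000.0, 32000.0, 18000.0],
    # Sparse rxnEnzMat as {(reaction index, enzyme index): copy number}.
    "rxnEnzMat": {(0, 0): 1, (1, 1): 2, (1, 2): 1},
}

def complex_composition(ec, rxn_id):
    """Map each enzyme of a reaction to its copy number in the complex."""
    i = ec["rxns"].index(rxn_id)
    return {ec["enzymes"][j]: n
            for (r, j), n in ec["rxnEnzMat"].items() if r == i}

def complex_mw(ec, rxn_id):
    """Sum of copy number times molecular weight over all subunits."""
    mw_by_enzyme = dict(zip(ec["enzymes"], ec["mw"]))
    return sum(n * mw_by_enzyme[enzyme]
               for enzyme, n in complex_composition(ec, rxn_id).items())
```

In this toy example, `r_0002` is catalysed by a complex of two copies of `P00002` and one copy of `P00003`, while `r_0001` uses a single enzyme.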

Jul 04 '22 22:07 edkerk

This is now fully functional.

Feb 14 '23 22:02 edkerk
