GECKO
makeEcModel: model-specific kcat data in dedicated structure
Description of the new feature:
Instead of having kcat values only directly integrated as stoichiometric coefficients, keep a separate data structure with model-specific kcat values.
Each field in the model.kcat structure has the same length, where each entry corresponds to a unique kcat value, that can be assigned to a unique combination of reaction, metabolite and enzyme.
The purpose of this structure is that it can be populated from different sources (e.g. DLKcat, BRENDA/SABIO-RK, manual curation) and changed at any time, while these values can then be "applied" to the ec-model to update the kcat values that are used.
Take particular notice:
- If a reaction has multiple co-substrates, there can be multiple entries, each with a different `.metNames`-associated kcat value [a later function might then choose the lowest, highest or mean value to use as kcat for that reaction in the ec-model].
- If a reaction has multiple genes associated (whether via AND or OR relationships), they can all have a separate entry. Downstream functions will again choose which of the provided kcat values will be used to populate the model.
| Field | Type | Description |
|---|---|---|
| `model.kcat.rxns` | string | Reaction IDs gathered directly from model |
| `model.kcat.eccodes` | string | Gathered directly from model, can be further populated via e.g. Uniprot/KEGG? |
| `model.kcat.metNames` | string | Gathered directly from model, each substrate separately. If reaction is in reverse (`.reverse=true`), then product is shown |
| `model.kcat.metSmiles` | string | Parsed, DLKcat has code for this? |
| `model.kcat.gene` | string | Gathered directly from model, consider gene association (and/or rules)? |
| `model.kcat.uniprot` | string | Can be parsed via Uniprot dump with GECKO code? Perhaps not necessary |
| `model.kcat.mw` | numeric | Parsed from Uniprot/KEGG? Otherwise, predict from protein FASTA? |
| `model.kcat.reverse` | logical | Gathered directly from the model, reverse reactions will have double entries, of which half of them have reverse=true |
| `model.kcat.kcat` | float | Gathered from various sources, see later. One value for each rxns-substrate pair. |
| `model.kcat.source` | numeric | 0 = manual; 1 = brenda (or similar); 2 = kmax; 3 = DLKcat |
| `model.kcat.notes` | string | Some additional information? Could also be multiple fields. For instance some information on how well the fuzzy-matching went (was it species specific, substrate specific, etc?) |
| `model.kcat.sequence` | string | Amino acid sequence. To be used as input for DLKcat, probably good to have anyway. |
- [ ] Write `initiateKcatStructure` that prepares the above structure from an existing (non-ec) model structure
- [ ] Other parts need to be gathered from user-provided files (similar to current GECKO requirement to gather Uniprot data). At the moment do not attempt to automate this, clear instructions for manual download will suffice, as these data will only be gathered once for a model. Note that DLKcat can also gather some of this.
- [ ] Move existing fields that GECKO currently adds to the `model` structure, to the `model.kcat` structure (MW, uniprot, etc.)
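As a rough illustration of the first task, the following Python sketch builds one empty entry per reaction-substrate pair and doubles the entries for reversible reactions. It is not the proposed MATLAB `initiateKcatStructure`; the function name `initiate_kcat_structure` and the toy model layout (a list of dicts with `substrates`, `products` and `reversible` keys) are assumptions made only for this example.

```python
def initiate_kcat_structure(model):
    """Build one entry per reaction-substrate pair. Reversible reactions get
    a second set of entries with reverse=True that list the products."""
    out = {"rxns": [], "metNames": [], "reverse": [], "kcat": [], "source": []}

    def add(rxn_id, met, reverse):
        out["rxns"].append(rxn_id)
        out["metNames"].append(met)
        out["reverse"].append(reverse)
        out["kcat"].append(None)    # to be filled later (DLKcat, BRENDA, manual)
        out["source"].append(None)

    for rxn in model["rxns"]:
        for met in rxn["substrates"]:
            add(rxn["id"], met, False)
        if rxn["reversible"]:
            for met in rxn["products"]:
                add(rxn["id"], met, True)
    return out

# Hypothetical toy model, not a real GECKO/RAVEN model structure.
toy_model = {"rxns": [
    {"id": "r_0001", "substrates": ["A", "B"], "products": ["C", "D"], "reversible": True},
    {"id": "r_0002", "substrates": ["C"], "products": ["E"], "reversible": False},
]}
kcat_struct = initiate_kcat_structure(toy_model)
```

Every list in the result has the same length, so index `i` always refers to the same reaction, metabolite and direction.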
This new structure should also provide all necessary information to be able to run DLKcat. Maybe more DLKcat code can be repurposed?
How to populate this structure with kcat values is described in a separate Issue; this Issue is only about constructing the fields in the model.
I hereby confirm that I have:
- [ ] Done this analysis in the `master` branch of the repository
- [ ] Checked that a similar issue does not exist already
This is now generated by `makeEcModel` in branch `feat/makeEcModel` (feedback on that function itself is welcome here: #161).
The structure changed a bit compared to what was proposed above. It is now called `model.ec`, instead of `model.kcat`.
Some fields are still "one for each kcat value", so multiple substrates, isoenzymes and kcat sources mean multiple entries for the same reaction. However, to avoid unnecessary duplication, some other fields only have one entry each (protein and metabolite information). New scheme:
The following fields have one entry per kcat value. So if a reaction is reversible, has multiple substrates, multiple enzymes, kcats from multiple sources etc., each of the fields will have another entry.
| Field | Type | Description |
|---|---|---|
| `model.ec.rxns` | string | Reaction IDs gathered directly from model (after expansion and making irreversible). |
| `model.ec.enzyme` | string | UniprotIDs by matching to genes in model.grRules |
| `model.ec.subunits` | float | Number of subunits that make up whole enzyme complex, 1 by default. |
| `model.ec.substrate` | string | Metabolite name of the substrate, as gathered from model.metNames. |
| `model.ec.reverse` | logical | Gathered directly from the model, reverse reactions will have double entries, of which half of them have reverse=true. Might be pointless, as the reaction IDs also indicate reversibility (_REV). |
| `model.ec.kcat` | float | Gathered from various sources, see later. One value for each rxn-substrate-enzyme (subunit) combination. |
| `model.ec.source` | string | Where kcat value is derived from, e.g. 'manual', 'dlkcat', 'kcatdb'. Previously suggested to be numeric, might actually be a better solution. |
| `model.ec.notes` | string | Whatever notes the user wants to add |
| `model.ec.eccodes` | string | Gathered directly from model, can be further populated via e.g. Uniprot/KEGG? Only used for classical GECKO ec-matching. |
The following fields are non-redundant and have only unique entries.
| Field | Type | Description |
|---|---|---|
| `model.ec.gene` | string | Directly from model.genes, to indicate which uniprot IDs belong to which gene (see next field). |
| `model.ec.uniprot` | string | Uniprot IDs for each gene. |
| `model.ec.mw` | string | Molecular weight for each protein. |
| `model.ec.sequence` | string | Amino acid sequence for each protein. |
| `model.ec.concs` | float | Measured concentration of each protein. |
| `model.ec.metNames` | string | Directly from model.metNames, to indicate which metSmiles belong to which metabolite (see next field). |
| `model.ec.metSmiles` | string | SMILES for each metabolite. |
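To show how the per-kcat fields and the non-redundant protein fields relate, here is a small Python sketch (an analogue only, since GECKO is MATLAB). Protein data is stored once per protein and looked up through the Uniprot ID held in the `enzyme` field. The helper name `mw_per_entry`, the IDs and the numbers are hypothetical, and molecular weights are stored as numbers here purely for convenience.

```python
# Mimics the model.ec layout described above; all values are made up.
ec = {
    # One entry per kcat value
    "rxns":   ["r_0001", "r_0001", "r_0002"],
    "enzyme": ["P00001", "P00002", "P00001"],
    "kcat":   [10.0, 40.0, 5.0],
    # Unique entries, one per protein
    "gene":    ["G1", "G2"],
    "uniprot": ["P00001", "P00002"],
    "mw":      [50000.0, 30000.0],
}

def mw_per_entry(ec):
    """Look up the molecular weight that belongs to each kcat entry."""
    index = {uniprot_id: i for i, uniprot_id in enumerate(ec["uniprot"])}
    return [ec["mw"][index[enz]] for enz in ec["enzyme"]]
```

Protein `P00001` appears in two kcat entries, but its molecular weight is stored only once, which is the duplication this scheme is meant to avoid.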
@edkerk fyi the ecModels container can run on any branch in this repository. Let me know if/when you'd like to run some tests (generate ecModels for the 5 organisms).
This is now implemented with #166, but of course the scheme might be modified a bit during the ongoing refactoring.
FYI, after our discussion, this scheme will be modified, to end up with one model.ec.kcat entry for each complex (which can have multiple subunits associated).
Simplified scheme introduced with #167
- One entry per reaction
  - Each entry contains 1 kcat value that is integrated in the model (when `applyKcatConstraints` is run)
  - No duplicate entries with alternative substrates, kcat sources or subunits of a complex.
- `rxnEnzMat` sparse matrix indicates which enzymes are present for each reaction, with non-zero values indicating the copy number if it forms a complex.
| Field | Type | Description |
|---|---|---|
| `model.ec.rxns` | string | Reaction IDs gathered directly from model (after expansion and making irreversible). |
| `model.ec.rxnEnzMat` | sparse matrix | Somewhat comparable to rxnGeneMat, but Enz refers to model.ec.enzymes entries, and this matrix can have any positive integer to indicate the number of subunits in a complex. |
| `model.ec.kcat` | float | Gathered from various sources, see later. One value for each rxn-substrate-enzyme (subunit) combination. |
| `model.ec.source` | string | Where kcat value is derived from, e.g. 'manual', 'dlkcat', 'kcatdb'. Previously suggested to be numeric, might actually be a better solution. |
| `model.ec.notes` | string | Whatever notes the user wants to add |
| `model.ec.eccodes` | string | Gathered directly from model, can be further populated via e.g. Uniprot/KEGG? Only used for classical GECKO ec-matching. |
The following fields are non-redundant and have only unique entries. Metabolite information is left out compared with the earlier suggestion.
| Field | Type | Description |
|---|---|---|
| `model.ec.genes` | string | Directly from model.genes, to indicate which uniprot IDs belong to which gene (see next field). |
| `model.ec.enzymes` | string | Uniprot IDs for each gene. |
| `model.ec.mw` | string | Molecular weight for each protein. |
| `model.ec.sequence` | string | Amino acid sequence for each protein. |
| `model.ec.concs` | float | Measured concentration of each protein. |
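The role of `rxnEnzMat` in the simplified scheme can be sketched in Python as follows. This is an analogue only: a nested list stands in for the MATLAB sparse matrix, and the helper name `subunits`, the IDs and the values are hypothetical.

```python
# Mimics the simplified model.ec layout; all values are made up.
ec = {
    "rxns":    ["r_0001", "r_0002"],
    "kcat":    [10.0, 5.0],              # exactly one kcat per reaction
    "enzymes": ["P00001", "P00002", "P00003"],
    # Rows follow ec["rxns"], columns follow ec["enzymes"].
    "rxnEnzMat": [
        [2, 1, 0],   # r_0001: complex of 2x P00001 and 1x P00002
        [0, 0, 1],   # r_0002: single enzyme P00003
    ],
}

def subunits(ec, rxn_id):
    """Return {enzyme: copy number} for the enzymes used by one reaction."""
    row = ec["rxnEnzMat"][ec["rxns"].index(rxn_id)]
    return {enz: n for enz, n in zip(ec["enzymes"], row) if n != 0}
```

Here `subunits(ec, "r_0001")` lists both members of the complex with their copy numbers, while the reaction still carries a single entry in `ec["kcat"]`.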
This is now fully functional.