GECKO
readKcatTable: for loading curated kcat data
Description of the new feature:
Write a function (readKcatTable) that can load some standardized TSV file with manually curated kcat values, for instance derived from kmax determination. It should contain the relevant information that can be used to populate the model.kcat structure.
I hereby confirm that I have:
- [ ] Done this analysis in the master branch of the repository
- [ ] Checked that a similar issue does not exist already
Could you assign me to the task here? With this function, we can not only manually curate kcat, e.g., using kmax, but also load kcat from GotEnzymes. The latter could be in this function or in another, but I can also do that.
Indeed, loading kcat values from GotEnzymes could also be covered by this function. Besides GotEnzymes there would currently be only DLKcat that would provide some table with kcat values, but we can easily support both file formats if needed.
This partially overlaps with #157, where the latter issue is about the file format (and this would also be how e.g. BRENDA database is provided), while the issue here is about the function that reads the file.
Thanks for assigning me here!
I just thought that the input of this function readKcatTable would be either 1) manually curated kcat values (the input file should contain model rxn id for mapping), 2) predicted kcat values downloaded or retrieved via API from GotEnzymes (the input file should contain KEGG reaction ID and compound ID, and the model file should also contain KEGG info for mapping), or 3) predicted kcat values by DLKcat (maybe DLKcat prediction already links kcat value to model rxn id?).
The output would be a file that will be used to populate the model.kcat structure.
Do you expect that this function readKcatTable should also read the kcat database file generated in #157? But it is then complicated to map onto the model structure as the kcat database file would just contain EC numbers and substrate names which should be correctly linked to model rxn and met ids.
This function should indeed just read 1) manually curated kcat values (see also https://github.com/SysBioChalmers/GECKO/discussions/169). 2) and 3) would be separate functions, as the values could also be retrieved via API (GotEnzymes) or directly parsed to DLKcat, while such a function should then gather the necessary query data from the model (in contrast, readKcatTable is really one direction: file with kcat values -> model).
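To make the "file with kcat values -> model" direction concrete, here is a minimal sketch of the reading step. It is written in Python purely for illustration and is not the GECKO implementation; the function name `read_kcat_table` and the column names `rxns` and `kcat` are assumptions, as the actual file format is still being discussed.

```python
import csv
import io


def read_kcat_table(handle):
    """Parse a tab-separated table with (assumed) 'rxns' and 'kcat' columns.

    Returns a list of (reaction id, kcat) pairs in file order.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["rxns"], float(row["kcat"])) for row in reader]


# Hypothetical file content, only to show the expected shape of the input.
example = io.StringIO("rxns\tkcat\nr_0003\t12.5\nr_0004_EXP_1\t3.1\n")
print(read_kcat_table(example))
```

Keeping the reader separate from the model-matching step means the same parsed pairs could later be passed on to whatever selects and assigns the values.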
Additional complication with 2) and 3) is that they might provide multiple kcat values (substrates, subunits), and the maximum value needs to be selected. Currently (https://github.com/SysBioChalmers/GECKO/discussions/169), readKcatTable is also routed through selectKcatValue, in case the file would contain multiple kcat entries for the same reaction, but it could also directly integrate it in the model.ec.kcat field if that makes more sense.
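The "select the maximum value" step can be sketched as below, assuming a flat list of (reaction, kcat) pairs in which the same reaction may occur more than once. This is a Python illustration only and not how selectKcatValue works internally; the function name and data layout are hypothetical.

```python
def max_kcat_per_reaction(entries):
    """Return {rxn_id: highest kcat} from an iterable of (rxn_id, kcat) pairs."""
    best = {}
    for rxn_id, kcat in entries:
        # Keep the entry only if it is the first or the largest seen so far.
        if rxn_id not in best or kcat > best[rxn_id]:
            best[rxn_id] = kcat
    return best


# Hypothetical values, only to show the behaviour with duplicate reactions.
entries = [("r_0003", 12.5), ("r_0003", 40.0), ("r_0004", 3.1)]
print(max_kcat_per_reaction(entries))
```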
selectKcatValue can now take a kcatList of kcat values that are matched to specific reactions.
readKcatTable should therefore read a TSV file and/or GotEnzymes output (the latter could also be a separate function), match these kcat values to the model reactions, and give a kcatList structure as output. Looking at what GotEnzymes can output, this means the following steps:
- Match the KEGG reaction identifiers from GotEnzymes with KEGG reaction annotations in the model (in model.rxnMiriams, can be easily extracted with RAVEN's extractMiriams(model.rxnMiriams, 'KEGG')), and then use the relevant reaction identifier in kcatList.rxns.
- Match the gene identifier from GotEnzymes with model.ec.genes, and use this index together with model.ec.rxnEnzMat to uniquely match isozymes to their respective reaction (as isozymes would result in multiple matches by reaction KEGG identifiers, as done in the above step), and fill this in kcatList.genes.
- The kcatList.substrate should ideally have the metabolite name (so not just the KEGG metabolite identifier that GotEnzymes currently provides), but this is actually not critical, as selectKcatValue will actually not look at this field when selecting which kcat value to use.
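The matching steps above can be sketched as follows. This is a Python illustration under several assumptions, not GECKO code: the model is represented as a plain dict, the KEGG identifiers are assumed to be already extracted into a list parallel to the reactions, the rows of `rxn_enz_mat` are assumed to correspond to reactions and its columns to genes, and the `kcats` field name and all identifiers in the example are hypothetical.

```python
def build_kcat_list(model, rows):
    """Match GotEnzymes-like rows to model reactions via KEGG id and gene.

    model: dict with
      'ec_rxns'      list of reaction ids
      'kegg'         list of KEGG reaction ids, parallel to 'ec_rxns' (assumed)
      'ec_genes'     list of gene ids
      'rxn_enz_mat'  one row per reaction, truthy where that gene's enzyme
                     participates in the reaction (assumed orientation)
    rows: iterable of dicts with keys 'kegg_rxn', 'gene', 'compound', 'kcat'
    """
    gene_idx = {g: i for i, g in enumerate(model["ec_genes"])}
    kcat_list = {"rxns": [], "genes": [], "substrate": [], "kcats": []}
    for row in rows:
        g = gene_idx.get(row["gene"])
        if g is None:
            continue  # gene is not part of the model
        for r, kegg_id in enumerate(model["kegg"]):
            # The KEGG id alone would match every isozyme reaction; requiring
            # the gene to be linked to this reaction makes the match unique.
            if kegg_id == row["kegg_rxn"] and model["rxn_enz_mat"][r][g]:
                kcat_list["rxns"].append(model["ec_rxns"][r])
                kcat_list["genes"].append(row["gene"])
                kcat_list["substrate"].append(row["compound"])
                kcat_list["kcats"].append(row["kcat"])
    return kcat_list


# Hypothetical model with two isozyme reactions sharing one KEGG identifier.
model = {
    "ec_rxns": ["r_0004_EXP_1", "r_0004_EXP_2"],
    "kegg": ["R_example", "R_example"],
    "ec_genes": ["geneA", "geneB"],
    "rxn_enz_mat": [[1, 0], [0, 1]],
}
rows = [{"kegg_rxn": "R_example", "gene": "geneB", "compound": "C_example", "kcat": 5.0}]
print(build_kcat_list(model, rows)["rxns"])
```

In this example only the second isozyme reaction is returned, because the gene in the row is linked to that reaction alone.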
So I suggest making sure that GotEnzymes output can be used, and this can then be modified to also take other manually curated lists of kcat values.
This is now implemented in applyCustomKcats (PR #199). It is coded in such a way that it should recognize different amounts of input data.
@ae-tafur, does applyCustomKcats also allow for specifying kcats of forward and reverse reactions? Or does it currently assume forward? Perhaps there should be a column indicating "reverse", which can be used if the reaction identifiers are used. Then applyCustomKcats can append _REV to the reaction identifier when looking for matches in model.ec.rxns.
When using the reaction identifiers, it will only get the exact match, since it uses the strcmpi function. So you need to define, for example, r_0003, r_0003_REV or r_0004_EXP_1, r_0004_EXP_2 in the rxns column.
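To illustrate what exact matching means here, the sketch below mimics a case-insensitive whole-string comparison (the behaviour of MATLAB's strcmpi) in Python. It is an illustration only, not the applyCustomKcats code, and the helper name `find_exact` is hypothetical.

```python
def find_exact(rxn_id, ec_rxns):
    """Indices of entries equal to rxn_id, ignoring case (like strcmpi)."""
    target = rxn_id.lower()
    return [i for i, r in enumerate(ec_rxns) if r.lower() == target]


ec_rxns = ["r_0003", "r_0003_REV", "r_0004_EXP_1", "r_0004_EXP_2"]
print(find_exact("r_0003", ec_rxns))      # [0]
print(find_exact("R_0003_rev", ec_rxns))  # [1]
print(find_exact("r_0004", ec_rxns))      # []
```

The last call returns nothing because r_0004 is only a prefix of the stored identifiers, which is why the full identifier including its suffix has to be given in the rxns column.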
Alright, let's clarify this in the protocol.
Clarified in the function documentation, which was more suitable.