
readKcatTable: for loading curated kcat data

Open edkerk opened this issue 3 years ago • 6 comments

Description of the new feature:

Write a function (readKcatTable) that can load some standardized TSV file with manually curated kcat values, for instance derived from kmax determination. It should contain the relevant information that can be used to populate the model.kcat structure.
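As a rough illustration of the request (not GECKO code, which is MATLAB), the Python sketch below parses a small tab-separated table into rows of reaction identifier and kcat value. The column names `rxns` and `kcat` and the function name `read_kcat_table` are assumptions made for this example only; the issue does not fix a file layout.

```python
import csv
import io


def read_kcat_table(handle):
    """Parse a tab-separated kcat table into a list of row dicts.

    Assumes (hypothetically) a header row with 'rxns' and 'kcat' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for record in reader:
        # Convert the kcat column to a number; keep the reaction id as text.
        rows.append({"rxn": record["rxns"], "kcat": float(record["kcat"])})
    return rows


# Small in-memory example standing in for a curated TSV file.
example = "rxns\tkcat\nr_0001\t12.5\nr_0002\t0.8\n"
table = read_kcat_table(io.StringIO(example))
```

A real implementation would also need to validate that every reaction identifier exists in the model before populating the kcat structure.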

I hereby confirm that I have:

  • [ ] Done this analysis in the master branch of the repository
  • [ ] Checked that a similar issue does not exist already

edkerk avatar May 25 '22 21:05 edkerk

Could you assign me to the task here? With this function, we can not only manually curate kcat, e.g., using kmax, but also load kcat from GotEnzymes. The latter could be in this function or in another, but I can also do that.

Yu-sysbio avatar Jun 30 '22 15:06 Yu-sysbio

Indeed, loading kcat values from GotEnzymes could also be covered by this function. Besides GotEnzymes there would currently be only DLKcat that would provide some table with kcat values, but we can easily support both file formats if needed.

edkerk avatar Jul 01 '22 07:07 edkerk

This partially overlaps with #157, where the latter issue is about the file format (and this would also be how e.g. BRENDA database is provided), while the issue here is about the function that reads the file.

edkerk avatar Jul 01 '22 07:07 edkerk

Thanks for assigning me here!

I just thought that the input of this function readKcatTable would be either 1) manually curated kcat values (the input file should contain model rxn id for mapping), 2) predicted kcat values downloaded or retrieved via API from GotEnzymes (the input file should contain KEGG reaction ID and compound ID, and the model file should also contain KEGG info for mapping), or 3) predicted kcat values by DLKcat (maybe DLKcat prediction already links kcat value to model rxn id?). The output would be a file that will be used to populate the model.kcat structure.

Do you expect that this function readKcatTable should also read the kcat database file generated in #157? But it is then complicated to map onto the model structure as the kcat database file would just contain EC numbers and substrate names which should be correctly linked to model rxn and met ids.

Yu-sysbio avatar Jul 01 '22 11:07 Yu-sysbio

This function should indeed just read 1) manually curated kcat values (see also https://github.com/SysBioChalmers/GECKO/discussions/169). 2) and 3) would be separate functions, as the values could also be retrieved via API (GotEnzymes) or the query passed directly to DLKcat, and such a function would first have to gather the necessary query data from the model (in contrast, readKcatTable is really one direction: file with kcat values -> model).

An additional complication with 2) and 3) is that they might provide multiple kcat values (substrates, subunits), and the maximum value needs to be selected. Currently (https://github.com/SysBioChalmers/GECKO/discussions/169), readKcatTable is also routed through selectKcatValue, in case the file contains multiple kcat entries for the same reaction, but it could also write the values directly to the model.ec.kcat field if that makes more sense.
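To make the "select the maximum value" step concrete, here is a minimal Python sketch that keeps the highest kcat per reaction when a table has several entries for the same reaction. It is not the logic of selectKcatValue, which applies its own criteria; the function name `select_max_kcat` and the `(reaction, kcat)` tuple layout are assumptions for illustration.

```python
def select_max_kcat(entries):
    """Return {reaction_id: highest kcat} from (reaction_id, kcat) pairs."""
    best = {}
    for rxn, kcat in entries:
        # Keep the entry only if it is the first or a larger value.
        if rxn not in best or kcat > best[rxn]:
            best[rxn] = kcat
    return best


# Two entries for r_0001 (e.g. different substrates), one for r_0002.
entries = [("r_0001", 3.0), ("r_0001", 7.5), ("r_0002", 1.2)]
selected = select_max_kcat(entries)
```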

edkerk avatar Jul 08 '22 22:07 edkerk

selectKcatValue can now take a kcatList of kcat values that are matched to specific reactions.

readKcatTable should therefore read a TSV file and/or GotEnzymes output (the latter could also be a separate function), match these kcat values to the model reactions, and give a kcatList structure as output. Looking at what GotEnzymes can output, this means the following steps:

  • Match the KEGG reaction identifiers from GotEnzymes with KEGG reaction annotations in the model (in model.rxnMiriams, can be easily extracted with RAVEN's extractMiriams(model.rxnMiriams, 'KEGG')), and then use the relevant reaction identifier in kcatList.rxns.
  • Match the gene identifier from GotEnzymes with model.ec.genes, and use this index together with model.ec.rxnEnzMat to uniquely match isozymes to their respective reaction (as isozymes would result in multiple matches by reaction KEGG identifiers, as done in the above step), and fill this in kcatList.genes.
  • The kcatList.substrate should ideally have the metabolite name (so not just the KEGG metabolite identifier that GotEnzymes currently provides), but this is not critical, as selectKcatValue will not look at this field when selecting which kcat value to use.
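The first two steps above can be sketched as follows. This is a simplified Python illustration, not the GECKO implementation: the dictionaries `rxn_kegg` and `rxn_genes` are hypothetical stand-ins for the KEGG annotations in model.rxnMiriams and for the reaction-enzyme associations in model.ec.rxnEnzMat, and the record layout `(KEGG reaction, gene, kcat)` is an assumed simplification of GotEnzymes output.

```python
def match_records(rxn_kegg, rxn_genes, records):
    """Build a kcatList-like dict from (kegg_rxn, gene, kcat) records.

    A record is assigned to a model reaction only if the KEGG reaction id
    matches AND the gene is associated with that reaction, which
    disambiguates isozymes that share one KEGG reaction id.
    """
    kcat_list = {"rxns": [], "genes": [], "kcats": []}
    for kegg_id, gene, kcat in records:
        for rxn, annotated in rxn_kegg.items():
            if annotated == kegg_id and gene in rxn_genes.get(rxn, set()):
                kcat_list["rxns"].append(rxn)
                kcat_list["genes"].append(gene)
                kcat_list["kcats"].append(kcat)
    return kcat_list


# Two isozyme reactions share one KEGG id (all ids here are made up).
rxn_kegg = {"r_0004_EXP_1": "R00001", "r_0004_EXP_2": "R00001"}
rxn_genes = {"r_0004_EXP_1": {"G1"}, "r_0004_EXP_2": {"G2"}}
records = [("R00001", "G1", 10.0), ("R00001", "G2", 20.0), ("R99999", "G1", 5.0)]
result = match_records(rxn_kegg, rxn_genes, records)
```

Matching on the KEGG identifier alone would assign both records to both isozyme reactions; the gene check is what makes each assignment unique.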

So I suggest first making sure that GotEnzymes output can be used; this can then be modified to also take other manually curated lists of kcat values.

edkerk avatar Sep 28 '22 09:09 edkerk

This is now implemented in applyCustomKcats (PR #199). It is coded so that it should recognize different amounts of input data.

edkerk avatar Feb 14 '23 22:02 edkerk

@ae-tafur, does applyCustomKcats also allow for specifying kcats of forward and reverse reactions? Or does it currently assume forward? Perhaps there should be a column indicating "reverse", which can be used if the reaction identifiers are used. Then applyCustomKcats can append _REV to the reaction identifier when looking for matches in model.ec.rxns.

edkerk avatar Feb 14 '23 22:02 edkerk

When using the reaction identifiers, it will only find exact matches, since it uses the strcmpi function. So you need to define, for example, r_0003, r_0003_REV or r_0004_EXP_1, r_0004_EXP_2 in the rxns column.
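A small Python sketch of this matching behaviour, for illustration only: identifiers are compared as whole strings, ignoring case (as MATLAB's strcmpi does), so a base identifier does not match its _REV or _EXP_ variants. The `reverse` argument mimics the "reverse" column suggested earlier in the thread; it is a hypothetical option here, not a confirmed feature of applyCustomKcats, and `find_rxn_index` is a made-up name.

```python
def find_rxn_index(model_rxns, rxn_id, reverse=False):
    """Return indices of reactions whose id equals rxn_id, ignoring case.

    With reverse=True (hypothetical option), '_REV' is appended first.
    """
    target = (rxn_id + "_REV") if reverse else rxn_id
    target = target.lower()
    return [i for i, rxn in enumerate(model_rxns) if rxn.lower() == target]


model_rxns = ["r_0003", "r_0003_REV", "r_0004_EXP_1", "r_0004_EXP_2"]
forward = find_rxn_index(model_rxns, "r_0003")
backward = find_rxn_index(model_rxns, "r_0003", reverse=True)
no_match = find_rxn_index(model_rxns, "r_0004")  # base id alone matches nothing
```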

ae-tafur avatar Feb 14 '23 22:02 ae-tafur

Alright, let's clarify this in the protocol.

edkerk avatar Feb 14 '23 23:02 edkerk

Clarified in the function documentation, which was more suitable.

edkerk avatar Mar 05 '23 17:03 edkerk