loadComplexData: read Complex Portal or similar input for subunit numbers

Open edkerk opened this issue 2 years ago • 6 comments

Description of the new feature:

If the data is available, it could be taken into account that not all subunits are necessarily present in equal amounts in complexes. This will involve:

  • Gathering the protein complex data (e.g. from Complex Portal)
  • Defining how to represent the complex data in the model structure. This is not trivial: adding stoichiometric coefficients to grRules is not ideal, as they would need to be parsed out each time, while something like rxnGeneMat would not work if there are alternative configurations of a complex, e.g. (2 subunit A + 1 subunit B) or (2 subunit A + 1 subunit C). Edit: the rxnGeneMat option would work if isoenzymes are first split by expandModel.
  • Considering that reported kcat values are (typically) for the whole functional complex, not for individual subunits
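As a rough illustration of the "split first" idea in the second bullet, here is a minimal Python sketch (not GECKO code). It assumes the isoenzymes have already been split into one reaction per complex configuration; the reaction IDs, protein names, and the `build_subunit_matrix` helper are all hypothetical.

```python
# Hypothetical input: one original reaction with two alternative complex
# configurations, already split into separate reactions (IDs are made up).
configs = {
    "R1_EXP_1": {"A": 2, "B": 1},  # 2 subunit A + 1 subunit B
    "R1_EXP_2": {"A": 2, "C": 1},  # 2 subunit A + 1 subunit C
}

def build_subunit_matrix(configs):
    """Return (reaction IDs, protein IDs, matrix of subunit counts)."""
    proteins = sorted({p for cfg in configs.values() for p in cfg})
    rxns = list(configs)
    # One row per split reaction, one column per protein; 0 means "not a subunit".
    matrix = [[configs[r].get(p, 0) for p in proteins] for r in rxns]
    return rxns, proteins, matrix

rxns, proteins, matrix = build_subunit_matrix(configs)
```

Once each configuration has its own row, a plain reaction-by-protein matrix can hold the coefficients without ambiguity, and nothing needs to be parsed out of grRules.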

edkerk avatar May 25 '22 22:05 edkerk

The number of subunits is now included in the model.ec structure in #156, and applyKcatConstraints from the feat/makeEcModel branch takes it into account when applying the kcat constraints.

It does not require the rxnGeneMat strategy as suggested above.

At the moment it assumes a subunit stoichiometry of 1:1(:1:1... etc.). Gathering and parsing Complex Portal data is not yet implemented. This could first be accomplished for organisms that are present in Complex Portal. On top of that, one could imagine complex prediction based on sequence similarity, but that would be implemented much later.
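To make the effect of the 1:1 assumption concrete, here is a small Python sketch (not GECKO code) that compares the total mass of a complex under the default 1:1 assumption with the mass under a known stoichiometry. The molecular weights, the 2:1 stoichiometry, and the `complex_mass` helper are hypothetical values chosen only for illustration.

```python
def complex_mass(mw, stoich=None):
    """Sum subunit molecular weights, weighted by copy number.

    If no stoichiometry is given, every subunit is assumed to be present once,
    which mirrors the 1:1(:1...) assumption described above.
    """
    if stoich is None:
        stoich = {protein: 1 for protein in mw}
    return sum(mw[protein] * stoich[protein] for protein in mw)

# Hypothetical molecular weights in kDa.
mw = {"A": 50.0, "B": 30.0}

assumed = complex_mass(mw)                    # 1:1 assumption
actual = complex_mass(mw, {"A": 2, "B": 1})   # hypothetical A2B complex
```

Since a reported kcat typically refers to the whole complex, underestimating the copy number of a subunit would underestimate the protein mass needed per unit of flux.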

edkerk avatar Jun 16 '22 11:06 edkerk

applyKcatConstraints can deal with the subunit information in model.ec.rxnEnzMat since PR #170, so this Issue is rather about populating the model.ec.rxnEnzMat structure with subunit information from e.g. Complex Portal.

edkerk avatar Jul 08 '22 22:07 edkerk

Obtaining the information from Complex Portal every time it is required may take some time. I already have a getComplexData function that downloads the information from the database, either for all organisms or for a specific organism available in Complex Portal, and saves it in a JSON file with the following structure:

[
  {
    "complexID": "CPX-3244",
    "name": "SCF-Das1 ubiquitin ligase complex",
    "specie": "Saccharomyces cerevisiae; 559292",
    "geneName": [ "CDC34", "DAS1", "HRT1", "SKP1", "CDC53" ],
    "protID": [ "P14682", "P47005", "Q08273", "P52286", "Q12018" ],
    "stochiometry": [ 1, 1, 1, 1, 1 ]
  }
]

So, to integrate the information into model.ec.rxnEnzMat, the function loadComplexData will load the file (this can be done with jsondecode on the .json file, directly in makeEcModel), and makeEcModel will fill in the stoichiometry by mapping it to the rules. Should I also add the gene Systematic Name?
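For illustration, a minimal Python sketch of the loading step (the actual function would use jsondecode in MATLAB). It parses a record with the field names shown above and indexes each complex by its set of protein IDs; the `load_complex_data` helper and the indexing scheme are assumptions, not part of getComplexData.

```python
import json

# Same record layout as the JSON example above (field names kept as-is).
raw = """
[
  {
    "complexID": "CPX-3244",
    "name": "SCF-Das1 ubiquitin ligase complex",
    "specie": "Saccharomyces cerevisiae; 559292",
    "geneName": ["CDC34", "DAS1", "HRT1", "SKP1", "CDC53"],
    "protID": ["P14682", "P47005", "Q08273", "P52286", "Q12018"],
    "stochiometry": [1, 1, 1, 1, 1]
  }
]
"""

def load_complex_data(text):
    """Parse the JSON text and index subunit counts by the set of protein IDs."""
    index = {}
    for record in json.loads(text):
        key = frozenset(record["protID"])
        index[key] = dict(zip(record["protID"], record["stochiometry"]))
    return index

complex_index = load_complex_data(raw)
```

Keying on the protein set makes a later lookup independent of the order in which subunits are listed.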

ae-tafur avatar Oct 04 '22 16:10 ae-tafur

I have just noticed that there is a discussion that includes this issue in #174.

ae-tafur avatar Oct 04 '22 19:10 ae-tafur

It touches on this issue, but what you're raising here is really the core of what is proposed in the opening post.

Your getComplexData function sounds like exactly what is required, but it should probably produce some "complexData" structure (which could alternatively be populated by other sources / approaches). An additional function (applyComplexData?) could then parse this data to modify model.ec.rxnEnzMat.

Currently, model.ec.enzymes would allow matching with the protID, so gene Systematic Name is perhaps not strictly essential right now, but it might make it slightly simpler to match to grRules, so let's include Systematic Name for now.

The main complexity with applyComplexData I can identify is how to match the complexes to reactions. In your example:

  1. if a (split*) reaction is annotated with "P14682", "P47005", "Q08273", "P52286", "Q12018",
  2. assume that this reaction is indeed catalyzed by the "SCF-Das1 ubiquitin ligase complex",
  3. use the defined complex stoichiometry for the reaction mentioned in 1.

(*) isozymes are split, so all proteins/genes annotated to a reaction are subunits.

This is probably a fair assumption (it is unlikely that there are different versions of a complex (all with the same proteins, but with different stoichiometries) that would catalyze different reactions).
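Steps 1 to 3 above could be sketched roughly as follows, in Python for illustration only. The `exact_complex_match` helper, the complex ID, the protein IDs, and the 2:1 stoichiometry are hypothetical; the record fields follow the JSON layout from earlier in the thread.

```python
def exact_complex_match(rxn_proteins, complexes):
    """Return (complexID, {protID: count}) if the reaction's proteins are
    exactly the subunits of one complex, otherwise None."""
    rxn_set = set(rxn_proteins)
    for cpx in complexes:
        if rxn_set == set(cpx["protID"]):
            return cpx["complexID"], dict(zip(cpx["protID"], cpx["stochiometry"]))
    return None

# Hypothetical complex data and (already split) reaction annotations.
complexes = [
    {"complexID": "CPX-HYPO-1", "protID": ["P00001", "P00002"], "stochiometry": [2, 1]},
]

match = exact_complex_match(["P00002", "P00001"], complexes)  # order does not matter
no_match = exact_complex_match(["P00001"], complexes)         # incomplete annotation
```

The returned counts are what would be written into the corresponding row of model.ec.rxnEnzMat; the sketch returns the first matching complex, which relies on the assumption stated above that one protein set does not occur with several stoichiometries.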

However, what if the gene association in the model is missing one of the subunits? It might be good to include an additional output:

  1. if a (split) reaction does not have a 100% match with any complex when considering its proteins,
  2. but this reaction does match with 75% of the proteins in one of the complexes,
  3. give an output structure which contains the reaction ID, its current gene or protein association, the proposed complex association (focusing on the protein identities, not stoichiometries), some identifier to find this proposed complex in Complex Portal,
  4. describe in the applyComplexData function that this output indicates that no complex could be found, but that the user might want to manually curate the proposed complex, as the gene association might not have been complete.
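A minimal Python sketch of this curation output, for illustration only. It interprets "matches with 75% of the proteins" as the fraction of a complex's subunits that are annotated to the reaction, which is an assumption; the `suggest_complexes` helper, the reaction ID, and the output field names are hypothetical.

```python
def suggest_complexes(rxn_id, rxn_proteins, complexes, threshold=0.75):
    """List complexes that the reaction partially matches, for manual curation.

    Returns an empty list if the reaction matches some complex exactly.
    """
    rxn_set = set(rxn_proteins)
    cpx_sets = [(cpx, set(cpx["protID"])) for cpx in complexes]
    if any(rxn_set == subunits for _, subunits in cpx_sets):
        return []  # 100% match exists, nothing to curate
    suggestions = []
    for cpx, subunits in cpx_sets:
        if not subunits:
            continue  # skip records without subunits
        coverage = len(rxn_set & subunits) / len(subunits)
        if coverage >= threshold:
            suggestions.append({
                "rxnID": rxn_id,
                "currentProteins": sorted(rxn_set),
                "proposedProteins": sorted(subunits),
                "complexID": cpx["complexID"],  # identifier to look up in Complex Portal
                "coverage": coverage,
            })
    return suggestions

complexes = [{
    "complexID": "CPX-3244",
    "protID": ["P14682", "P47005", "Q08273", "P52286", "Q12018"],
    "stochiometry": [1, 1, 1, 1, 1],
}]

# Hypothetical reaction whose annotation lacks one subunit (Q12018).
suggestions = suggest_complexes(
    "r_hypothetical", ["P14682", "P47005", "Q08273", "P52286"], complexes
)
```

Note that this coverage measure ignores proteins annotated to the reaction that are not part of the complex; whether those should count against a suggestion is a design choice left open here.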

edkerk avatar Oct 04 '22 19:10 edkerk

  • Actually, loading the JSON file saved by getComplexData with jsondecode already creates a structure of the complex data containing complexID, name, specie, geneName, protID, and stochiometry, so we can add extra data for each complex or extra info from other sources.

  • Since model.ec.enzymes can handle the matching with the protID, we can use that approach instead and avoid translating from protID to Systematic Name.

  • We can load the complex data in makeEcModel from the JSON data and pass it as a parameter to applyComplexData, which will take care of matching the complexes to the rules in each reaction, evaluating as you propose in 1-4. This step should come after splitting the rules and creating the reactions with the proteins.

ae-tafur avatar Oct 04 '22 20:10 ae-tafur