bug: very low fluxes, close to solver tolerance
Description of the bug:
The units by which enzymes are incorporated (mmol and g/gDCW) result in low stoichiometric coefficients, low UBs for protein exchange reactions constrained by proteomics measurements, and low fluxes through these exchange reactions. These values can get close to, or even fall below, the tolerance that is set for the LP solvers (1e-9).
This does not seem to be an issue for ecYeast, but it can become relevant in particular when high-sensitivity proteomics data are integrated, as such data can report very low abundances (sub-nanomolar). Large kcat values can likewise result in very low fluxes. So far I have only encountered issues in proteome-constrained models (but at the same time I don't do much analysis with the _batch model).
In selected models, running flux minimization (solveLP(model,1)) throws an error that no solution is found. I don't know enough about the solver to pin down the exact problem. What surprised me more were errors in flexibilizeProteins: increasing the UB of a measured protein exchange reaction 10-fold, without changing anything else, resulted in a model that could no longer be solved (again, not enough knowledge of the solver to pin down the exact problem). This last problem in particular is worrying, as a solution was found with the original UB (although the resulting growth rate was too low), but increasing the UB 'breaks' the model.
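For reference, a minimal way to probe this is sketched below (assuming RAVEN's solveLP, whose stat field is 1 when an optimal solution was found):
sol = solveLP(model,1); % second argument 1 requests flux minimization
if isempty(sol.x) || sol.stat ~= 1
    fprintf('No feasible solution found (stat = %d)\n', sol.stat)
end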
Expected behavior:
It is better not to have fluxes close to the solver's tolerance.
Inherently, I do not see any reason why the enzyme usage coefficients and UBs could not be multiplied by 1000. This way, the enzyme usage and measured protein exchange reactions are in micromole/gDCW/h (instead of millimole), while the protein pool exchange reaction is in mg/gDW (instead of g/gDW).
Suggested solution:
A quick workaround for proteomics-integrated models is to make the following changes in ecModel and ecModel_batch before running generate_protModels, scaling the usage coefficients 1000-fold:
protIdx = find(startsWith(ecModel.mets,'prot_'));
pExcIdx = find(contains(ecModel.rxns,'prot_'));
for i = 1:length(protIdx)
    enzRxnIdx = find(ecModel.S(protIdx(i),:)); % All reactions involving this protein
    enzRxnIdx = setdiff(enzRxnIdx,pExcIdx);    % Leave out protein exchange reactions
    coeffs = ecModel.S(protIdx(i),enzRxnIdx)*1000;
    ecModel.S(protIdx(i),enzRxnIdx) = coeffs;
end
And make the following two changes in generate_protModels.m to also increase the UBs 1000-fold:
...
% multiply proteomics data
abundances = cell2mat(absValues(1:grouping(i)))*1000;
...
% multiply Ptot
[ecModelP,usagesT,modificationsT,~,coverage] = constrainEnzymes(ecModelP,f,GAM,Ptot(i)*1000,pIDs,abundances,Drate(i),flexGUR);
...
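To check that the rescaling is purely a change of units, one could compare optimal objective values before and after (a sketch; ecModel_unscaled is a hypothetical copy taken before applying the loop above, and solveLP is RAVEN's):
solA = solveLP(ecModel_unscaled);
solB = solveLP(ecModel); % with 1000-fold usage coefficients and UBs
% Only the unit of the enzyme usage fluxes changes, so the objective
% (e.g. growth rate) should agree to numerical precision
fprintf('objective unscaled: %g, scaled: %g\n', solA.f, solB.f)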
Note that some reported metrics no longer make sense (for instance, "Total protein in model = 217 g/gDW"), but taking into account that the units of usage have changed, these models still perform the same (without the potential problems of very low fluxes).
One could argue that these changes should rather be implemented from the beginning, when the ecModel is generated by convertToEnzymeModel, so that the definition of the enzyme usage flux is consistent across all ec-models. But this is quite a drastic change and might introduce confusion if an earlier ec-model is then used with a later version of GECKO. For now, I can make these changes in a dedicated branch, to be used with my current project and available for other users running into similar issues. In case of a future major overhaul of GECKO, this might be something to consider implementing. Or would you argue that this is critical enough to break backwards compatibility?
Context
I hereby confirm that I have:
- [x] Tested my code with all requirements for running GECKO
- [x] Done this analysis in the master branch of the repository
- [x] Checked that a similar issue does not exist already
Some side remarks:
- The problem that you describe with increasing the upper bound is probably due to the overall increased range of values. You mention that some bounds are on the order of 1e-9, so the range to a 'normal' upper bound of a thousand spans twelve orders of magnitude. If you multiply by a thousand, you suddenly span a range of fifteen orders of magnitude. That is problematic for a solver (see the sketch after this list for a quick way to quantify the span).
- Some solvers can automatically scale a problem, but may have trouble maintaining tolerance thresholds after scaling the problem back to its original values. If you can find a reasonable biological argument for entering certain values into the model on a better scale, I think that's an elegant solution.
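To quantify that span, one could inspect the orders of magnitude covered by the nonzero coefficients and finite bounds (a sketch, assuming a RAVEN/COBRA-style model struct with fields S, lb and ub):
vals = [abs(nonzeros(model.S)); ...
        abs(model.ub(isfinite(model.ub) & model.ub ~= 0)); ...
        abs(model.lb(isfinite(model.lb) & model.lb ~= 0))];
span = log10(max(vals)/min(vals)); % orders of magnitude in the problem data
fprintf('Problem data span %.1f orders of magnitude\n', span)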
Sounds like that could be the problem then: not so much the low values, but the large span in values. But multiplying the lowest bounds (e.g. 1e-9) 1000-fold would increase their value (to e.g. 1e-6) and thereby decrease the span, right?
And while it sounds like just a quick fix, multiplying by 1000 could easily be rationalized by stating that the coefficients of enzyme usage are in umol instead of mmol. It's just a matter of defining the units, the same way the upper bound of the protein pool exchange reaction is defined as mg/gDW instead of g/gDW. The reported metrics that no longer make sense would require those functions to use the new unit definitions instead. But if that were hard-coded, it would no longer be backwards compatible.
> Sounds like that could be the problem then: not so much the low values, but the large span in values. But multiplying the lowest bounds (e.g. 1e-9) 1000-fold would increase their value (to e.g. 1e-6) and thereby decrease the span, right?
Sorry, I must have misread you. If you only scale the "problematic" bounds, then indeed you are decreasing the span. In that case, your following statement leads me to believe that you have lifted the bound of the constraint above the tolerance threshold, such that it can no longer be treated as satisfied?
> increasing the UB of a measured protein exchange reaction 10-fold, without changing anything else, resulted in a model that could no longer be solved
There are actually a number of published (simple FBA) models that are infeasible if you set a more stringent threshold or use an exact solver. Usually, some low coefficient biomass component is actually being ignored, i.e., flux is zero when it should be small but positive.
> And while it sounds like just a quick fix, multiplying by 1000 could easily be rationalized by stating that the coefficients of enzyme usage are in umol instead of mmol. It's just a matter of defining the units, the same way the upper bound of the protein pool exchange reaction is defined as mg/gDW instead of g/gDW. The reported metrics that no longer make sense would require those functions to use the new unit definitions instead. But if that were hard-coded, it would no longer be backwards compatible.
Sounds like a great solution to me!
> Sorry, I must have misread you. If you only scale the "problematic" bounds, then indeed you are decreasing the span. In that case, your following statement leads me to believe that you have lifted the bound of the constraint above the tolerance threshold, such that it can no longer be treated as satisfied?
Indeed, it is only the enzyme usage coefficients and the UBs of the enzyme exchange reactions; the non-ec part of the model remains untouched.
Repurposing this Issue to continue the discussion started in https://github.com/SysBioChalmers/GECKO/pull/180#issuecomment-1312189599.
@feiranl suggested:
> As I understand, GECKO2 is converting mmol to gram in the enzyme usage rxns, which causes small fluxes. We could actually just change the unit of the protein pool from gram to mg; the enzyme usage to protein pool conversion would then be from mmol to mg, which scales up 3 orders of magnitude, and all units in most of the reactions stay in mmol, except the protein pool rxn.
I now think this actually doesn't work. Below is an example with hexokinase of size 53.7 kDa, and a kcat of 274 /sec:
274 /sec = 986400 /hour --> 1/kcat = 1.0e-6
So the metabolic reaction becomes:
Glucose + ATP + 1.0e-6 P04806 -> Glc6P + ADP + Pi
And the draw_prot reaction is:
53.7 prot_pool -> P04806, where the coefficient indicates how many grams of prot_pool are required to make 1 mmol of P04806.
With a metabolic flux of 5 mmol/gDCW/h, the enzyme usage is 5e-6, and that will also be the flux through the draw_prot reaction. The amount of prot_pool used is 53.7*5e-6=2.69e-4 gram protein.
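The numbers in this example can be reproduced directly (a sketch; values as given above, with 1/kcat rounded to 1.0e-6):
kcat  = 274*3600;  % 986400 /hour
coeff = 1/kcat;    % ~1.0e-6, coefficient of P04806 in the metabolic reaction
MW    = 53.7;      % kDa, i.e. g/mmol
v     = 5;         % metabolic flux, mmol/gDCW/h
usage = v*coeff;   % ~5e-6, flux through the draw_prot reaction
pool  = usage*MW;  % ~2.69e-4 g protein drawn from prot_pool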
Because of the low stoichiometric coefficient of P04806 in the metabolic reaction, the draw_prot reaction will have a low flux value (= core of the problem). To increase the enzyme usage flux, one can specify the enzyme usage as
Glucose + ATP + 1.0e-3 P04806 -> Glc6P + ADP + Pi
And the draw_prot reaction as
0.0537 prot_pool -> P04806.
The usage flux is then 5e-3 (solving our key problem), and the prot_pool used is 0.0537*5e-3=2.69e-4 gram protein, the same as before. The prot_pool upper bound still reflects g/gDCW. We could change this to mg/gDCW, but that does not affect the stoichiometric coefficient of P04806 in the metabolic reaction, and therefore does not give a higher flux through the draw_prot reaction. So the only way to resolve the issue with the low fluxes is to modify the stoichiometric coefficients of the proteins in the metabolic reactions, not to scale the protein pool.
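As a sketch of that modification on an existing ecModel (hypothetical variable names; assumes GECKO-style identifiers where enzymes are metabolites named prot_<UniProt>, the pool metabolite is prot_pool, and the draw reactions are named draw_prot_<UniProt>):
scale      = 1e3; % enzyme usage in umol instead of mmol
protMetIdx = find(startsWith(ecModel.mets,'prot_') & ~strcmp(ecModel.mets,'prot_pool'));
drawRxnIdx = find(startsWith(ecModel.rxns,'draw_prot_'));
poolMetIdx = find(strcmp(ecModel.mets,'prot_pool'));
% Scale enzyme coefficients in the metabolic reactions up by 1e3...
metRxnIdx = setdiff(find(any(ecModel.S(protMetIdx,:),1)), drawRxnIdx);
ecModel.S(protMetIdx,metRxnIdx) = ecModel.S(protMetIdx,metRxnIdx)*scale;
% ...and prot_pool coefficients in the draw reactions down by 1e3, so the
% mass drawn from the pool (still in g/gDCW) is unchanged
ecModel.S(poolMetIdx,drawRxnIdx) = ecModel.S(poolMetIdx,drawRxnIdx)/scale;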
If the draw_prot reaction is instead replaced by a prot_exchange reaction, constrained with UBs from proteomics experiments, then the proteomics data should be given in umol/gDCW, instead of the mmol/gDCW used in GECKO 1 and 2.
Finally, while this decreases the range of flux values in the model, the example above has a relatively high flux of 5 mmol/gDCW/h. It might therefore be worthwhile to correct by six orders of magnitude instead of three, i.e., give protein usage and protein levels in nmol?
Specifically @feiranl and @Yu-sysbio, do you agree?
This looks very good, and I agree with you on this modification: multiplying the coefficient of the enzyme in the metabolic reaction by 1e3 (or 1e6) and the coefficient of prot_pool in the prot_pool reaction by 1e-3 (or 1e-6). We can proceed with this, but must clarify everywhere (including code, manuscript and figures) that the coefficient of the enzyme in the metabolic reaction is not 1/kcat but 1e3/kcat (or 1e6/kcat), in which the unit of kcat is /hour. This conversion may confuse biologists a little, but it is not a big issue.
Inspired by this, I here propose another idea for your consideration, which avoids introducing 3 or 6 orders of magnitude in the reactions. The idea is to integrate the molecular weight (MW) into the metabolic reaction, i.e., to use MW/kcat rather than 1/kcat. The term MW/kcat is just the protein cost proposed in my previous studies, and this will also link the full version to the light version.
To illustrate, I also use the hexokinase example: MW = 53.7 kDa (or 53700 Da), kcat = 274 /sec = 986400 /hour, hence MW/kcat = 53.7/986400 ≈ 53.7e-6 (using 1/kcat ≈ 1.0e-6 as above).
The metabolic reaction:
Glucose + ATP + 53.7e-6 P04806 -> Glc6P + ADP + Pi
And the prot_pool reaction:
prot_pool -> P04806
With a metabolic flux of 5 mmol/gDCW/h, the flux through the draw_prot reaction is 53.7e-6*5 = 2.69e-4 gram protein.
We can also get a high flux through the draw_prot reaction by adopting the unit Da rather than kDa, i.e., MW/kcat = 53700/986400 ≈ 53.7e-3.
The metabolic reaction becomes:
Glucose + ATP + 53.7e-3 P04806 -> Glc6P + ADP + Pi
And the prot_pool reaction is unchanged, but the unit is changed:
prot_pool -> P04806
With a metabolic flux of 5 mmol/gDCW/h, the flux through the draw_prot reaction is 53.7e-3*5 = 2.69e-1 milligram protein. This changes the unit of the total protein pool to mg protein/gDCW.
This will also change the way proteomics data are integrated: the input proteomics data should then be given in g/gDCW or mg/gDCW.
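For illustration, the Da-based coefficient and the resulting usage flux can be computed as follows (a sketch with hypothetical variable names, continuing the hexokinase example):
MW    = 53700;    % Da (g/mol)
kcat  = 274*3600; % 986400 /hour
coeff = MW/kcat;  % ~53.7e-3, coefficient of P04806 in the metabolic reaction
v     = 5;        % metabolic flux, mmol/gDCW/h
usage = v*coeff;  % ~0.269 mg protein/gDCW drawn via the draw_prot reaction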
This is actually a very elegant solution. Using Da greatly reduces the range of flux values, while it indeed makes the full and light versions work similarly.