cobrapy
cobrapy copied to clipboard
Top-down restructuring of Model / Reaction / Metabolite objects
So this was brought up in #303 and #389, it might be worth talking about the relative merits / inconvenience of having Reaction, Metabolite, and Gene objects aware of the model they're a part of.
If we removed the Model object from them, it might make copy operations much faster and code maintenance in general much more straight-forward. (and maybe remove the odd errors that require a call to Model.repair).Metabolites and Reactions could then unambiguously belong to a number of different models.
Downsides would be that calls to things like metabolite.reactions would have to get replaced with method calls in Model.py, using perhaps packages like networkx to quickly compute parent/child nodes on the fly. Also, methods like Reaction.flux, Metabolite.summary, etc. would all have to be changed somehow.
@hredestig and I discussed at length today the merits and pit falls of reorganizing the cobrapy package completely. While we both agree that it'd be very nice to have, we're finding it hard to explain/quantify the value of spending the necessary amount of time on it. Some of our discussed pros were: mental clarity, proper encapsulation leading to better testing/ fewer bugs, ease of contribution for anyone, excitement about developing with/for a well structured, documented, and maintained open-source project. Some cons were: Major time sink, current version seems to fill the need, no feature addition, noone besides top contributors will benefit/notice.
I will spend some time tomorrow thinking about the complex dependencies between classes and functions. Maybe it's feasible to decide on a re-organization goal that can be achieved gradually.
cobrapy Structure
General Remarks
- design/concept whether to use object IDs or objects themselves as identifiers seems unclear
- it would serve clarity to remove legacy interface completely, it seems like there is a lot of code rot
- a new package for cobra plotting?
General Class Design Notes
- all attributes should appear in
__init__(can be set toNoneat first) - instances should not modify other instances (if wanted for backwards compatibility, let methods call external functions
- classes are for holding and accessing data/information
- should only ever act on their own direct information
- classes should be encapsulated, if a class wants to modify another, there needs to be a method for it, not act on its attributes directly!
cobra.core classes
- DictList
- aim seems to be behavior of a
pandas.Series - query method could be something interesting
- aim seems to be behavior of a
- Formula
- deprecated
- definition of valid formula regex seems narrow
- sum up molecular weight seems useful (units?!)
- Gene
- use a lexer instead of
astfor gene reaction rules? - attribute
functional - model integration rightly deprecated here, deferred to model manipulation
- use a lexer instead of
- Metabolite
- has taken over formula management
- accesses current
model.solutionfor primal, dual, and shadow prices - remove self from model and reactions
- summary method uses
cobra.flux_analysis.summary.metabolite_summary
- Model
- god class
- modifies way too many other classes directly
- Object
- property
id - attributes
name,notes,annotation - checks for existence of
_modelinself.__dict__
- property
- Reaction
- again gene reaction rule parser (lexer) could be useful
- accesses variables in models and generate their IDs
- flux expression
- modify bounds, primal, and dual
- remove self from model
- delete a reaction from all its references
- basic question: should reactions be (im)mutable? strong argument that modified reaction is a different reaction
- build reaction from string
- Solution
- needs better optlang understanding
- Species
MetaboliteandReactionboth inherit fromSpecies, why?
cobra.config
- modifies root logger! that's horrible
- there should be a config object such that it can be modified and used at runtime
cobra.design
- belongs into
cameo?
cobra.topology
- still relevant?
- should topology based metabolic investigations be a new package?
This is a really nice list, thanks for thinking about it @Midnighter! Agreed, there's certainly some cleaning up to do.
I think we should tentatively milestone this as 0.7, as I do think we want to get the optlang interface working (if messily) and worry about optimizing / streamlining the code down the road.
Reaction doesn't seem to inherit from Species... I really think Species and Metabolite could get merged.
On DictList, there's definitely some customized functionality for cobrapy at this point. But a lot of it does mirror pandas.Series. Subclassing pandas objects used to be a huge pain, but it seems things may have gotten better at this point. Maybe we just inherit from a pandas.Series?
Whats about the summary methods? Is this a list of methods that require knowledge of the model and solution environments? Even metabolite.reactions (ab)uses that, so perhaps add that to the list.
I had some more time on the train on the way home. In the following a few musings:
cobrapy Class Encapsulation
Details where class encapsulation is broken and other remarks.
- should model, reaction, and metabolite inherit from optlang classes? (probably not)
- think hard about interaction with solver
- there seems to be a separation between model and solver interface that is not always intuitive/obvious
Metabolite
_set_id_with_model- checks model metabolites
- assigns current model solver and modifies its constraint
- regenerates index on model metabolites
constraint- get on model's current solver's constraint
formula_weightand the periodic table, smarter format to store values and units?y- model solution y_dict
shadow_price- model solution attribute
remove_from_model- remove itself and potentially affected reactions from the model
summary- calls
cobra.flux_analysis.summary.metabolite_summaryseems a good intermediate solution
- calls
Reaction
General thought: should reactions be treated as immutable container types?
- attributes for genes, metabolites, model
- keeps track of bounds, variable type, forward and reverse variable
set_id_with_model- check model reactions
- gets variables
- regenerates reactions index
forward_variableandreverse_variable- access model's solver's variables
_reset_var_cache- set
objective_coefficient lower_boundandupper_bound- update variables
flux- gets variables' primal values
reduced_costs- gets variables' dual values
x- like
flux
- like
reversibility- depends on bounds
remove_from_model- modifies model, potentially metabolites
delete- similar to remove
update_forward_and_reverse_bounds- should reactions change solver problem?
Model
__setstate__- updates all contained objects
- attributes for context, solver, interface, and solution
copyseems intrusive but maybe has to beadd_metaboliteswhat's the point without reactions?_populate_solverexample of extra functionoptimizerepairsummarygood example...
Solution
- rather than all the model/solver interactions, have solution be built from solver with current values?
So if metabolites didn't have an attached model or reaction, Reactions could just be bags of metabolites and models could be bags of reactions. Model.metabolites could be generated on the fly if needed from a set union.
One concern is what happens when we call model.reaction.my_reaction.knock_out(). Context management as currently implemented would be bricked. And to update the solver, we'd have to check each reaction for possible changes, which would be a lot slower.
if we redid the context solution, one option could be to follow something like the model of pymc3, where reactions and metabolites could find the model (or solution) of interest on-the-fly by checking the global stack.
Having reactions immutable would be hard. Especially with bounds and ids. I am fine with reactions having a link to a model and interacting this way with the solver object since reactions actually have more connections to the solver object than the Model class itself. In that sense cobra.Model is not so much a god-class but more a glue-class syncing a set of reactions to the solver.
@Midnighter, here we go!
Hi all!
So, from our gitter discussion here are some points I would like to bring to the development:
Model attributes and inconsistent states
1. model.solution
This atribute doesn't make sense. If I solve the model, it will store a solution in an atribute inside the model. After that, the model can be manipulated, like changing bounds, adding reactions, changing the stoichiometry of reactions, etc. However, if I change the model state, the model.solution is no longer valid. My suggestion is that model.solution should be removed and it should be used as follows:
solution1 = model.optimize()
# do whatever modifications
# ...
solution2 = model.optimize()
# solution1 and solution2 are different and there is no confusion about what is inside the model.
2. What is a Reaction
When we look at Reaction, there is a model atribute. Also, the Model has a reactions atribute which is a DictList with reactions. In the add_reactions method, we change the Reaction.model atribute. So if I take a reaction from model1 and I add the same reaction to model2, the reaction is no longer in model1.
So, what is Reaction?
a) Is reaction a representation of any metabolic conversion? Or b) is the reaction a representation of a metabolic conversion in a specific model?
If a) reaction cannot have a model attribute. If b) we cannot add a reaction from a model and we need to build the reaction with a model. This raises implementation issues such as: if a) and the reaction can be in any model, then how do we represent the GPR. In the a) scenario, it should not be in the reaction.
Optimized ideas
We discussed the problem of how slow it is to get primals and duals from the solver. I think the implementation of optimize and/or solve should as simple as possible to start with. As we identify advance stuff, we can add protected methods, keyword arguments, etc. So, for optimize , my suggestion is:
(with @cdiener contribution)
optimize(..., reactions=self.reactions) returns all fluxes as default. When reactions=[] returns no fluxes and when reactions=[r1, r2, ...] returns only the subset of fluxes.
I have a bit of mixed feeling here, so I'll share my ideas.
1.Other advanced keyword arguments like shadow_prices=True, reduced_cots=True, irreducibly_inconsistent_constraints=False can be set on the method. The default values would be to return the most complete solution for a regular user.
-
Advanced implementations can interact with the solver directly to retrieve duals and primals and other information. Here my main concern is concurrent multithread implementations. We would need to be able to lock the model.
-
This is the most disruptive idea: get rid of
model.solve. Then if the user want a flux distribution he has to explicitly use a method to retrieve it:fba(model),pfba(model),moma(model)ormy_brilliant_method(model). The implementation and access to the solver would be done through less exposed methods.
I hope that we can use this discussion forum give more stability and coherence to cobrapy implementation.
Cheers,
Joao
Okay here are my 2 cents :sweat_smile:
Generally I think there are two topic contained here. One is the API design (1) and the other is the actual source code organisation underneath (2). I tend to think that API > code right now since (2) can be changed at any point without having to worry about versioning, whereas stabilizing the API is somewhat important given that cobrapy is now ~5 years old has a version < 1.0, e. g. no stable API.
Solution
I think we basically agreed that model.solution will be removed. Regarding what should be contained in a solution I'm kind of undecided. I think the idea to get only partial information is merited (I do not find myself checking shadow prices too often), however I would like to keep the API simple as well...
I performance should not be a design argument here. Most of the cobrapy algorithms use the optlang interface and other tricks to get around using the solution object. Also the performance of Solution is tricky to optimize anyways since even with an instantaneous optlang.Model.primal_values we still have to do something like
fluxes = {r.id: primals[r.forward_variable.name] - primal[r.reverse_variable.name] for r in model.reactions}
which in my benchmarks becomes the new bottleneck. So IMHO we should concentrate on the usability here and forget about performance for now.
Model - Reaction interleaving
I think this has already improved quite a lot. If you look at the (private) attributes of the Reaction class now you can see that Reaction no longer carries around attributes that are actually attributes of the solver such as _objective_coefficient, _lower_bound etc. All those are now accessed by methods on Reaction that act on the solver object using the appropriate methods from the solver. The only thing that remains are those methods, which (for me) are more like syntactic sugar to make model configuration easier. Note that all those methods throw the appropriate error when a Reaction is not connected to a model. The alternative would be to have all bounds setting, objective coefficient setting etc. managed by the Model class, which I think would make it less usable. Also there are some difficult cases like changing the reaction stoichiometry where it would be difficult to keep Model and Reactions in sync.
So I think having some methods on Reaction that can only be active when associated with a model is a good price to pay for the much nicer API (look at other packages where bounds have to be passed to the optimization methods and things like that).
Idea 3 from @joaocardoso
I get where you are coming from. The optimize method can be misleading, especially when getting all fluxes since they are pretty arbitrary. Given the centrality of optimize in cobrapy I think removing it would create problems but maybe in a far future it could default to something more sensible (pFBA maybe). The other methods pretty much work like you described:
optimize_minimal_fluxes(model) # name is probably not optimal
flux_variability_analysis(model)
single_reaction_deletion(model, method="moma")
sample(model, 1000)
# etc.
I think I'll close this for now since a lot of that has made its way into cobrapy at this point :smile:. For instance, separate solutions, better encapsulation, less connections between object, config singleton, etc. Everybody, feel free to open a discussion for points that have not been addressed.