cobrapy icon indicating copy to clipboard operation
cobrapy copied to clipboard

Top-down restructuring of Model / Reaction / Metabolite objects

Open pstjohn opened this issue 8 years ago • 8 comments

So this was brought up in #303 and #389, it might be worth talking about the relative merits / inconvenience of having Reaction, Metabolite, and Gene objects aware of the model they're a part of.

If we removed the Model object from them, it might make copy operations much faster and code maintenance in general much more straight-forward. (and maybe remove the odd errors that require a call to Model.repair).Metabolites and Reactions could then unambiguously belong to a number of different models.

Downsides would be that calls to things like metabolite.reactions would have to get replaced with method calls in Model.py, using perhaps packages like networkx to quickly compute parent/child nodes on the fly. Also, methods like Reaction.flux, Metabolite.summary, etc. would all have to be changed somehow.

pstjohn avatar Feb 15 '17 15:02 pstjohn

@hredestig and I discussed at length today the merits and pit falls of reorganizing the cobrapy package completely. While we both agree that it'd be very nice to have, we're finding it hard to explain/quantify the value of spending the necessary amount of time on it. Some of our discussed pros were: mental clarity, proper encapsulation leading to better testing/ fewer bugs, ease of contribution for anyone, excitement about developing with/for a well structured, documented, and maintained open-source project. Some cons were: Major time sink, current version seems to fill the need, no feature addition, noone besides top contributors will benefit/notice.

I will spend some time tomorrow thinking about the complex dependencies between classes and functions. Maybe it's feasible to decide on a re-organization goal that can be achieved gradually.

Midnighter avatar Feb 16 '17 23:02 Midnighter

cobrapy Structure

General Remarks

  • design/concept whether to use object IDs or objects themselves as identifiers seems unclear
  • it would serve clarity to remove legacy interface completely, it seems like there is a lot of code rot
  • a new package for cobra plotting?

General Class Design Notes

  • all attributes should appear in __init__ (can be set to None at first)
  • instances should not modify other instances (if wanted for backwards compatibility, let methods call external functions
  • classes are for holding and accessing data/information
  • should only ever act on their own direct information
  • classes should be encapsulated, if a class wants to modify another, there needs to be a method for it, not act on its attributes directly!

cobra.core classes

  • DictList
    • aim seems to be behavior of a pandas.Series
    • query method could be something interesting
  • Formula
    • deprecated
    • definition of valid formula regex seems narrow
    • sum up molecular weight seems useful (units?!)
  • Gene
    • use a lexer instead of ast for gene reaction rules?
    • attribute functional
    • model integration rightly deprecated here, deferred to model manipulation
  • Metabolite
    • has taken over formula management
    • accesses current model.solution for primal, dual, and shadow prices
    • remove self from model and reactions
    • summary method uses cobra.flux_analysis.summary.metabolite_summary
  • Model
    • god class
    • modifies way too many other classes directly
  • Object
    • property id
    • attributes name, notes, annotation
    • checks for existence of _model in self.__dict__
  • Reaction
    • again gene reaction rule parser (lexer) could be useful
    • accesses variables in models and generate their IDs
    • flux expression
    • modify bounds, primal, and dual
    • remove self from model
    • delete a reaction from all its references
    • basic question: should reactions be (im)mutable? strong argument that modified reaction is a different reaction
    • build reaction from string
  • Solution
    • needs better optlang understanding
  • Species
    • Metabolite and Reaction both inherit from Species, why?

cobra.config

  • modifies root logger! that's horrible
  • there should be a config object such that it can be modified and used at runtime

cobra.design

  • belongs into cameo?

cobra.topology

  • still relevant?
  • should topology based metabolic investigations be a new package?

Midnighter avatar Feb 18 '17 15:02 Midnighter

This is a really nice list, thanks for thinking about it @Midnighter! Agreed, there's certainly some cleaning up to do.

I think we should tentatively milestone this as 0.7, as I do think we want to get the optlang interface working (if messily) and worry about optimizing / streamlining the code down the road.

Reaction doesn't seem to inherit from Species... I really think Species and Metabolite could get merged.

On DictList, there's definitely some customized functionality for cobrapy at this point. But a lot of it does mirror pandas.Series. Subclassing pandas objects used to be a huge pain, but it seems things may have gotten better at this point. Maybe we just inherit from a pandas.Series?

Whats about the summary methods? Is this a list of methods that require knowledge of the model and solution environments? Even metabolite.reactions (ab)uses that, so perhaps add that to the list.

pstjohn avatar Feb 18 '17 23:02 pstjohn

I had some more time on the train on the way home. In the following a few musings:

cobrapy Class Encapsulation

Details where class encapsulation is broken and other remarks.

  • should model, reaction, and metabolite inherit from optlang classes? (probably not)
  • think hard about interaction with solver
  • there seems to be a separation between model and solver interface that is not always intuitive/obvious

Metabolite

  • _set_id_with_model
    • checks model metabolites
    • assigns current model solver and modifies its constraint
    • regenerates index on model metabolites
  • constraint
    • get on model's current solver's constraint
  • formula_weight and the periodic table, smarter format to store values and units?
  • y
    • model solution y_dict
  • shadow_price
    • model solution attribute
  • remove_from_model
    • remove itself and potentially affected reactions from the model
  • summary
    • calls cobra.flux_analysis.summary.metabolite_summary seems a good intermediate solution

Reaction

General thought: should reactions be treated as immutable container types?

  • attributes for genes, metabolites, model
  • keeps track of bounds, variable type, forward and reverse variable
  • set_id_with_model
    • check model reactions
    • gets variables
    • regenerates reactions index
  • forward_variable and reverse_variable
    • access model's solver's variables
  • _reset_var_cache
  • set objective_coefficient
  • lower_bound and upper_bound
    • update variables
  • flux
    • gets variables' primal values
  • reduced_costs
    • gets variables' dual values
  • x
    • like flux
  • reversibility
    • depends on bounds
  • remove_from_model
    • modifies model, potentially metabolites
  • delete
    • similar to remove
  • update_forward_and_reverse_bounds
    • should reactions change solver problem?

Model

  • __setstate__
    • updates all contained objects
  • attributes for context, solver, interface, and solution
  • copy seems intrusive but maybe has to be
  • add_metabolites what's the point without reactions?
  • _populate_solver example of extra function
  • optimize
  • repair
  • summary good example...

Solution

  • rather than all the model/solver interactions, have solution be built from solver with current values?

Midnighter avatar Feb 20 '17 09:02 Midnighter

So if metabolites didn't have an attached model or reaction, Reactions could just be bags of metabolites and models could be bags of reactions. Model.metabolites could be generated on the fly if needed from a set union.

One concern is what happens when we call model.reaction.my_reaction.knock_out(). Context management as currently implemented would be bricked. And to update the solver, we'd have to check each reaction for possible changes, which would be a lot slower.

if we redid the context solution, one option could be to follow something like the model of pymc3, where reactions and metabolites could find the model (or solution) of interest on-the-fly by checking the global stack.

pstjohn avatar Feb 21 '17 18:02 pstjohn

Having reactions immutable would be hard. Especially with bounds and ids. I am fine with reactions having a link to a model and interacting this way with the solver object since reactions actually have more connections to the solver object than the Model class itself. In that sense cobra.Model is not so much a god-class but more a glue-class syncing a set of reactions to the solver.

cdiener avatar Feb 21 '17 19:02 cdiener

@Midnighter, here we go!

Hi all!

So, from our gitter discussion here are some points I would like to bring to the development:

Model attributes and inconsistent states

1. model.solution

This atribute doesn't make sense. If I solve the model, it will store a solution in an atribute inside the model. After that, the model can be manipulated, like changing bounds, adding reactions, changing the stoichiometry of reactions, etc. However, if I change the model state, the model.solution is no longer valid. My suggestion is that model.solution should be removed and it should be used as follows:

solution1 = model.optimize()
# do whatever modifications
# ...
solution2 = model.optimize()
# solution1 and solution2 are different and there is no confusion about what is inside the model.

2. What is a Reaction

When we look at Reaction, there is a model atribute. Also, the Model has a reactions atribute which is a DictList with reactions. In the add_reactions method, we change the Reaction.model atribute. So if I take a reaction from model1 and I add the same reaction to model2, the reaction is no longer in model1.

So, what is Reaction?

a) Is reaction a representation of any metabolic conversion? Or b) is the reaction a representation of a metabolic conversion in a specific model?

If a) reaction cannot have a model attribute. If b) we cannot add a reaction from a model and we need to build the reaction with a model. This raises implementation issues such as: if a) and the reaction can be in any model, then how do we represent the GPR. In the a) scenario, it should not be in the reaction.

Optimized ideas

We discussed the problem of how slow it is to get primals and duals from the solver. I think the implementation of optimize and/or solve should as simple as possible to start with. As we identify advance stuff, we can add protected methods, keyword arguments, etc. So, for optimize , my suggestion is:

(with @cdiener contribution) optimize(..., reactions=self.reactions) returns all fluxes as default. When reactions=[] returns no fluxes and when reactions=[r1, r2, ...] returns only the subset of fluxes.

I have a bit of mixed feeling here, so I'll share my ideas.

1.Other advanced keyword arguments like shadow_prices=True, reduced_cots=True, irreducibly_inconsistent_constraints=False can be set on the method. The default values would be to return the most complete solution for a regular user.

  1. Advanced implementations can interact with the solver directly to retrieve duals and primals and other information. Here my main concern is concurrent multithread implementations. We would need to be able to lock the model.

  2. This is the most disruptive idea: get rid of model.solve. Then if the user want a flux distribution he has to explicitly use a method to retrieve it: fba(model), pfba(model), moma(model) or my_brilliant_method(model). The implementation and access to the solver would be done through less exposed methods.

I hope that we can use this discussion forum give more stability and coherence to cobrapy implementation.

Cheers,

Joao

the-code-magician avatar Feb 25 '17 18:02 the-code-magician

Okay here are my 2 cents :sweat_smile:

Generally I think there are two topic contained here. One is the API design (1) and the other is the actual source code organisation underneath (2). I tend to think that API > code right now since (2) can be changed at any point without having to worry about versioning, whereas stabilizing the API is somewhat important given that cobrapy is now ~5 years old has a version < 1.0, e. g. no stable API.

Solution

I think we basically agreed that model.solution will be removed. Regarding what should be contained in a solution I'm kind of undecided. I think the idea to get only partial information is merited (I do not find myself checking shadow prices too often), however I would like to keep the API simple as well...

I performance should not be a design argument here. Most of the cobrapy algorithms use the optlang interface and other tricks to get around using the solution object. Also the performance of Solution is tricky to optimize anyways since even with an instantaneous optlang.Model.primal_values we still have to do something like

fluxes = {r.id: primals[r.forward_variable.name] - primal[r.reverse_variable.name] for r in model.reactions}

which in my benchmarks becomes the new bottleneck. So IMHO we should concentrate on the usability here and forget about performance for now.

Model - Reaction interleaving

I think this has already improved quite a lot. If you look at the (private) attributes of the Reaction class now you can see that Reaction no longer carries around attributes that are actually attributes of the solver such as _objective_coefficient, _lower_bound etc. All those are now accessed by methods on Reaction that act on the solver object using the appropriate methods from the solver. The only thing that remains are those methods, which (for me) are more like syntactic sugar to make model configuration easier. Note that all those methods throw the appropriate error when a Reaction is not connected to a model. The alternative would be to have all bounds setting, objective coefficient setting etc. managed by the Model class, which I think would make it less usable. Also there are some difficult cases like changing the reaction stoichiometry where it would be difficult to keep Model and Reactions in sync.

So I think having some methods on Reaction that can only be active when associated with a model is a good price to pay for the much nicer API (look at other packages where bounds have to be passed to the optimization methods and things like that).

Idea 3 from @joaocardoso

I get where you are coming from. The optimize method can be misleading, especially when getting all fluxes since they are pretty arbitrary. Given the centrality of optimize in cobrapy I think removing it would create problems but maybe in a far future it could default to something more sensible (pFBA maybe). The other methods pretty much work like you described:

optimize_minimal_fluxes(model)  # name is probably not optimal
flux_variability_analysis(model)
single_reaction_deletion(model, method="moma")
sample(model, 1000)
# etc.

cdiener avatar Feb 28 '17 17:02 cdiener

I think I'll close this for now since a lot of that has made its way into cobrapy at this point :smile:. For instance, separate solutions, better encapsulation, less connections between object, config singleton, etc. Everybody, feel free to open a discussion for points that have not been addressed.

cdiener avatar Nov 04 '22 18:11 cdiener