cobrapy icon indicating copy to clipboard operation
cobrapy copied to clipboard

Elementary Flux Modes and Minimum Cut Sets

Open pstjohn opened this issue 8 years ago • 29 comments

So I'll use this issue to kick off the discussion on EFMs and MCSs

The two methods I've had the most luck with in calculating EFMs and MCSs are efmtool, doi:10.1093/bioinformatics/btn401, and mhsCalculator, doi:10.1186/1471-2105-14-318.

efmtool is distributed under the Simplified BSD Style License in two formats, a compiled matlab extension (which includes java .jar files which can be run from the command line), and the java source code. It requires java 1.7. Also worth noting are the more recent modifications of efmtool by Christian Jungreuthmayer, including tEFMA and [regEfmtool)[http://www.biotec.boku.ac.at/en/arbeitsgruppenresearch-groups/research-group-mattanovich-gasser-sauer/staff/associated-group-metabolic-modelling/regefmtool/].

Is anyone on the cobrapy team familiar at all with java? (I'm not 😄 ) My current solution is a fairly quick and dirty wrapper that writes input files from a cobrapy model to a temporary directory, calls the compiled java code, and reads out the output files into a pandas dataframe. Its reliable enough for what I've used it for, but I'm not sure how that would best fit into the general cobrapy ecosystem.

mhsCalculator is a set of C codes which function similarly to efmtool, processing input reaction + stoichiometry files and outputting the minimum cut sets. This is also under the Simplified BSD License.

The wrapper functions seem to work fine, but I think the issues will come down to either providing install directions for the third-party tools or appropriately packaging the different code libraries. (And whether or not this is worth including in the core library or as a sub-library)

pstjohn avatar Oct 16 '16 23:10 pstjohn

My $.02 as a former developer is that it's much easier to have c dependencies.

You can make an extension in cython and link directly against the C API, and bundle the whole thing as an independent wheel with no additional dependencies.

On Oct 16, 2016 4:39 PM, "Peter St. John" [email protected] wrote:

So I'll use this issue to kick off the discussion on EFMs and MCSs

The two methods I've had the most luck with in calculating EFMs and MCSs are efmtool http://www.csb.ethz.ch/tools/software/efmtool.html, doi:10.1093/bioinformatics/btn401 http://dx.doi.org/10.1093/bioinformatics/btn401, and mhsCalculator http://www.biotec.boku.ac.at/en/arbeitsgruppenresearch-groups/research-group-mattanovich-gasser-sauer/staff/associated-group-metabolic-modelling/mhscalculator/, doi:10.1186/1471-2105-14-318 http://www.biomedcentral.com/1471-2105/14/318.

efmtool is distributed under the Simplified BSD Style License in two formats, a compiled matlab extension (which includes java .jar files which can be run from the command line), and the java source code. It requires java 1.7. Also worth noting are the more recent modifications of efmtool by Christian Jungreuthmayer, including tEFMA https://github.com/mpgerstl/tEFMA and [regEfmtool)[http://www. biotec.boku.ac.at/en/arbeitsgruppenresearch-groups/ research-group-mattanovich-gasser-sauer/staff/associated- group-metabolic-modelling/regefmtool/].

Is anyone on the cobrapy team familiar at all with java? (I'm not 😄 ) My current solution is a fairly quick and dirty wrapper that writes input files from a cobrapy model to a temporary directory, calls the compiled java code, and reads out the output files into a pandas dataframe. Its reliable enough for what I've used it for, but I'm not sure how that would best fit into the general cobrapy ecosystem.

mhsCalculator is a set of C codes which function similarly to efmtool, processing input reaction + stoichiometry files and outputting the minimum cut sets. This is also under the Simplified BSD License.

The wrapper functions seem to work fine, but I think the issues will come down to either providing install directions for the third-party tools or appropriately packaging the different code libraries. (And whether or not this is worth including in the core library or as a sub-library)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/opencobra/cobrapy/issues/293, or mute the thread https://github.com/notifications/unsubscribe-auth/ABUPSIEJ8ZZ6mmhLdtE2vK9mreXERg2Yks5q0rXHgaJpZM4KYH6S .

aebrahim avatar Oct 17 '16 00:10 aebrahim

That was my thought as well. Sadly though it seems like efmtool and its various offshoots are all written in java. To do the minimum cut set calculation, you'll need to start with the calculated EFMs, so just including mhsCalculator alone doesn't add much functionality.

pstjohn avatar Oct 17 '16 01:10 pstjohn

To get extreme rays of polytopes there is also the classic ccd which has python bindings. In C there is also lrs. Both use the double description method as efmtool but do not use the bit vector optimization. When I tried the command line versions ccd+ was actually faster than lrs for me. However, both take a long time even for small models (60 reversible reactions, 80.000 basis vectors, 1-2 hours)...

Many of those packages are compatible with Fukuda's original input format (no idea about efmtool), so we could also just write a function that outputs the problem definition and the user just uses whatever he likes...

cdiener avatar Oct 17 '16 18:10 cdiener

Another way that efmtool seems to speed up the calculation is by a reversible compression of the metabolic network to the simplest possible representation. That might also have utility in other cobrapy functions, so that could be something worth looking into implementing

pstjohn avatar Oct 17 '16 18:10 pstjohn

~~Does that mean deleting redundancies (linearly dependent rows) in the H-representation? I think all of the methods do/recommend that.~~ Nevermind, just skimmed the paper :) Something we could also think about is to use flux sampling and than calculate a convex hull for the sample which gives you an approximate basis. Scipy already has a convex hull function so we would not need additional dependencies. However, I have no idea how good this basis would be.

cdiener avatar Oct 17 '16 18:10 cdiener

Having used scipy's convex hull function, I'd probably avoid that method 😄 . It definitely struggles with more than just a few dimensions. There's also a lot of research in this area, I'd shy away from trying to come up with a new method. ccd and lrs seem promising, I'll look into them more.

I do think the network expansion / compression would be an important tool, I might create a new issue to ask about it.

I'll use this issue to keep track of some alternative algorithms and implementations:

  • Metatool & elmo-comp, for some reason I remember seeing this before and not wanting to use it...
  • A depth-first search algorithm to compute elementary flux modes by linear programming 10.1186/s12918-014-0094-2
  • TreeEFM: calculating elementary flux modes using linear optimization in a tree-based algorithm 10.1093/bioinformatics/btu733

pstjohn avatar Oct 18 '16 18:10 pstjohn

Yeah I have encountered the same problem with convex hulls. Sometimes reducing dimensions by SVD and reprojecting after calculating the hull helps. Either way I would not really implement it explicitly since it's just plugging the samples into the convex hull function of scipy. Maybe as an example in the docs...

cdiener avatar Oct 18 '16 18:10 cdiener

So one option here could be a docker container.. that would obviously require a separate package, but it might be a relatively easy way of shipping a java 7 runtime for EFMtool and a compiled MCScalculator package. Would anyone find that useful? Or is the docker dependency a nonstarter?

pstjohn avatar Oct 30 '16 21:10 pstjohn

Ok, we already have functions for calculating shortest EFMs and cut sets here https://github.com/biosustain/cameo/blob/devel/cameo/flux_analysis/structural.py that do not have any special dependencies. Two reasons why I think this should go in a separate package: (1) Enumerating all EFMs has no practical value for genome-scale models and I think cobrapy is mainly intended for genome-scale model. Furthermore, everything that has dependencies that go beyond C extensions should not go into cobrapy. Why not make a separate package?

phantomas1234 avatar Nov 02 '16 08:11 phantomas1234

Do you happen to have a good reference for the shortest EFMs and MCSs? It would be cool to compare and see whats left out for core-carbon models.

pstjohn avatar Nov 02 '16 14:11 pstjohn

@pstjohn http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003378

phantomas1234 avatar Nov 08 '16 08:11 phantomas1234

Cool, I had actually just read through that. Definitely seems like a promising approach, even for smaller models. We should do another cobrapy hangout to talk about the optlang interface, it would help me figure out what you guys were doing in cameo/cameo/flux_analysis/structural.py

pstjohn avatar Nov 08 '16 14:11 pstjohn

The paper on shortest EFMs is available here. The initial code was written in Java and needs rewriting. I'm happy to help.

Newer methods to calculate EFMs in genome-scale networks involved sampling of EFM either using genetic algorithms or a tree like search, e.g. CASOP-GS or TreeEFM. In order to use them in genome-scale networks one can formulate the EFM calculation problem as an LP.

I am being a bit bias here, but I do see value in enumerating not all but a subset of EFMs for genome-scale models, for example, for data integration. I am sure there are other areas where the enumeration of EFMs can be applied to.

LuisFF avatar Nov 09 '16 00:11 LuisFF

Hey Luis (what a nice surprise!), sorry for not providing the original publication earlier. We actually have a shortest EFM implementation here in python https://github.com/biosustain/cameo/blob/devel/cameo/flux_analysis/structural.py.

phantomas1234 avatar Nov 09 '16 09:11 phantomas1234

Hey Luis, thanks for commenting!

I looked into TreeEFM, seemed like a neat tool! Its written in C++ too, so that could be fairly easy to incorporate. Do you know if the source code might be available? -- the original publication only distributes binaries.

@phantomas1234, I had a tough time figuring out how you implemented the shortest EFM algorithm, or how easy it would be to port to the optlang interface. Do you know how much of a speed detriment there is for the python over a java/c++ approach? I know the original algorithm made pretty heavy use of CPLEX's populate, does your version work with other solvers too? (and if so, do we loose the benefit of populate?)

pstjohn avatar Nov 09 '16 17:11 pstjohn

@phantomas1234 Nice implementation of the K-shortest! Another approach we explored was to use a more restrictive exclusion_constraint by bounding it by len(exclusion_list) - w, where w can be any integer higher than 1. The higher w is the more distinct (containing less reactions in common) the newly calculated EFMs are.

@pstjohn Fell free to contact Francis Planes at CEIT, I am sure he will be happy to collaborate.

LuisFF avatar Nov 09 '16 22:11 LuisFF

The k-shortest variations seem very cool. That is definitely something we should consider!

cdiener avatar Nov 11 '16 18:11 cdiener

@pstjohn The shortest EFM implementation in cameo is already fully compatible with optlang and can be used with any MILP solver. In addition it implements cplex-specific subroutines that make use of the populate function to improve performance. This performance gain will indeed be lost with solvers that don't have a feature similar to populate. A very large majority of the running time of shortest EFM is spent solving MILP's (optimized solver functions in C) so I don't think there will be a noticeable difference between a python and java/c++ implementation.

KristianJensen avatar Nov 14 '16 20:11 KristianJensen

@pstjohn can we close this leaving EFM to the cameo implementation / other packages? Or was there still something that you think should go in cobrapy?

hredestig avatar Dec 19 '16 10:12 hredestig

I still think it would be great if we could port it to cobrapy eventually, its a feature I certainly would use frequently.

Maybe not as urgent as the 0.6.0 merge, probably better to let the optlang interface in cobrapy stabilize a bit.

pstjohn avatar Dec 19 '16 15:12 pstjohn

I would support moving the shortest EFM/minimal cut sets to cobrapy eventually.

phantomas1234 avatar Dec 22 '16 10:12 phantomas1234

It would also be awesome to add elementary flux vectors at some point as well, in particular to be able to integrate them with arbitrary optlang constraints: S. Klamt et al., PLoS Comput Biol. 13, e1005409–22 (2017)

pstjohn avatar May 05 '17 15:05 pstjohn

It would be great if we could get people from the EFM crowd involved in cobrapy development.

phantomas1234 avatar May 19 '17 08:05 phantomas1234

Agreed! Although for EFVs specifically, its actually fairly easy to implement: I have it more or less working in my pyefm package. For cobrapy, we would probably want to do the k-shortest method... but we also need to implement some sort of network compression / decompression as well.

pstjohn avatar May 19 '17 12:05 pstjohn

So for the k-shortest method, the original authors use integer fluxes only and mention that the method is more expensive using non-integer fluxes.

Do we have an opinion as to whether or not we want the fluxes to be allowed to take non-integer values? I don't have an immediate insight into whether or not that makes EFV or MCS calculation harder or easier..

pstjohn avatar Oct 18 '17 13:10 pstjohn

Will depend on #612. WIP at feat/elementary_flux_modes

pstjohn avatar Oct 18 '17 18:10 pstjohn

I wonder though if EFMs and related plus the tooling surrounding it would be a good target for a small cobrapy dependent (sub-) package?

Midnighter avatar Oct 19 '17 11:10 Midnighter

There's already pstjohn/pyefm :). But i think if we had a k-shortest implementation using pure optlang it would fit nicely in the existing package.

Certainly open to discussion though

pstjohn avatar Oct 19 '17 11:10 pstjohn

@pstjohn Requiring fluxes to be integers only works when all the coefficients of the S-matrix are integers. Since this is usually not the case for genome-scale models I think it makes most sense to allow non-integer fluxes by default. If it is significantly faster to have a pure integer problem, then I guess it would be fairly easy to add a switch argument to the function/object or automatically check if the S-matrix is integer

KristianJensen avatar Oct 19 '17 19:10 KristianJensen

Closing because that discussion is 5 years old. Feel free to open a thread in the Discussion section. Either way it seems that any solution (part of the core, or separate packages) seem to have support, though it's more a lack of contributors willing to implement it.

cdiener avatar Nov 04 '22 19:11 cdiener