alchemical-analysis icon indicating copy to clipboard operation
alchemical-analysis copied to clipboard

General suggestions for code cleanup/improvements

Open halx opened this issue 8 years ago • 33 comments

There is currently a lot of cruft growing in the code which is basically down to design choices.

Here some suggestions: The parsers should be designed in a "polymorphic" fashion such that the main code does not need to know anything about the individual parser codes. This implies that a clear interface and a specification must be designed to define what data is passed in and what is passed out and in what form. It is the responsibility of the parser code to signal back what it can and cannot do. E.g., as it stands now, dhdlt should be None if MD code can't do TI or u_klt should be None if it *BAR analysis can't be done. Energies should be converted to a "canonical" form or it should be signalled back to the main code what the units are. This will relieve the main code maintainer to do code-specific clean-up operations.

Optparse should be replaced with argparse.

The globals need to go as this is really bad programming style. The problem with this is that all these variables are accessible throughout all of the module which mean that they can be modified anywhere. This can create hard to track down bugs (all you need is a second person messing around there...).

halx avatar Dec 13 '15 16:12 halx

One practical thought on this. I think the code should have an understanding of what the various analysis methods "mean". Not sure if there's something in place but e,g,

all_methods = set(['TI','TI-CUBIC','DEXP','IEXP','GINS','GDEL','BAR','UBAR','RBAR','MBAR']) ti_methods = set(['TI','TI-CUBIC']) bar_methods = set(['DEXP','IEXP','GINS','GDEL','BAR','UBAR','RBAR','MBAR'])

other methods if applicable

selected_method = set(P.methods) # or via argparse

if not selected_method & bar_methods: # intersection do_bar = None

if u_klt is None: selected_method = selected_method - bar_methods

etc.

halx avatar Dec 18 '15 18:12 halx

Yes, I agree with all of this, and most especially the issue relating to global variables. Probably we should break up this issue into several separate issues which make specific proposals for code changes that can be handled relatively independently, i.e.:

  • replace optparse with argparse (see also #51)
  • decide canonical form for energies
  • discuss and determine interface for parsers (specify what data goes in and what comes out so that parsers can be independent/polymorphic) and what units are
  • rewrite parsers to conform to interface specifications and use canonical units
  • eliminate global variables
  • make code know what class different methods are in, and use this to (???) (simplify plotting decisions, ...?)

davidlmobley avatar Dec 18 '15 19:12 davidlmobley

Ok, that's certainly a plan but the practical problem is that this affetcs one single file which in turn means conflicts in parallel development. For the near future this will be just as it is but in the long run it may make sense to split alchemical_analysis.py in smaller but logical chunks. In this context I would also suggest to develop the tool into a proper library with a well-defined API. alchemical_analysis.py will then just be the command line frontend. Doing this will also mean that writing to stdout has to be modified e.g. through a logger that explicitly needs to be switched on for the command line tool but would not output anything by default. There may be other possibilities.

On 18 December 2015 at 19:09, David L. Mobley [email protected] wrote:

Yes, I agree with all of this, and most especially the issue relating to global variables. Probably we should break up this issue into several separate issues which make specific proposals for code changes that can be handled relatively independently, i.e.:

  • replace optparse with argparse
  • decide canonical form for energies
  • discuss and determine interface for parsers (specify what data goes in and what comes out so that parsers can be independent/polymorphic) and what units are
  • rewrite parsers to conform to interface specifications and use canonical units
  • eliminate global variables
  • make code know what class different methods are in, and use this to (???) (simplify plotting decisions, ...?)

— Reply to this email directly or view it on GitHub https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-165873053 .

halx avatar Jan 04 '16 09:01 halx

Ok, I have created a branch 'refactor'. Done so far.

  • dynamic loading of parsers: need to discuss the interface
  • replaced optparse with argparse
  • eliminated globals in main()

halx avatar Jan 04 '16 12:01 halx

OK. It may take me a couple of days to get to review this one. What needs to be discussed regarding parsers?

Thanks.

On Mon, Jan 4, 2016 at 4:00 AM, Hannes Loeffler [email protected] wrote:

Ok, I have created a branch 'refactor'. Done so far.

  • dynamic loading of parsers: need to discuss the interface
  • replaced optparse with argparse
  • eliminated globals in main()

— Reply to this email directly or view it on GitHub https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-168659004 .

David Mobley [email protected] 949-385-2436

davidlmobley avatar Jan 04 '16 16:01 davidlmobley

The interface i.e. inputs and outputs. At the moment P is passed in which is, frankly, a rather "heavy" data structure. It also seems to be (ab?)used to also contain additional data items beyond what arparse puts in there. The current code heavily relies on this data structure to be present globally so it is in essence another global. I would tend to only pass around data that is really needed. Further cleanup of the code and separation into smaller units will make this obvious.

On the output side the code is mostly where it should be. But we will need a mechanism to report back the parsers "capabilities". Maybe a separate data structure that is compared to other data times. e.g. P.methods to determine what analysis can and should be done.

Overall, I would think this is not high priority as it will become clearer during refactoring what needs to be done. The new code will need heavy testing before a merge and release. I tend to only test the AMBER side because I am most familiar with it but I will have a look through the example date as well to see if I haven't broken the code in an obvious way.

BTW, what is the need to take extras steps for the Desmond parser? Looking into uncorrelate() it seems there is effectively only one line of code different. Can't this be done by the parser itself?

halx avatar Jan 04 '16 18:01 halx

I think I have an idea how units should be handled. Essentially, the parsers fill up dhdlt and u_klt and report back what units they are in. The parsers should do no further conversion but will need to also report the temperature to compute beta. All conversion are then done in the main code as required and requested by the user

halx avatar Jan 07 '16 16:01 halx

That sounds reasonable. But I should investigate a bit on HOW the units should be reported back. For example, there is a units module used in openmm and some other places which actually can make objects CARRY units in a pretty convenient way. But I need to investigate whether having these units attached to return data would slow things down or not.

Otherwise I agree with you though.

On Thu, Jan 7, 2016 at 8:01 AM, Hannes Loeffler [email protected] wrote:

I think I have an idea how units should be handled. Essentially, the parsers fill up dhdlt and u_klt and report back what units they are in. The parsers should do no further conversion but will need to also report the temperature to compute beta. All conversion are then done in the main code as required and requested by the user

— Reply to this email directly or view it on GitHub https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-169707154 .

David Mobley [email protected] 949-385-2436

davidlmobley avatar Jan 08 '16 01:01 davidlmobley

Following the duscussion on https://github.com/pandegroup/openmm/issues/680 it doesn't look to me that the current maintainer is much interested in making this module available in a sensible way (i.e. standalone). Pint has been suggested in the thread. I'll take a look at that one and see if it is interesting enough.

On 8 January 2016 at 01:29, David L. Mobley [email protected] wrote:

That sounds reasonable. But I should investigate a bit on HOW the units should be reported back. For example, there is a units module used in openmm and some other places which actually can make objects CARRY units in a pretty convenient way. But I need to investigate whether having these units attached to return data would slow things down or not.

Otherwise I agree with you though.

On Thu, Jan 7, 2016 at 8:01 AM, Hannes Loeffler [email protected] wrote:

I think I have an idea how units should be handled. Essentially, the parsers fill up dhdlt and u_klt and report back what units they are in. The parsers should do no further conversion but will need to also report the temperature to compute beta. All conversion are then done in the main code as required and requested by the user

— Reply to this email directly or view it on GitHub < https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-169707154

.

David Mobley [email protected] 949-385-2436

— Reply to this email directly or view it on GitHub https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-169861131 .

halx avatar Jan 08 '16 10:01 halx

I had a quick look into pint and it looks workable. Possible downsides: this approach means replacing all 'raw' values with unit objects. This could mean more memory and also being slower. By how much I don't know but memory was a concern of Pavel's.

On 8 January 2016 at 10:15, Hannes Loeffler [email protected] wrote:

Following the duscussion on https://github.com/pandegroup/openmm/issues/680 it doesn't look to me that the current maintainer is much interested in making this module available in a sensible way (i.e. standalone). Pint has been suggested in the thread. I'll take a look at that one and see if it is interesting enough.

On 8 January 2016 at 01:29, David L. Mobley [email protected] wrote:

That sounds reasonable. But I should investigate a bit on HOW the units should be reported back. For example, there is a units module used in openmm and some other places which actually can make objects CARRY units in a pretty convenient way. But I need to investigate whether having these units attached to return data would slow things down or not.

Otherwise I agree with you though.

On Thu, Jan 7, 2016 at 8:01 AM, Hannes Loeffler <[email protected]

wrote:

I think I have an idea how units should be handled. Essentially, the parsers fill up dhdlt and u_klt and report back what units they are in. The parsers should do no further conversion but will need to also report the temperature to compute beta. All conversion are then done in the main code as required and requested by the user

— Reply to this email directly or view it on GitHub < https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-169707154

.

David Mobley [email protected] 949-385-2436

— Reply to this email directly or view it on GitHub https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-169861131 .

halx avatar Jan 08 '16 10:01 halx

OK, thanks. I did put a follow-up question in that thread. I've used simtk.unit relatively recently but it sounds like that's a bad idea (I don't want to make this require simtk), so looking into pint probably makes sense.

On Fri, Jan 8, 2016 at 2:15 AM, Hannes Loeffler [email protected] wrote:

Following the duscussion on https://github.com/pandegroup/openmm/issues/680 it doesn't look to me that the current maintainer is much interested in making this module available in a sensible way (i.e. standalone). Pint has been suggested in the thread. I'll take a look at that one and see if it is interesting enough.

On 8 January 2016 at 01:29, David L. Mobley [email protected] wrote:

That sounds reasonable. But I should investigate a bit on HOW the units should be reported back. For example, there is a units module used in openmm and some other places which actually can make objects CARRY units in a pretty convenient way. But I need to investigate whether having these units attached to return data would slow things down or not.

Otherwise I agree with you though.

On Thu, Jan 7, 2016 at 8:01 AM, Hannes Loeffler < [email protected]> wrote:

I think I have an idea how units should be handled. Essentially, the parsers fill up dhdlt and u_klt and report back what units they are in. The parsers should do no further conversion but will need to also report the temperature to compute beta. All conversion are then done in the main code as required and requested by the user

— Reply to this email directly or view it on GitHub <

https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-169707154

.

David Mobley [email protected] 949-385-2436

— Reply to this email directly or view it on GitHub < https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-169861131

.

— Reply to this email directly or view it on GitHub https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-169954135 .

David Mobley [email protected] 949-385-2436

davidlmobley avatar Jan 08 '16 18:01 davidlmobley

I have addressed most of these except for th unit problem. The harder part will now be to separate the current functions into more useful chunks (probably by means of analysis method and separte plotting from non-plotting functionality) and a library.

I still not convinced that the tool needs a unit managment via a specialised module. Do you expect future functionality that is more than energy conversion at one single point in the code?

halx avatar Jan 13 '16 15:01 halx

I have addressed most of these except for th unit problem. The harder part will now be to separate the current functions into more useful chunks (probably by means of analysis method and separte plotting from non-plotting functionality) and a library.

Great. Splitting in the way you describe also makes sense, I think.

I still not convinced that the tool needs a unit managment via a specialised module. Do you expect future functionality that is more than energy conversion at one single point in the code?

I am not totally convinced that it does either, though I think in general I'll probably try to be including these from scratch when I write new code that will be dealing with multiple sets of units - simply because having everything carry units makes it SO MUCH easier to make sure you have appropriate units when/where needed.

davidlmobley avatar Jan 13 '16 16:01 davidlmobley

My main worry was that this wouldn't work well with the numpy arrays that we use now. A numpy array can only contain certain data types (a "raw" number) and wrapping an array would cost some time and effort. Also, that would have multiplied memory storage even if it was possible.

I should have taken a closer look though and read http://pint.readthedocs.org/en/0.6/numpy.html . So you can actually wrap a numpy array together with a unit and do all the stuff you can do with with scalars! Ok, maybe you cannot do any transformation of an array you can do with numpy but simple unit conversion is possible. So after all that doesn't look so bad.

On 13 January 2016 at 16:59, David L. Mobley [email protected] wrote:

I have addressed most of these except for th unit problem. The harder part will now be to separate the current functions into more useful chunks (probably by means of analysis method and separte plotting from non-plotting functionality) and a library.

Great. Splitting in the way you describe also makes sense, I think.

I still not convinced that the tool needs a unit managment via a specialised module. Do you expect future functionality that is more than energy conversion at one single point in the code?

I am not totally convinced that it does either, though I think in general I'll probably try to be including these from scratch when I write new code that will be dealing with multiple sets of units - simply because having everything carry units makes it SO MUCH easier to make sure you have appropriate units when/where needed.

— Reply to this email directly or view it on GitHub https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-171363987 .

halx avatar Jan 13 '16 19:01 halx

Ah, that's a good point. Pint does indeed look promising.

On Wed, Jan 13, 2016 at 11:32 AM, Hannes Loeffler [email protected] wrote:

My main worry was that this wouldn't work well with the numpy arrays that we use now. A numpy array can only contain certain data types (a "raw" number) and wrapping an array would cost some time and effort. Also, that would have multiplied memory storage even if it was possible.

I should have taken a closer look though and read http://pint.readthedocs.org/en/0.6/numpy.html . So you can actually wrap a numpy array together with a unit and do all the stuff you can do with with scalars! Ok, maybe you cannot do any transformation of an array you can do with numpy but simple unit conversion is possible. So after all that doesn't look so bad.

On 13 January 2016 at 16:59, David L. Mobley [email protected] wrote:

I have addressed most of these except for th unit problem. The harder part will now be to separate the current functions into more useful chunks (probably by means of analysis method and separte plotting from non-plotting functionality) and a library.

Great. Splitting in the way you describe also makes sense, I think.

I still not convinced that the tool needs a unit managment via a specialised module. Do you expect future functionality that is more than energy conversion at one single point in the code?

I am not totally convinced that it does either, though I think in general I'll probably try to be including these from scratch when I write new code that will be dealing with multiple sets of units - simply because having everything carry units makes it SO MUCH easier to make sure you have appropriate units when/where needed.

— Reply to this email directly or view it on GitHub < https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-171363987

.

— Reply to this email directly or view it on GitHub https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-171408097 .

David Mobley [email protected] 949-385-2436

davidlmobley avatar Jan 15 '16 22:01 davidlmobley

FWIW, seconding all the the points raised in this discussion. I think it's a huge boon if tools written in Python can actually be used directly from a Python session, but as currently written the core component of this module is only really usable as a command-line script. What's more, the code is extremely hard to follow, with e.g. functions that operate on names defined outside of their scope (alchemical_analysis.uncorrelate is one example).

Given this, I'm currently ripping out the things I need from it manually to get into a form I can actually use. I'll push the result to GitHub and perhaps that can help to move this issue along. I think this library addresses an important void, but could use some heavy refactoring.

dotsdl avatar Jun 15 '16 20:06 dotsdl

Hi @dotsdl, Before you go and refactor the code, there is currently a PR (#68) where @halx has done lots of work in refactoring the code. Nobody has had the chance to review it yet and thus hasn't been merged yet

nathanmlim avatar Jun 15 '16 20:06 nathanmlim

@dotsdl - I'm very on board with major changes; basically there's a ton that needs to be done along those lines and unfortunately I haven't had anyone in the group who is actively doing organization/improvements/maintenance right now (though @limn1 will probably be getting involved). @halx has done a lot that we didn't get to review/discuss yet. Perhaps a call would be warranted?

Nobody has had the chance to review it yet and thus hasn't been merged yet

Elaborating - I've had a bit of a look but we haven't really tested much yet, and it raises larger architecture issues/refactoring issues that I haven't been able to think through.

Maybe we should do a separate issue to think through the best ways to do this. Or we could schedule a call. Interested in being involved? @halx , are you still thinking about this and interested in being involved?

My group doesn't have enough real "code design" experience, I think, yet, so I need to involve outside expertise. The first version of this that we "released" was way too much done just by one student who didn't have the needed expertise.

@mrshirts has also mentioned that he's interested in being more involved going forward.

davidlmobley avatar Jun 15 '16 21:06 davidlmobley

@dotsdl - do please give us what you end up with regardless, as even if you're not involved in the larger "where do we go from here" discussion, at least seeing what you do/having your summary of it will help us figure out how to do this best.

Thanks.

On Wed, Jun 15, 2016 at 1:57 PM, Nathan Lim [email protected] wrote:

Hi @dotsdl https://github.com/dotsdl, Before you go and refactor the code, there is currently a PR (#68 https://github.com/MobleyLab/alchemical-analysis/pull/68) where @halx https://github.com/halx has done lots of work in refactoring the code. Nobody has had the chance to review it yet and thus hasn't been merged yet

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/MobleyLab/alchemical-analysis/issues/48#issuecomment-226317770, or mute the thread https://github.com/notifications/unsubscribe/AGzUYb2X-5sFETJbX73C9yq49Ped0PlFks5qMGc7gaJpZM4G0Vg5 .

David Mobley [email protected] 949-385-2436

davidlmobley avatar Jun 15 '16 21:06 davidlmobley

Hi, all- I'm definitely happy to help review code, especially at the physical correctness and functionality level. I'm not as experienced in larger architecture levels.

I still not convinced that the tool needs a unit managment via a specialised module. Do you expect future functionality that is more than energy conversion at one single point in the code?

I don't think a general unit manager will be needed. There is a very low probability of ever needing anything other than energy units of either kJ/mol, kcal/mol, or kT. Something specialized for this purpose should be fine.

mrshirts avatar Jun 15 '16 21:06 mrshirts

@davidlmobley I'm happy to be involved in this process. I think converging on a common codebase for doing low-level things like parsing output files from different MD engines and extracting the correct quantities for FEP/TI is important for progress in the larger field, since these should be entirely solvable problems. For my needs at the moment I only need to extract gromacs outputs, so I'll be focused on that.

I do have more of a software engineering focus, so I'm happy to help in terms of library architecture. As I said, I'll push the library that results from my work to a repo on GitHub and we'll see if a path forward emerges from it. We'll also then have something concrete to iterate on. That work will definitely require review from @davidlmobley and @mrshirts for physical correctness, since I'm less well-versed in caveats/pitfalls of alchemical free energy calculations.

dotsdl avatar Jun 15 '16 22:06 dotsdl

Yes, I have started with some light-weight changes in the code and I would think we should use this as a basis for discussion.

halx avatar Jun 16 '16 04:06 halx

@dotsdl , @halx - I've had this on hold for a while because of a super crazy summer (including a new baby, a visitor on sabbatical, etc.) and current grant proposal deadlines. But I'll be able to deal a bit with it again after the first week in October. Could we talk around that time to develop a better plan for the future, perhaps also with @mrshirts ?

davidlmobley avatar Sep 19 '16 22:09 davidlmobley

@davidlmobley absolutely. I'm in the midst of writing my dissertation (defending Nov. 4), so of limited use on development fronts. But been working on a library called tentatively alchemlyb that contains the machinery I used to do data munging in Python from GROMACS output and do MBAR and TI with what appear to be best-practices.

I pushed this library to GitHub today, but the API is far from stable, as it's mostly things I built over the course of the past couple of months. Starting November I and two others in our lab will be employed to get this library in a form that is easy for everyone to use, complete with tests, docs, etc., as we have a need for a central library of solid implementations for the different alchemistry projects we are doing ourselves. We will definitely be interested in both you and @mrshirts input to ensure our implementations are scientifically and statistically sound.

We hope to keep the library simple and straightforward enough that it can be used both interactively and as part of larger postprocessing pipelines. The goal is simplicity and clarity. Should we set up a time to discuss general needs more directly?

dotsdl avatar Sep 20 '16 03:09 dotsdl

@dotsdl - this actually sounds really interesting. I'd been thinking to some extent that what this really needs is to be re-framed into a library format rather than a stand-alone tool which attempts to do everything and keeps having to get extended to fit with different codes. There might then still be a front end which would interact with different tools...

Maybe move this to e-mail so we can schedule, looping in @mrshirts and @halx ?

davidlmobley avatar Sep 20 '16 04:09 davidlmobley

@davidlmobley happy to! You can reach me at [email protected]. Feel free to shoot me a message and we'll get rolling.

dotsdl avatar Sep 20 '16 04:09 dotsdl

Yes, let's got with email to arrange a discussion in October.

halx avatar Sep 20 '16 07:09 halx

Looks like I haven't paid much attention to this recently... Where are things standing at the moment?

halx avatar Dec 14 '16 11:12 halx

@dotsdl - any updates? Should we talk again? I'm still much more enthusiastic about the library approach you were talking about working on, but I haven't seen much action there, and obviously we'll need to do something fairly soon.

davidlmobley avatar Dec 14 '16 16:12 davidlmobley

@davidlmobley I've been working on a proof-of-concept set of components and structure for alchemlyb this past week, and I'm hoping to be able to put both the API proposal that it demonstrates as well as a gist showing how it works in practice during the coming week. Speaking to @orbeckst, we also wanted to draft a white paper that clearly states the problem, aims, and our solution once consensus on the proposal is reached on the issue tracker.

I'll post the issue today and follow it up with the detailed proposal + gist in the coming days. Apologies for the delay.

dotsdl avatar Dec 14 '16 17:12 dotsdl