openff-toolkit icon indicating copy to clipboard operation
openff-toolkit copied to clipboard

GLOBAL_TOOLKIT_REGISTRY does not behave as anticipated

Open j-wags opened this issue 5 years ago • 19 comments

Reported by @SimonBoothroyd

Describe the bug Changing GLOBAL_TOOLKIT_REGISTRY should allow the user to change the the cheminformatics toolkit that's used on the backend. However, this doesn't seem to work

To Reproduce

from openforcefield.topology import Molecule
from openforcefield.utils.toolkits import ToolkitRegistry, RDKitToolkitWrapper, GLOBAL_TOOLKIT_REGISTRY
mol = Molecule.from_smiles('CC')
mol.generate_conformers()
print(mol.conformers[0][0])

[ 0.81511319 -0.53833634 0.49281269] A

These are the coordinates of the first atom in the first conformer when OpenEye generates it

mol.generate_conformers(toolkit_registry=RDKitToolkitWrapper())
print(mol.conformers[0][0])

[-0.74552317 0.04144446 0.01170607] A

These are the coordinates of the first atom of the first conformer when RDKit makes the conformer

GLOBAL_TOOLKIT_REGISTRY = ToolkitRegistry(toolkit_precedence=[RDKitToolkitWrapper])
#import openforcefield.topology.molecule
print(GLOBAL_TOOLKIT_REGISTRY.registered_toolkits)

[<openforcefield.utils.toolkits.RDKitToolkitWrapper object at 0x118bc65f8>]

I am able to override the imported GLOBAL_TOOLKIT_REGISTRY to only have RDKit

mol.generate_conformers()
print(mol.conformers[0][0])

[ 0.81511319 -0.53833634 0.49281269] A

But this function still generates the OpenEye conformer

Additional information

The GLOBAL_TOOLKIT_REGISTRY can be imported from a number of modules -- for example:

  • openforcefield.topology.molecule.GLOBAL_TOOLKIT_REGISTRY
  • openforcefield.utils.GLOBAL_TOOLKIT_REGISTRY
  • openforcefield.topology.molecule.openforcefield.utils.toolkits.GLOBAL_TOOLKIT_REGISTRY

In my testing, changing any one of these doesn't appear to affect the others.

I'm concerned that what's really happening here is that a new GLOBAL_TOOLKIT_REGISTRY is being imported for each module. I haven't worked with this pattern of globally-shared objects before, so I'm not sure how to proceed.

@jaimergp suggested using the context manager approach, where we could use something like with using_toolkit(RDKit): ... to define blocks where certain toolkits or registries are used. This seems much smoother than tweaking a "GLOBAL" variable every time we want different defaults. I'm not sure how we'll choose to solve this issue, but this may be a convenient feature to add/test if this issue initiates a larger refactor.

j-wags avatar Jan 21 '20 21:01 j-wags

I think the first question to address is whether GLOBAL_TOOLKIT_REGISTRY is actually a singleton or not. The observed behavior seems to indicate that it might not be. Whether we use a context manager or not, we still need a global singleton to handle this (there might be other approaches with less issues, but I can't think of one now).

jaimergp avatar Jan 21 '20 21:01 jaimergp

There are some singleton implementations described here (you have to scroll down for the elegant/simple ones). This could be a solution, but we should have somebody with more experience comment on this.

jaimergp avatar Jan 21 '20 21:01 jaimergp

GLOBAL_TOOLKIT_REGISTRY = ToolkitRegistry(toolkit_precedence=[RDKitToolkitWrapper])

That's just defining a local variable, right?

One approach could be to add more methods to ToolkitRegistry to make it easier to achieve the desired behavior, following https://github.com/openforcefield/openforcefield/issues/468

jchodera avatar Jan 22 '20 02:01 jchodera

That's just defining a local variable, right?

That line in particular is just defining a local variable. I also tried setting the value of GLOBAL_TOOLKIT_REGISTRY down several import paths, to see if we could advise people on the right one to override until we fix this. I decided to open this issue when I tried every conceivable way to find the "real" Registry that was used for molecule.generate_conformers(), and I got all the way to

openforcefield.topology.molecule.openforcefield.utils.toolkits.GLOBAL_TOOLKIT_REGISTRY = ToolkitRegistry(toolkit_precedence=[RDKitToolkitWrapper])

which still didn't work. So, I think we need Jaime's suggested singleton approach.

j-wags avatar Jan 22 '20 15:01 j-wags

I'll point out that my example in https://github.com/openforcefield/openforcefield/issues/468 actually works! It would just need some way for the user to programmatically manipulate the order to be useful.

That said, I think the singleton approach is also a great idea! They're not mutually exclusive.

jchodera avatar Jan 22 '20 21:01 jchodera

It would be useful to update this unit test (and the typos in the comments 😅) in the process of tackling this issue: https://github.com/openforcefield/openforcefield/blob/8149b36e73b57cab83c958b1da4ca6d775093c34/openforcefield/tests/test_toolkits.py#L2456-L2473

mattwthompson avatar Jul 08 '20 16:07 mattwthompson

Apologies for the wall of text, I just got really interested in this for some reason - the short of it is that I think the original issue is resolved, but there are reasons to consider a refactor.

Some of this behavior is fixed now that GLOBAL_TOOLKIT_REGISTRY can be modified. Below is run with all toolkits available and stock behavior


In [1]: from openforcefield.topology import Molecule


In [2]: from openforcefield.utils.toolkits import *

In [3]: mol = Molecule.from_smiles('CC')
   ...: mol.generate_conformers()
   ...: print(mol.conformers[0][0])
[ 0.81511319 -0.53833634  0.49281269] A

In [4]: mol.generate_conformers(toolkit_registry=OpenEyeToolkitWrapper())
   ...: print(mol.conformers[0][0])
[ 0.81511319 -0.53833634  0.49281269] A  # Confirms default behavior uses OpenEye

In [5]: mol.generate_conformers(toolkit_registry=RDKitToolkitWrapper())
   ...: print(mol.conformers[0][0])
[-0.74552317  0.04144446  0.01170607] A

In [6]: GLOBAL_TOOLKIT_REGISTRY.deregister_toolkit(OpenEyeToolkitWrapper())

In [7]: mol.generate_conformers()
   ...: print(mol.conformers[0][0])
[-0.74552317  0.04144446  0.01170607] A  # Matches RDKit values

In [8]: from openforcefield.utils.toolkits import GLOBAL_TOOLKIT_REGISTRY

In [9]: mol.generate_conformers()
   ...: print(mol.conformers[0][0])
[-0.74552317  0.04144446  0.01170607] A  # Still matches RDKit values, even after re-importing

In [10]: GLOBAL_TOOLKIT_REGISTRY.registered_toolkits
Out[10]:
[ToolkitWrapper around The RDKit version 2020.03.4,
 ToolkitWrapper around AmberTools version 20.0,
 ToolkitWrapper around Built-in Toolkit version None]  # Still has OpenEye de-registered

This seems to satisfy one of the requirements above

Changing GLOBAL_TOOLKIT_REGISTRY should allow the user to change the the cheminformatics toolkit that's used on the backend.

I'm not great with global variables in Python (since it's something that should typically be avoided), but whatever is happening to that variable seems to persist whether or not you try and import it again. Its hash and memory location do not change when deregistering toolkits (not that they necessarily would be expected to) and re-importing it does not change behavior, implying that ...

The GLOBAL_TOOLKIT_REGISTRY can be imported from a number of modules -- for example:

The other ways to import it, I believe, just import it from openforcefield.utils.toolkits, which is the only place it's actually instantiated. So it's all pointing to the same variable, no matter how you get it.

I think the tricky thing happening above was GLOBAL_TOOLKIT_REGISTRY = ToolkitRegistry(toolkit_precedence=[RDKitToolkitWrapper]) since the variable sitting in an interpreter may not be what the methods call internally (without explicitly passing that in). I could be wrong

A singleton would mostly work here, since we want to be able to create a single instance of a class and forbid re-instantiation of it. From my research (learning many of these CS concepts on the fly) it seems to be fairy controversial whether or not a singleton is an anti-pattern (link, partially because global variables can serve most purposes better but also because people apparently have a habit of using singletons when they shouldn't. A downside here is that, along a sequence of unit tests, and modifications to GLOBAL_TOOLKIT_REGISTRY must be re-done before it is used again to ensure the "out of the box" behavior in another unit tests. From the snippets above, this seems to also happen with the current implementation and we've just gotten lucky with our unit needs not stressing this behavior. Another downside is that it is fairy difficult to extend, although I'm not sure we can anticipate making different versions of this one.

I'm concerned that what's really happening here is that a new GLOBAL_TOOLKIT_REGISTRY is being imported for each module.

I don't think this is happening, although my testing isn't that extensive.

... context manager approach, where we could use something like with using_toolkit(RDKit): ... to define blocks where certain toolkits or registries are used.

What is the benefit of using this over specifying the toolkit to each method call, or creating a temporary registries for particular purposes?

registry_to_use_now = ToolkitRegistry(...)
...  # do stuff
other_registry = ToolkitRegistry(...)
.... # do other stuff

I think the first question to address is whether GLOBAL_TOOLKIT_REGISTRY is actually a singleton or not.

I think it many respects it is, although I'm inclined to say that it's just a global-like variable that acts like a singleton. Since it's an instance, not a class, I don't think the definitions of singleton can really be considered here (unless we refactor it to a new class that exhibits the behaviors of one)

mattwthompson avatar Aug 20 '20 20:08 mattwthompson

I have a basic singleton implemented at the below commit. Making it an actual class has the benefit of enforcing its contents at instantiation but using GLOBAL_TOOLKIT_REGISTRY as a variable does nothing to prevent users from dangerous things like re-defining it. As long as we agree to never overwrite the variable with a different object, there's no trouble, but it would be cleaner to use the object, not the variable name, throughout function calls. All this being said, it's not clear to me if there is any use to it or if I'm just scope-creeping

6e2facf719f678a82cd2828b0c09c330e38c1cf2

mattwthompson avatar Aug 24 '20 23:08 mattwthompson

At some point @jaimergp suggested using context managers to set the toolkit registry for a given scope. May be a good thing to revisit.

The other thing is do we need a global toolkit registry? Seeing as most methods accept a toolkit as an argument, what does it give us that justifies the maintenance burden and potential user confusion?

SimonBoothroyd avatar Aug 25 '20 09:08 SimonBoothroyd

We've probably dug ourselves into a small by having a variable that hides in the background and hugely influences some behavior and breaks the "explicit is better than implicit" mantra, but I'm not sure it's something we can work around. Do we really want

with using_toolkit('openeye')
    mol = Molecule.from_smiles('CCO')

or

mol = Molecule.from_smiles('CCO', toolkit=RDKitToolkitWrapper)

to be required to sue these Molecule calls? If we wanted this to be hidden from the user, the calls to ToolkitWrapper methods inside of of the Molecule API will probably need some sort of default registry, anyway. The benefit here is that it "just works" in the background, as far as I can tell, and possibly after doing some modification at the top (i.e. de-registering OpenEye despite it being installed).

This isn't to say we shouldn't add a context manager as another way to control which toolkits are allowed to be used where, but I don't think we can get away from some sort of default/global registry floating around in the backend of things one way or another.

mattwthompson avatar Aug 25 '20 15:08 mattwthompson

I should have been a bit more specific - what I meant to ask was, do we really want to expose the global registry and have that as the main way for users to select the registry they want to use.

We've probably dug ourselves into a small by having a variable that hides in the background

I think that isn't necessarily a good reason to stick with an approach, and is only going to be harder to reverse if in the future you change you mind and have dug even deeper.

Before committing to the single access pattern, it's worth exploring what the other alternatives are and weighing each up. In particular, it would be good to think more about what the anticipated user uses of this will be, and exactly how you want the user to have control over these things.

SimonBoothroyd avatar Aug 25 '20 17:08 SimonBoothroyd

I definitely agree that we should give ourselves the freedom to move on from current implementations that limit us (in the present and future). I'm just personally having a hard time see how some aspects of the current implementation can be avoided. I agree it's probably a good time to think about how we want this to look long-term.

mattwthompson avatar Aug 25 '20 18:08 mattwthompson

Let me try to summarize the state of this discussion:

We have identified three topics which may or may not be totally separate:

  1. Should we enforce that GLOBAL_TOOLKIT_REGISTRY be a singleton?
  2. Should we encourage use patterns where people modify GLOBAL_TOOLKIT_REGISTRY?
  3. Should we make a context manager like using_toolkits(...) that safely handles temporary modifications to GLOBAL_TOOLKIT_REGISTRY?

My thoughts on these are:

For 1), I don't think this is necessary any longer. The driving force behind this suggestion was my concern that maybe we had tons of little GLOBAL_TOOLKIT_REGISTRYs running around. This doesn't actually appear to be the case. Instead, what's really happening is that assignment statements of the form from openforcefield.utils.toolkits import GLOBAL_TOOLKIT_REGISTRY; GLOBAL_TOOLKIT_REGISTRY = X don't actually modify the original imported object -- Instead it makes a new local variable that has nothing to do with openforcefield.utils.toolkits.GLOBAL_TOOLKIT_REGISTRY. So far, we have not found a way to do an in-place replacement of the original.

For 2): no. No we shouldn't. It's dangerous to register and deregister toolkits in the global registry live, and it introduces ambiguity about the scope of those changes. Regarding assignment calls, even if we could make GLOBAL_TOOLKIT_REGISTRY = ... statements work like I wanted, it would be unpythonic and confuse advanced python users who know that it shouldn't work that way.

For 3): That would be great as a new feature. It would enable a cleaner way to switch contexts, but it wouldn't fundamentally make any new behavior accessible, since the same thing could be achieved by repeatedly specifying toolkit_registry=... kwargs throughout a code block. It would probably make US do the dirty work of 2) in secret, but that's actually good since at least the complexity would be cordoned off into a single place.

In conclusion

  • We've "solved" the problem as stated in the title of the PR -- The problem isn't the "behavior", it what I "anticipated" the behavior should be.
  • We haven't found any actual problems that would be solved by enforcing singleton-ness
  • We should move away from advertising the existence of GLOBAL_TOOLKIT_REGISTRY to users, or recommending modifying it to solve problems.
  • I'd like to pursue the context manager idea

Does this seem like a good route forward?

j-wags avatar Aug 26 '20 22:08 j-wags

A very nice summary @j-wags !

I'd like to pursue the context manager idea

I'd also be interested in seeing this, with the idea that it would be good to mock something up, try it out, get feedback, but then dump it if it doesn't feel right! 🙂

Also happy to stick with just passing as an argument, but as you say that it is a bit more tedious and possibly easy to miss once if you are passing it several times in succession.

We should move away from advertising the existence of GLOBAL_TOOLKIT_REGISTRY to users, or recommending modifying it to solve problems.

It would probably be good to just expose a read-only way of accessing it (e.g. a static global_toolkit_registry() function which returns a copy), but not a way in which it can be overwritten / mutated.

SimonBoothroyd avatar Aug 27 '20 08:08 SimonBoothroyd

Agree with everything above, although it's a little unclear to me if we're considering dropping GLOBAL_TOOLKIT_REGISTRY altogether or carrying it along with some better behavior. This can be an open question while we iterate. (I wrote this comment out of order, so everything below assumes the answer is to drop it and may accidentally suggest that that we shouldn't.)

I took a quick stab at the context manager approach and can get something like this working without much struggle (attaching a notebook). It handles some of the above concerns without vastly changing the feel of the API

ToolkitWrapperContextManager.ipynb.gz

mol = Molecule.from_smiles('CCO')

RDKitToolkitWrapper().generate_conformers(molecule=mol, n_conformers=1)
rd_pos = mol.conformers

OpenEyeToolkitWrapper().generate_conformers(molecule=mol, n_conformers=1)
oe_pos = mol.conformers

with using_toolkit('rdkit') as toolkit_registry:
    mol.generate_conformers(toolkit_registry=toolkit_registry, n_conformers=1)
    assert not np.allclose(mol.conformers, oe_pos)
    assert np.allclose(mol.conformers, rd_pos)

with using_toolkit('openeye') as toolkit_registry:
    mol.generate_conformers(toolkit_registry=toolkit_registry, n_conformers=1)
    assert np.allclose(mol.conformers, oe_pos)
    assert not np.allclose(mol.conformers, rd_pos)

A sticking point, at least to me, is: Do we want every call that eventually wraps a toolkit to require specifying the intended toolkit? The instructive examples for using context managers usually involve the with something as thing thing being the focus of what's going on, not something that happens in the background. In the linked example and many others, it's a database you're connecting to and then operating on. Or, in the simpler example, opening a file and doing stuff to that file, not doing stuff while you happen to have a file open in the background.

This is both a practical (how to shape the user experience?) and technical (how to partially hide that variable and pass it through to function calls?) question, as pointed out above. My current feelings on each are

  • I would like to be able to still do calls like Molecule.from_smiles(SMILES) without need to specify the wrapped toolkit every time.
  • I don't think a context manager alone will be able to do this. It may require doing something akin to gluing a toolkit_registry=_background_registry-like argument to a ton of API calls using functools.partial. I may be googling the wrong keywords or maybe fresh eyes tomorrow will yield something more simple and obvious.

mattwthompson avatar Aug 27 '20 21:08 mattwthompson

I should have been a bit more specific - what I meant to ask was, do we really want to expose the global registry and have that as the main way for users to select the registry they want to use.

although it's a little unclear to me if we're considering dropping GLOBAL_TOOLKIT_REGISTRY altogether or carrying it along with some better behavior

Ah, I'd kinda sidestepped the question because it was ambiguous what the call was there. I'm making the assumption that we will keep GLOBAL_TOOLKIT_REGISTRY but enforce/encourage better behavior by making it less prominent in our docs/examples.

It's a weird pattern, but I do really like how things like Molecule.from_smiles just magically "work". I'd be happy to consider other, equally automagic alternatives though.

j-wags avatar Aug 27 '20 22:08 j-wags

Maybe we can take a look at how other projects deal with alternative backends:

  • pytorch.distributed: https://pytorch.org/docs/stable/distributed.html

jaimergp avatar Aug 31 '20 07:08 jaimergp

One can override the GLOBAL_TOOLKIT_REGISTRY by switching the order of the precedence inside the object instead of creating a new GLOBAL_TOOLKIT_REGISTRY, but I recall using private attributes to do that so a public-facing method would be handy.

lilyminium avatar Jun 01 '21 19:06 lilyminium

I'm concerned that what's really happening here is that a new GLOBAL_TOOLKIT_REGISTRY is being imported for each module.

I just wanted to clear this up a little. Each module is importing the same variable from .toolkits. However, if I understand correctly, the objects get compiled into memory when you first import your module into your interpreter (from openff.toolkit.topology import Molecule). During this compilation, the function default arguments get evaluated. At this point the GLOBAL_TOOLKIT_REGISTRY variable name in the function signature is meaningless; instead, the function keyword points to the object in memory. Therefore you can't replace it by replacing GLOBAL_TOOLKIT_REGISTRY, whether as a local or global variable, but you can alter behaviour by modifying the object itself.

This doesn't actually appear to be the case. Instead, what's really happening is that assignment statements of the form from openforcefield.utils.toolkits import GLOBAL_TOOLKIT_REGISTRY; GLOBAL_TOOLKIT_REGISTRY = X don't actually modify the original imported object -- Instead it makes a new local variable that has nothing to do with openforcefield.utils.toolkits.GLOBAL_TOOLKIT_REGISTRY. So far, we have not found a way to do an in-place replacement of the original.

So, you can reassign openforcefield.utils.toolkits.GLOBAL_TOOLKIT_REGISTRY = MY_REGISTRY and new functions using this variable will be updated. However, existing, compiled, functions are now totally distinct. There is no way to do in-place replacement of the original in compiled objects.

lilyminium avatar Jun 01 '21 20:06 lilyminium

This has been added in the private API for some time. This function has been working well for our own purposes and I think it's ready for the prime time. Once this method becomes public this issue can be closed.

j-wags avatar Aug 16 '23 19:08 j-wags