openff-toolkit icon indicating copy to clipboard operation
openff-toolkit copied to clipboard

Toolkit showcase update

Open Yoshanuikabundi opened this issue 3 years ago • 6 comments

This PR updates the showcase to use all the new magic in the toolkit and interchange. It will no longer use Parmed and OpenMM force fields.

It also includes performance optimizations for new features required to run the example in an acceptable time frame.

Yoshanuikabundi avatar Aug 11 '22 01:08 Yoshanuikabundi

Codecov Report

Merging #1368 (70e776d) into main (2a9d6ec) will decrease coverage by 0.09%. The diff coverage is 95.04%.

codecov[bot] avatar Aug 11 '22 01:08 codecov[bot]

@Yoshanuikabundi please tag me or reach out to me directly if you have any questions or suggestions on integrating Interchange in this, including patches to other repos. I don't want to get in your way but I'm also interested and invested in getting this working nicely 🙂

mattwthompson avatar Aug 11 '22 03:08 mattwthompson

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

I've added a few potentially controversial methods to Topology:

  • Topology.get_positions()
  • Topology.set_positions()
  • Topology.visualize()

I'm happy to make them private or put them in the notebook itself, but I think all three make life a lot easier. The first two are just convenience methods for accessing and updating all the positions at once and work by iterating over molecules and assigning/reading the first conformers of each. At some point we should talk about adding an option to the PDB output methods to use standard atom names, because I haven't found any protein visualization software that can render a cartoon on our PDB outputs (and the atom names are not even unique, possibly because the field in a PDB isn't long enough), but for the moment the new visualize() method can fill in this gap.

@mattwthompson Thanks for the offer! Turns out I need you immediately. When I try to construct an OpenMM topology, Interchange complains about virtual sites (AFAIK there should be no virtual sites in this system, but Interchange seems to expect them):

---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
Input In [14], in <cell line: 2>()
      1 omm_system = interchange.to_openmm()
----> 2 omm_top = interchange.to_openmm_topology()

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/components/interchange.py:529, in Interchange.to_openmm_topology(self, ensure_unique_atom_names)
    526 """Export components of this Interchange to an OpenMM Topology."""
    527 from openff.interchange.interop.openmm import to_openmm_topology
--> 529 return to_openmm_topology(
    530     self, ensure_unique_atom_names=ensure_unique_atom_names
    531 )

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/interop/openmm.py:922, in to_openmm_topology(interchange, ensure_unique_atom_names)
    916 from openff.interchange.interop._virtual_sites import (
    917     _virtual_site_parent_molecule_mapping,
    918 )
    920 topology = interchange.topology
--> 922 virtual_site_molecule_map = _virtual_site_parent_molecule_mapping(interchange)
    924 molecule_virtual_site_map = defaultdict(list)
    926 for virtual_site, molecule in virtual_site_molecule_map.items():

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/interop/_virtual_sites.py:18, in _virtual_site_parent_molecule_mapping(interchange)
     13 def _virtual_site_parent_molecule_mapping(
     14     interchange: "Interchange",
     15 ) -> Dict[VirtualSiteKey, Molecule]:
     16     mapping = dict()
---> 18     for virtual_site_key in interchange["VirtualSites"].slot_map:
     19         virtual_site_key: VirtualSiteKey  # type: ignore[no-redef]
     20         parent_atom_index = virtual_site_key.orientation_atom_indices[0]

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/components/interchange.py:826, in Interchange.__getitem__(self, item)
    824     return self.handlers[item]
    825 else:
--> 826     raise LookupError(
    827         f"Could not find component {item}. This object has the following "
    828         f"potential handlers registered:\n\t{[*self.handlers.keys()]}"
    829     )

LookupError: Could not find component VirtualSites. This object has the following potential handlers registered:
	['Bonds', 'Constraints', 'Angles', 'ProperTorsions', 'ImproperTorsions', 'vdW', 'Electrostatics']

A minimal example is (with the attached topology.json):

from openff.toolkit import Topology, ForceField
from openmm import unit as openmm_unit
from openff.units.openmm import from_openmm

with open("topology.json", "r") as f:
    top = Topology.from_json(f.read())

sage_ff14sb = ForceField("openff-2.0.0.offxml", "ff14sb_off_impropers_0.0.3.offxml")
interchange = sage_ff14sb.create_interchange(top)
omm_top = interchange.to_openmm_topology()

If you want to see how the topology is constructed, I've committed and pushed my current working version of the notebook. Note that it requires changes to the Toolkit itself that are included on this branch. Please ignore the prose, I'm working on the code first.

I'm excited to get this working!

Yoshanuikabundi avatar Aug 11 '22 11:08 Yoshanuikabundi

My first-impression is that these new Topology methods should be standalone functions scoped to the notebook; I think we are getting close to the wire for significant changes to the API in this release and writing them as functions allows us to defer decisions into the future - enabling the notebook to work just about as well as if they were methods but requiring less testing and consensus-gathering. I'm moderately opposed to positions being on topologies here and my opinion on Topology.visualize() is closer to neutrality.

I'll fix that bug shortly, no reason that logic should be in place in the way it is. I thought I wrote a test for that, guess not.

mattwthompson avatar Aug 11 '22 16:08 mattwthompson

I've added a few potentially controversial methods to Topology:

Topology.get_positions()
Topology.set_positions()
Topology.visualize()

I'm happy to make them private or put them in the notebook itself, but I think all three make life a lot easier.

My first-impression is that these new Topology methods should be standalone functions scoped to the notebook; I think we are getting close to the wire for significant changes to the API in this release and writing them as functions allows us to defer decisions into the future

100% agree with Matt here - Let's just define these in the notebook for now! We've been talking about exposing positions on topologies for a while but every option seems to be a steep tradeoff between horrible complexity and user confusion/footguns. So it'll need more thought, but this close to release it'll be best to keep this out of the API.

j-wags avatar Aug 11 '22 17:08 j-wags

Ok, I've moved those methods back into the notebook.

I'm moderately opposed to positions being on topologies here We've been talking about exposing positions on topologies for a while but every option seems to be a steep tradeoff between horrible complexity and user confusion/footguns.

We need to make a decision about this, because I need to document (and already have) when users should introduce their positions. I thought we agreed in openff-interchange/436 that we would support velocities, visualization, and positions in topologies - hence me adding some of that support here! :sweat_smile:

@j-wags I think just providing Topology.get_positions() and Topology.set_positions() is the compromise that avoids confusion without much added complexity. Both methods work by iterating over the molecules' first conformers. The get/set pattern makes it clear that you're working with a copy of the data, and we document that as well. Both methods together including docstrings are less than 60 lines of code and include all the safety checks you could hope for. I suspect if we don't provide these methods, our users will write them themselves. But we don't have to do that today!

Here's me ranting about how we should conceptually have positions on topologies (and a solvate method) for a few pages, you should probably skip it but I couldn't bear to delete it.

I'll just recap the current situation:

  • Interchange stores the system positions and velocities in the .positions and .velocities arrays
  • Interchange initializes its positions from the first conformation of each molecule in the topology iff every molecule has at least one conformer.
  • Interchange works with .positions or .velocities set to None, but export methods requiring those values will fail
  • Interchange supports a .visualize() method
  • By design, methods for adding or removing molecules to an existing Interchange do not exist

In the showcase, this makes things a bit awkward. Our inputs are:

a. A PDB of the receptor protein with crystallographic waters (that we want to preserve) b. An SDF of the small molecule ligand

To assemble the system, the current strategy is:

  1. Solvate the PDB with crystallographic waters with PDBFixer, which takes a PDB as input
  2. Split the protein out into its own PDB and import it with Molecule.from_polymer_pdb()
  3. Import the fully solvated system directly from PDBFixer with Topology.from_openmm(), passing in the above protein molecule as one of the unique_molecules

At this point, we have:

a. a topology with everything except the ligand, without positions b. an array of positions for everything except the ligand (from PDBFixer) c. an SDF of the ligand, with positions

If topologies don't have positions, what do we do here? Something like identify the clashing waters, remove their coordinates from the array, remove the corresponding number of waters from the topology, concatenate the SDF coordinates to the rest of the coordinates, add the ligand to the topology, convert it to an interchange, and THEN add the coordinates to the interchange. Conceptually, having positions in the topology is much simpler:

  1. Set the positions of the molecules in the topology with the array of positions from PDBFixer.
  2. Import the ligand with Molecule.from_file() and add it to the topology
  3. Remove water molecules that clash with the ligand

Note that I'm not talking about having a positions array in the topology; I'm just talking about what we suggest conceptually. Setting positions just means iterating over the molecules and setting the first conformer. This keeps everything object oriented, which is convenient for scientists. I think in most workflows involving proteins, the molecular identity and coordinates are stored together, and so they should enter the OpenFF ecosystem together. Plus, if topologies have coordinates, a user can easily parametrize the exact same system with different force fields.

This workflow is very complex. There is an important question as to whether the developers should pay that complexity cost or the users. I know a Topology.solvate() method is unpopular, but I would rather personally maintain a single solvation method in the toolkit than a separate one in every example, and I'm sure our users would appreciate it too. This would replace steps 1, 4, and 6 with a call to this method. It would mean either packaging equillibrated water boxes with the toolkit, or calling out to something like PDBFixer.

A Topology.from_pdb() method, which would take a list of unique molecules and just load the PDB with openmm.app.PDBFile and pass it through Topology.from_openmm, would be nice too, but this is simple enough that I'm happy for it to live in the examples.

Not asking for anything to change now, but if we're gonna say positions don't belong on topologies I need an absolutely rock solid reason because from a pedagogical and users' perspective I am 99.999% sure they do.


I need to flag a few other things that are driving me nuts at the moment so we can sort them out after 0.11.0:

  • Visualization at the moment is a massive pain, because we attempt to uniquify atom names at the drop of a hat. PyMol, VMD, and NGLView all rely on atom names to identify proteins. Interchange automatically attempts to uniquify atom names, and this means that all outputs are unable to be visualized as ribbons or cartoons, and you can't use selection languages to show side chains in a different representation, and so on. Call interchange.visualize() on the toolkit showcase interchange and you'll see... two caps and a ligand in ball and stick rep and nothing else. This happens even in Topology.to_file(), which is essential for visualization outside the notebook - and there, it produces PDBs that don't even have unique atom names past 100 atoms of an element because the name has a fixed width! I think we should have options everywhere to preserve atom names.

  • at the moment I have to write PDBs to disk a lot to pass information between libraries

    • It would be nice to be able to provide a file-like object to methods like Topology.to_file.
    • Exposing a method like Molecule.from_polymer_openmm_topology(), which is just Molecule.from_polymer_pdb() without the PDBFile step, would be wonderful too

@mattwthompson - What are the dangers of just doing top.to_openmm() rather than interchange.to_openmm_topology()? It seems to be working fine (and it lets me visualize things).


Also, Topology.to_json() and Topology.from_json() are wonderful methods whom I love.

Sorry if this is a grumpy comment.

Yoshanuikabundi avatar Aug 15 '22 06:08 Yoshanuikabundi

What are the dangers of just doing top.to_openmm() rather than interchange.to_openmm_topology()?

Not a danger, but the toolkit's method does not include virtual sites (by design). Otherwise their behavior should be identical

mattwthompson avatar Aug 15 '22 09:08 mattwthompson

I thought we agreed in https://github.com/openforcefield/openff-interchange/issues/436#issuecomment-1104609747 that we would support velocities, visualization, and positions in topologies - hence me adding some of that support here! 😅

Woah, so I did. My mistake @Yoshanuikabundi, and thanks for correcting me. You're right - We agreed to add positions to topologies (and velocities, which doesn't need to be done in this PR). So I'll stick with that decision. Sorry for the confusion and thanks for linking to the written decision :-)

Visualization at the moment is a massive pain, because we attempt to uniquify atom names at the drop of a hat. PyMol, VMD, and NGLView all rely on atom names to identify proteins

Agreed. It looks like you've already switched ensure_unique_atom_names=False in the Toolkit calls, which is great. @mattwthompson, would you be open to a future PR to Interchange that prevents renaming in Interchange.visualize?

It would be nice to be able to provide a file-like object to methods like Topology.to_file.

Oh, that'd be great. Happy to have that in this PR or in a future one.

Exposing a method like Molecule.from_polymer_openmm_topology(), which is just Molecule.from_polymer_pdb() without the PDBFile step, would be wonderful too

Agree, but that will take more quality assurance/documentation than we have time for in the near future. I think the current from_polymer_pdb behavior relies somewhat on the fact that the topology it's loading came from an OpenMM PDBFile object. So we'll need additional logic to handle/provide useful errors for loading general OpenMM topologies.

j-wags avatar Aug 15 '22 21:08 j-wags

This pull request introduces 1 alert when merging 1adc90cd35500df6680ce0ee190f1a5c41dd25a4 into db20b63b287ae7d5ad4e4ed97be50bad79e4dcbe - view on LGTM.com

new alerts:

  • 1 for Unused import

lgtm-com[bot] avatar Aug 16 '22 11:08 lgtm-com[bot]

Sorry @mattwthompson - I see Interchange.to_openmm_topology(ensure_unique_atom_names=False) does work already, and Interchange.visualize() does preserve atom names. I must have woken up on the wrong side of the bed the other day. The vsites error has also gone away! Thanks!

@j-wags - I've implemented the changes to ensure_unique_atom_names in commit f76a02a as discussed on slack. If we like it, I'll write tests and submit a PR for the equivalent functionality in Interchange, and if it needs more dev time I'll move it to its own PR.

What do we want to do if we can't get speedy GROMACS support working in time? Just remove the appendix?

Yoshanuikabundi avatar Aug 17 '22 04:08 Yoshanuikabundi

OK well this PR is a monstrosity. Sorry you have to review it @j-wags. I think it is done though, and I expect the currently running tests to pass!

Yoshanuikabundi avatar Aug 23 '22 07:08 Yoshanuikabundi

This must be a new GH feature - I think by making changes after you gave an approving review I've undone your approval. I'll merge with my admin privileges once tests pass and I've double checked the rendered docs but this might need addressing for the future.

Yoshanuikabundi avatar Aug 24 '22 02:08 Yoshanuikabundi