BioSimSpace
BioSimSpace copied to clipboard
Checking the mass of H atoms
Hello! I'm trying to introduce a check into the Relative class (feature-amber-fep for now) to throw a warning message if the timestep in the protocol is 4fs or more, but (based on the engine) HMR has not been applied. I was thinking of doing this by looking at the mass of H atoms in the passed system during the init stage:
# HMR check
# Check the timestep of the system.
timestep = self._protocol.getTimeStep.picoseconds().value()
# If it is more than 4fs, HMR should have been applied.
if timestep > 0.004:
# check system to see mass of H of perturbable molecule
perturbableMolecule = system.getPerturbableMolecules()[0]
# search for H atoms
for atom in perturbableMolecule.getAtoms():
if atom.element() == "Hydrogen (H, 1)":
atom = atom._getSireObject()
# ???
else:
pass
However I'm not really sure how to check the mass of the underlying sire object, or if this is the best way to go about it? Each engine would need to check for a different factor; for AMBER, a HMR factor of 3, for GROMACS, a HMR factor of 4, and for SOMD, no factor needs to have been applied as this is handled in the config file.
Alternatively, the HMR factor could only be applied to the system once it is passed into relative together with the protocol and the 4 fs timestep automatically, but this would give less flexibility in terms of the user choosing a different factor if they want to.
Thank you!
Hi there,
I would prefer if we set HMR as an option at the protocol level, (This could be added to any protocol where we'd like to support HMR.) We would then check for appropriate hydrogen masses when the process is initialised, then repartition them appropriately using system.repartionHydrogenMass()
if they haven't. By default the repartitioning factor could be something like "auto"
, which would use the recommended value for each engine, e.g. default HMR related options could be something like hmr=False
, hmr_factor="auto"
, hmr_water=False
. (Here "auto"
would use a factor of 3 for AMBER, 4 for GROMACS, etc.)
To check hydrogen masses you could do something along the lines of what you suggest. I'm not sure if we want to do this for all hydrogen atoms to which the repartitioning would be applied, or just the first we find. (For simplicity and speed.) In your case, you could do something like the following (not tested):
from Sire.Mol import Element
from pytest import approx
# Extract the mass property name from the user map.
mass_prop = property_map.get("mass", "mass")
# Set the expected HMR factor.
hmr_factor = protocol.getHMRFactor()
if protocol.getHMRFactor() == "auto":
if engine == "AMBER":
hmr_factor = 3
elif engine == "GROMACS":
hmr_factor = 4
...
# Extract the molecules to which HMR applies.
if protocol.getHMRWater() == False:
water = "no"
molecules = system.search("not water", property_map)
else:
water = "yes"
molecules = system.getMolecules()
# Whether we need to repartition.
repartition = False
# Loop over the molecules we need to check.
for mol in molecules:
# Loop over all hydrogen atoms in the molecule.
for atom in mol.search("element H", property_map)
# Get the mass in g per mol.
mass = atom._sire_object.property(mass_prop).value()
# Check that the mass matches what is expected.
if mass not approx(hmr_factor * hydrogen_mass, rel=1e-4):
repartition = True
break
# Repartition if necessary.
if repartition:
system.repartitionHydrogenMass(factor=hmr_factor, water=water, property_map)
There would be issues if somehow the masses were incorrectly repartitioned already, or partially repartitioned, since we would apply full repartitioning based on whatever the masses already are. This is probably unlikely, though.
Since the above checks all hydrogens in the appropriate molecules it might be slow for large systems, especially if water is included. If so, it might be better to add this check as a Sire function (in C++) and wrap to Python. I can do this at some point if needed.
Cheers.
There would also be some redundancy, since an appropriate time step would also need to be chosen. I guess HMR could be activated by default (with default options) whenever a 4fs time step is chosen.
Apologies for all the suggestions. There are obviously lots of ways to implement this, so it would be good to discuss strategies. I'd just prefer to make sure this are as flexible as possible, while also allowing the standard use case to be implemented easily.
Thank you for all the suggestions! They're really helpful :smile:
Initially I wasn't sure with regards to implementing it on a protocol level as that could lead to multiple HMR repartitioning (eg if there is a min, eq, prod window setup, with HMR only being repartitioned in the protocol stage there would be three instances of it being carried out for the system which would be redundant) but I think checking for HMR and only applying it when the factor is not applied would circumvent this quite well, as repartitioning can still be applied before once and then not need to be carried out again.
I'm not sure if we want to do this for all hydrogen atoms to which the repartitioning would be applied, or just the first we find. (For simplicity and speed.)
I think that I'll try implementing it for the first H atom found for now for speed.
By default the repartitioning factor could be something like
"auto"
, which would use the recommended value for each engine, e.g. default HMR related options could be something likehmr=False
,hmr_factor="auto"
,hmr_water=False
. (Here"auto"
would use a factor of 3 for AMBER, 4 for GROMACS, etc.)
I guess HMR could be activated by default (with default options) whenever a 4fs time step is chosen.
I think this makes sense as running at 4 fs without HMR is not really possible without the runs crashing. I guess to still provide flexibility we could have hmr=None
set as a default, and if that is the case then based on the timestep have it changed to True or False. That way it can still be changed in the protocol (if someone would want to run some higher timestep without HMR for some reason) but otherwise repartition at higher timesteps?
I'll try adding and testing it into the feature-amber-fep branch for now and see how it goes! I was going to add it in similarly how the free energy mixin is implemented so it can be included in all the needed protocols?
For speed, if you just want the first hydrogen in a search you can use the subscripting syntax in the search string (this is {search term}[index]
), e.g.
h = molecules.search("{element H}[0]")
mass = h._sire_object.property("mass").value()
or you could use an evaluator, e.g.
mass = molecules.search("{element H}[0]")._sire_object.evaluate().mass()
For information for the near future, I have been adding more capability to the search syntax as part of the modernisation of the Sire API. This makes the searching more consistent, plus adds in evaluators that can work across molecules, plus adds in functions that call the appropriate evaluator directly. With this, you could calculate the average hydrogen mass and compare that to your criterion, e.g.
not_water_h = system["(not water) and element H"]
with_water_h = system["element H"]
avg_not_water_h_mass = not_water_h.mass() / len(not_water_h)
avg_with_water_h_mass = with_water_h.mass() / len(with_water_h)
if avg_not_water_h_mass.value() > XXX...
This is very fast as the search and the calculation of the total mass all happen in C++.
I am also planning on adding a mass search term, so that you will be able to do
heavy_hydrogen = system[f"element H and mass > {hmr_factor*hydrogen_mass}"]
Interesting, does using:
molecules.search("{element H}[0]")
abort immediately at the first match, or does it still do the complete search and just return the first item? Even though the search is done in C++, I've had to write some custom C++ functions to do certain things in BioSimSpce since the search became slow when dealing with very large protein-ligand systems.
This does the complete match, then looks for the first item (it is the simple case of the range, as the IDSubScriptEngine
supports ranges and negative indexing). The search can be optimised, so if there are slow parts we can profile and improve. I have already sped it up quite a bit by changing the way SelectResult
stores and retrieves the molecules.
Great. I think it would be good to have something like a num_results
option, that will exit immediately when that many results have been found.
For information for the near future, I have been adding more capability to the search syntax as part of the modernisation of the Sire API. This makes the searching more consistent, plus adds in evaluators that can work across molecules, plus adds in functions that call the appropriate evaluator directly. With this, you could calculate the average hydrogen mass and compare that to your criterion, e.g.
not_water_h = system["(not water) and element H"] with_water_h = system["element H"] avg_not_water_h_mass = not_water_h.mass() / len(not_water_h) avg_with_water_h_mass = with_water_h.mass() / len(with_water_h) if avg_not_water_h_mass.value() > XXX...
Thank you for the suggestions! That sounds like a better way to do it in the future than just a single H.
Regarding the current search, I was wondering about some error messages. If I do:
molecules = system.search("not water", property_map)
h = molecules.search("{element H}[0]")
I get an error message that 'SearchResult' object has no attribute 'search'
, however if I try to circumvent this by:
water_mols = system.getWaterMolecules()
molecules = system - water_mols
h = molecules.search("{element H}[0]")
mass = h._sire_object.property(mass_prop).value()
I get AttributeError: 'SelectResult' object has no attribute 'property'
and with mass = molecules.search("{element H}[0]")._sire_object.evaluate().mass()
I get AttributeError: 'SelectResult' object has no attribute 'evaluate'
.
The following options work though for finding the mass:
for mol in molecules:
for atom in molecule.search("element H", property_map):
mass = atom._sire_object.property(mass_prop).value()
# or
for atom in molecules.search("{element H}[0]"):
mass = atom._sire_object.evaluate().mass().value()
# or even
mass = (molecules.search("{element H}[0]"))[0]._sire_object.evaluate().mass().value()
I was mainly wondering if it is a general feature of the search function that what is returned needs to be indexed in order to then be used? As in the case for finding the non-water molecules, doing system - water_mols
is the slowest part of the code, and indexing would only return one molecule, although for mol in molecules
works okay too.
An additional issue I've run into is that, in the case of the solvated free leg, the only molecule (if hmr_water = False) in the system is the perturbed ligand. In this case, it is never able to find the H atom. I saved the perturbable system as a pdb and when looking at that, the types of all atoms for the perturbed molecule are for example:
ATOM 7534 O19 2506 0.000 0.000 0.000 1.00 0.00 X
ATOM 7535 C20 2506 0.000 0.000 0.000 1.00 0.00 X
ATOM 7536 CL2 2506 0.000 0.000 0.000 1.00 0.00 X
ATOM 7537 H4 2506 0.000 0.000 0.000 1.00 0.00 X
ATOM 7538 H5 2506 0.000 0.000 0.000 1.00 0.00 X
ATOM 7539 H6 2506 0.000 0.000 0.000 1.00 0.00 X
which I think may be the cause?
Hi there,
'SearchResult' object has no attribute 'search'
This is because when performing a search on the BioSimSpace objects you are getting back a BioSimSpace._SireWrappers.SearchResult
object, which is a wrapper around the underlying Sire object. This doesn't have its own search
attribute. I think the confusion was @chryswoods talking about Sire objects, whereas my original example uses the BioSimSpace wrappers.
Something like this should work:
for mol in system.search("not water"):
h = mol.search("{element H}[0]")
# Need to check for a single result, since there will be zero if there are no hydrogens.
if h.nResults() == 1:
mass = h[0]._sire_object.evaluate().mass().value()
We need to be careful, though, since the element search and mass evaluation would assume correctly named properties, e.g. element
and mass
. This is why I previously did things by passing in the property map and extracting properties by name. That way is general, since we use the user specified map (this says how the properties in the molecule are named) to retrieve things.
Hope this helps.
P.S. You can always get the underlying Sire object from a wrapper using the ._sire_object
attribute. This would then behave as in @chryswoods' examples.
That makes sense, thank you!
Sorry for causing confusion - yes, I was showing the Sire search functionality. Part of my work now is to make this fast and consistent enough that we can expose this more fully in BioSimSpace.
I agree that you need to use a map if you want to be general. You can also pass a map to evaluate().mass()
, e.g.
mass = h[0]._sire_object.evaluate().mass(map).value()
This is more powerful than searching for the properties directly, as it has the logic to, e.g. look for the element property and get the element mass if no explicit atom masses have been set. Similarly, evaluate().charge()
or evaluate().charge(map)
would look for formal charges if no charge properties have been set. The evaluate function also works across collections of atoms, e.g. h._sire_object.evaluate().mass(map)
would calculate the total mass across all matches (assuming this has been exposed to BioSimSpace).