torchmd-net icon indicating copy to clipboard operation
torchmd-net copied to clipboard

Physics based priors

Open peastman opened this issue 3 years ago • 72 comments

I want to implement some physics based priors. An example would be a Coulomb interaction based on pre-computed partial charges that are stored in the dataset. BasePrior.forward() is supposed to return a list of per-atom energy contributions, but physics based interactions usually do not decompose in that way. It would be much easier if it could just return a total energy for each sample.

What do you recommend as the cleanest way of implementing this?

peastman avatar Jun 23 '21 20:06 peastman

BasePrior.forward() returns the updated atomwise energy predictions, not only the prior contribution (see the Atomref prior as an example). It would be possible to add the prior divided by the number of atoms to each atomwise prediction but I don't think this is very clean. It might be better to have an atomwise_prior flag in BasePrior which determines whether it should be applied before or after reducing the atomwise predictions. We would then apply the prior model either before (as it is done now) or after the following line, based on this flag: https://github.com/compsciencelab/torchmd-net/blob/7a92b3ef7739da14aa8cdd2d87c8e9462d86cf14/torchmdnet/models/output_modules.py#L73 The flag can then be set through the super().__init__(atomwise_prior=True/False) call of the deriving class.

Feel free to open a PR if you want to have a go at this, otherwise I can also make this change.

PhilippThoelke avatar Jun 24 '21 08:06 PhilippThoelke

Sounds good, I'll add that.

peastman avatar Jun 24 '21 20:06 peastman

Sorry for reopening an old issue - this never actually got implemented! I'm about to start on it now. I'll add the atomwise_prior flag as suggested above. I also have a few other questions about the best way of implementing particular potentials.

Physics based potentials will often require extra information. For example, one prior I want to add is the ZBL stopping potential for short range repulsion. It requires knowing the atomic number of every atom. forward() takes an argument z which often corresponds to the atomic number, but not always. For example, in some of my models I define atom types by the combination of element and formal charge, so z is an arbitrary atom type index. I need to separately provide both an atom type and an atomic number.

A more complicated example is Coulomb, which depends on the charge of each atom. That could be a fixed number (formal charge or precomputed partial charge), but in other cases it will itself be calculated as an output of the model.

What's the best way of providing extra information like this that's needed to compute a physics based potential?

peastman avatar Oct 03 '22 20:10 peastman

Another question: currently it only lets you specify a single prior model, but often you may want more than one. A ZBL potential for short range repulsion, plus Coulomb for long range interactions, plus D4 for dispersion. (SpookyNet includes all of those.) Any thoughts on the best way of handling this?

peastman avatar Oct 03 '22 20:10 peastman

And another question: how do we want to handle units? Nothing in the current code worries about them, except sometimes for a particular dataset. The inputs and outputs to a model are just numbers that might have any units. But physical potentials involve hardcoded constants like eps_0 whose values depend on units. How should we handle this? Some possibilities include

  • Standardizing on a particular set of units for all models.
  • Standardize on a particular set of units for a particular potential function, and require the user to provide scaling factors for converting positions, energies, etc. to the required units.
  • Implement it at a higher level, where a model specifies what units it uses.
    • We could also make each Dataset specify its units, and conversions between the dataset and model units would happen automatically.

peastman avatar Oct 03 '22 22:10 peastman

can you use that single model as a wrapper for all the models that you would like to use? It's just the entry point.

On Mon, Oct 3, 2022 at 4:45 PM Peter Eastman @.***> wrote:

Another question: currently it only lets you specify a single prior model, but often you may want more than one. A ZBL potential for short range repulsion, plus Coulomb for long range interactions, plus D4 for dispersion. (SpookyNet includes all of those.) Any thoughts on the best way of handling this?

— Reply to this email directly, view it on GitHub https://github.com/torchmd/torchmd-net/issues/26#issuecomment-1266026524, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUOQDFOZQNOJDW332PW3WBNAWFANCNFSM47GPOV2Q . You are receiving this because you modified the open/close state.Message ID: @.***>

giadefa avatar Oct 04 '22 12:10 giadefa

The code is unit-neutral, whatever you use as input it spits out as output

On Mon, Oct 3, 2022 at 6:48 PM Peter Eastman @.***> wrote:

And another question: how do we want to handle units? Nothing in the current code worries about them, except sometimes for a particular dataset. The inputs and outputs to a model are just numbers that might have any units. But physical potentials involve hardcoded constants like eps_0 whose values depend on units. How should we handle this? Some possibilities include

  • Standardizing on a particular set of units for all models.
  • Standardize on a particular set of units for a particular potential function, and require the user to provide scaling factors for converting positions, energies, etc. to the required units.
  • Implement it at a higher level, where a model specifies what units it uses.
    • We could also make each Dataset specify its units, and conversions between the dataset and model units would happen automatically.

— Reply to this email directly, view it on GitHub https://github.com/torchmd/torchmd-net/issues/26#issuecomment-1266154668, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUOSBZZSRRT5ZD6ARC5TWBNPCJANCNFSM47GPOV2Q . You are receiving this because you modified the open/close state.Message ID: @.***>

giadefa avatar Oct 04 '22 12:10 giadefa

The code is unit-neutral, whatever you use as input it spits out as output

I know. And that's incompatible with physics based potentials. They involve internal parameters with units. It's impossible to compute them without knowing what units the inputs are in, and what units the outputs are expected to be in.

peastman avatar Oct 04 '22 15:10 peastman

can you use that single model as a wrapper for all the models that you would like to use? It's just the entry point.

Currently there's a single option in the configuration file, prior_model. It specifies the class of the single prior model to create, whose constructor has to take no arguments except the dataset. We need to be able to specify much more complicated arrangements. For example, "Include both a ZBL potential, using atomic numbers specified in the dataset, and a Coulomb potential, using charges that are generated as an output of the model."

peastman avatar Oct 04 '22 16:10 peastman

who builds the potential knows, so the problem is when we distribute it, maybe we can add some optional unit indication somehow?

On Tue, Oct 4, 2022 at 11:56 AM Peter Eastman @.***> wrote:

The code is unit-neutral, whatever you use as input it spits out as output

I know. And that's incompatible with physics based potentials. They involve internal parameters with units. It's impossible to compute them without knowing what units the inputs are in, and what units the outputs are expected to be in.

— Reply to this email directly, view it on GitHub https://github.com/torchmd/torchmd-net/issues/26#issuecomment-1267222515, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUOSNCH2E4YKKSGLTXC3WBRHUJANCNFSM47GPOV2Q . You are receiving this because you modified the open/close state.Message ID: @.***>

giadefa avatar Oct 04 '22 16:10 giadefa

We could extend that as we did for the dataset arguments with a prior_args dictionary

On Tue, Oct 4, 2022 at 12:05 PM Peter Eastman @.***> wrote:

can you use that single model as a wrapper for all the models that you would like to use? It's just the entry point.

Currently there's a single option in the configuration file, prior_model. It specifies the class of the single prior model to create, whose constructor has to take no arguments except the dataset. We need to be able to specify much more complicated arrangements. For example, "Include both a ZBL potential, using atomic numbers specified in the dataset, and a Coulomb potential, using charges that are generated as an output of the model."

— Reply to this email directly, view it on GitHub https://github.com/torchmd/torchmd-net/issues/26#issuecomment-1267233439, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUORJGKTK5PHL2OVOPPLWBRIS7ANCNFSM47GPOV2Q . You are receiving this because you modified the open/close state.Message ID: @.***>

giadefa avatar Oct 04 '22 16:10 giadefa

Perhaps we're thinking about this differently. Currently the file priors.py includes only a single prior model: Atomref, which subtracts a reference energy that is defined in the dataset for each atom type. My understanding is that we want to add more choices that implement common physical models. Someone should be able to specify prior_model: Coulomb in their configuration file, and it will add a Coulomb interaction to whatever model they're trying to train. Is that different from your understanding?

peastman avatar Oct 04 '22 16:10 peastman

yes, that is fine. Maybe I don't understand the problem, you can specify whatever class.

Maybe just do what you need on a PR so we can discuss it. Especially if it's backward compatible, we can merge it.

g

On Tue, Oct 4, 2022 at 12:15 PM Peter Eastman @.***> wrote:

Perhaps we're thinking about this differently. Currently the file priors.py includes only a single prior model: Atomref, which subtracts a reference energy that is defined in the dataset for each atom type. My understanding is that we want to add more choices that implement common physical models. Someone should be able to specify prior_model: Coulomb in their configuration file, and it will add a Coulomb interaction to whatever model they're trying to train. Is that different from your understanding?

— Reply to this email directly, view it on GitHub https://github.com/torchmd/torchmd-net/issues/26#issuecomment-1267246646, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUOWZ4NQG2VVUKFAVSTTWBRJZXANCNFSM47GPOV2Q . You are receiving this because you modified the open/close state.Message ID: @.***>

giadefa avatar Oct 04 '22 16:10 giadefa

Let's consider the problem of adding a Coulomb interaction. To begin with, it can be implemented in several ways.

  • The dataset provides a precomputed charge for every atom. You compute the interaction based on them.
  • The model predicts a charge for every atom. They get used to compute the interaction.
  • The model predicts an electronegativity for every atom. You use them to solve for the charges, which then get used to compute the interaction. This also requires knowing the total charge on every molecule, or even better the formal charge on every atom.

First problem is that I don't think TorchMD-Net really supports multitask models yet? Its models produce a single number for each atom, which is interpreted as energy. We need them to produce multiple values for each atom: both energy and charge, or energy and electronegativity. That might also involve extra terms in the loss function: train the model to reproduce charges specified in the dataset.

Next we need to define a mechanism for datasets to provide the extra information required: partial charges, formal charges, or both.

Then there's the problem of units. The Coulomb code receives positions in some units and it needs to produce an energy in some units. The value of eps_0 depends on the units. It's impossible to calculate the result without knowing the units.

Finally we need to design the user interface for all of this. What options will the user add to their configuration file to request a particular combination of physical terms, calculated in a particular way, trained with a particular loss function? You need to be able to specify things like this: "My dataset reports atomic numbers, formal charges, and partial charges for every atom. I want the model to predict electronegativities, which will be used to compute partial charges. Include a loss term based on how well the predicted charges match the ones in the dataset. Once the charges are computed, add ZBL, Coulomb, and dispersion terms to the final energy."

peastman avatar Oct 04 '22 17:10 peastman

Would it be better to just delta the forces outside torchmd-net? For example in openMM and just use torchmd-net for the NNP part?

Let's have a chat about it. I'll contact you directly

On Tue, Oct 4, 2022 at 1:08 PM Peter Eastman @.***> wrote:

Let's consider the problem of adding a Coulomb interaction. To begin with, it can be implemented in several ways.

  • The dataset provides a precomputed charge for every atom. You compute the interaction based on them.
  • The model predicts a charge for every atom. They get used to compute the interaction.
  • The model predicts an electronegativity for every atom. You use them to solve for the charges, which then get used to compute the interaction. This also requires knowing the total charge on every molecule, or even better the formal charge on every atom.

First problem is that I don't think TorchMD-Net really supports multitask models yet? Its models produce a single number for each atom, which is interpreted as energy. We need them to produce multiple values for each atom: both energy and charge, or energy and electronegativity. That might also involve extra terms in the loss function: train the model to reproduce charges specified in the dataset.

Next we need to define a mechanism for datasets to provide the extra information required: partial charges, formal charges, or both.

Then there's the problem of units. The Coulomb code receives positions in some units and it needs to produce an energy in some units. The value of eps_0 depends on the units. It's impossible to calculate the result without knowing the units.

Finally we need to design the user interface for all of this. What options will the user add to their configuration file to request a particular combination of physical terms, calculated in a particular way, trained with a particular loss function? You need to be able to specify things like this: "My dataset reports atomic numbers, formal charges, and partial charges for every atom. I want the model to predict electronegativities, which will be used to compute partial charges. Include a loss term based on how well the predicted charges match the ones in the dataset. Once the charges are computed, add ZBL, Coulomb, and dispersion terms to the final energy."

— Reply to this email directly, view it on GitHub https://github.com/torchmd/torchmd-net/issues/26#issuecomment-1267305903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUORU5VPJAKUYACSSO6DWBRQAFANCNFSM47GPOV2Q . You are receiving this because you modified the open/close state.Message ID: @.***>

giadefa avatar Oct 04 '22 17:10 giadefa

Multi-head implementation

Yes, currently models can only have a single output head. I think adding multi-head support could be useful, however, requires changes to the current training interface and poses some important design questions.

For example: With multiple output heads it should be possible to individually set the standardize argument, the type of output model (i.e. output_model argument), every output head should get its own list of prior models (and in turn list of prior_args), its own unit, a loss weighting factor, a "compute neg. derivatives" flag, exponential moving average coefficients for y and potentially neg_dy and potentially a couple others I'm forgetting now.

If then you want charges to come either from the model or from the dataset it gets even more complicated, indicating which one should be used. You would also need to order the computation of the output heads as you are suggesting that the output of one head could depend on the prediction of another. This relationship would also have to be defined in the config file.

Prior args

We could extend that as we did for the dataset arguments with a prior_args dictionary

This already exists and currently is a critical piece in order to reconstruct prior models from model checkpoints. If the model contains a prior model, the arguments required for building this model will be stored in the hparams.yaml file as they are set here https://github.com/torchmd/torchmd-net/blob/a80e3787b7f6bee32e728187fd6b5866359e8577/torchmdnet/scripts/train.py#L129 The load_model function expects prior_args to be set. The prior model itself expects either the dataset object or the prior args to be present to correctly instantiate the object. For example the Atomref prior depends on max_z, which is stored in prior_args but falls back to retrieving this from the dataset object in case max_z is not set: https://github.com/torchmd/torchmd-net/blob/a80e3787b7f6bee32e728187fd6b5866359e8577/torchmdnet/priors.py#L50-L57 This is the section of code that reconstructs a prior model from a checkpoint file but that is ignored if it's the first time constructing this model: https://github.com/torchmd/torchmd-net/blob/a80e3787b7f6bee32e728187fd6b5866359e8577/torchmdnet/models/model.py#L67-L77 In the first instantiated of a model, the prior_model variable will not be None in this context as it is directly passed to the create_model function: https://github.com/torchmd/torchmd-net/blob/a80e3787b7f6bee32e728187fd6b5866359e8577/torchmdnet/module.py#L24

Defining units

I think the freedom the current unit handling (i.e. none) gives is very powerful. If you want to predict some property you don't have to rely on us supporting units for this property but instead just plug in a dataset with the appropriate label. For self- or unsupervised pretraining we also don't necessarily have a unit that we predict, if for example we mask the atom type of one atom and predict this. To pass unit information to prior models I think it does make sense that datasets have some standardized way of defining their units (if it makes sense to define a unit for a certain dataset). Similarly to how Atomref currently works, it relies on the dataset implementing the get_atomref function. The way that the Atomref prior accesses this function here: https://github.com/torchmd/torchmd-net/blob/a80e3787b7f6bee32e728187fd6b5866359e8577/torchmdnet/priors.py#L57 you could in the same way access some kind of get_units function and throw an error if a given dataset doesn't implement that.

PhilippThoelke avatar Oct 04 '22 17:10 PhilippThoelke

And yes, through the config file and model creation/loading you are currently limited to just one prior model, however, technically it would be possible to wrap the model in multiple prior models by just nesting them.

PhilippThoelke avatar Oct 04 '22 17:10 PhilippThoelke

Let's have a chat about it. I'll contact you directly

That would be great if all three of us can meet to discuss it.

You would also need to order the computation of the output heads as you are suggesting that the output of one head could depend on the prediction of another.

I think that's more general than what we need. For Coulomb the two outputs (per-atom energy and per-atom charge) can be predicted independently. Each one gets fed into a calculation that produces a per-sample energy, and the two get added together. They're only linked through the loss function.

peastman avatar Oct 04 '22 18:10 peastman

The prior model itself expects either the dataset object or the prior args to be present to correctly instantiate the object.

In this case it's not strictly either/or. Some attributes will be retrieved from the dataset (for example, the atomic numbers for atom types). Others always need to be specified in the parameter file (for example, the cutoff distance). The existing code assumes everything can be retrieved from the dataset:

https://github.com/torchmd/torchmd-net/blob/c1c4fcf80166e31661fe27753073a2eb3cd34f55/torchmdnet/scripts/train.py#L128

Any suggestions on the best way to handle this, so we can pass configuration options to the prior even when first creating it?

peastman avatar Oct 05 '22 23:10 peastman

You could simply pass args to the prior model in the same step as passing the dataset. If needed you could add an initial_prior_args of some sorts that can maybe take a dictionary (similar to how the dataset_arg currently works) with the required parameters for the prior model you want to create.

PhilippThoelke avatar Oct 05 '22 23:10 PhilippThoelke

With ZBL implemented in #134 and D2 in progress in #143, the next question is how to combine multiple priors. What if I want a model to include both ZBL and D2?

The internal changes can be very simple. We can just make TorchMD_Net.prior_model into a list. pre_reduce() and post_reduce() are each called on all the priors in order. So mostly this is a question of how the user specifies the prior models in the config file.

One option is to make prior_model and prior_args into lists:

prior_model:
  - ZBL
  - D2
prior_args:
  - {"cutoff_distance": 4.0, "max_num_neighbors": 50}
  - {"cutoff_distance": 10.0, "max_num_neighbors": 100}

A possibly cleaner option is to combine them:

prior_model:
  - ZBL:
      cutoff_distance: 4.0
      max_num_neighbors: 50
  - D2:
      cutoff_distance: 10.0
      max_num_neighbors: 100

peastman avatar Nov 01 '22 21:11 peastman

Sounds good! I think version two is way more readable, which I think is what we should be going for in the config files. I reckon both versions will be hard to implement in the CLI?

PhilippThoelke avatar Nov 01 '22 22:11 PhilippThoelke

We would have to come up with a syntax for specifying it on the command line and write our own code for parsing it.

peastman avatar Nov 01 '22 23:11 peastman

we could also drop the CLI and assume we initialize via a yaml file.

On Wed, Nov 2, 2022 at 12:03 AM Peter Eastman @.***> wrote:

We would have to come up with a syntax for specifying it on the command line and write our own code for parsing it.

— Reply to this email directly, view it on GitHub https://github.com/torchmd/torchmd-net/issues/26#issuecomment-1299333088, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUOXYE72YFFQ5B7SEKSTWGGOUZANCNFSM47GPOV2Q . You are receiving this because you modified the open/close state.Message ID: @.***>

giadefa avatar Nov 02 '22 00:11 giadefa

Being able to override settings from the command line is still useful, at least for some settings. For example, to train several copies of the same model with different random number seeds. But the prior model is probably one of the less important ones to override from the command line.

peastman avatar Nov 02 '22 03:11 peastman

I'm in favour for the second version.

We are already using more structure inputs for the data loaders. Regarding the CLI, we maintain the ability to override simple options like seed.

raimis avatar Nov 02 '22 10:11 raimis