nequip
make `EnergyModel` picklable
Description
Models currently created with `EnergyModel` are not picklable. For instance, a `deepcopy` of such a model fails on the current master branch (see below for an example).
The problem comes from using activation functions taken directly from `torch.nn.functional` (which are not picklable) instead of their `torch.nn` counterparts (which are `torch.nn.Module`s and hence picklable).
```python
from copy import deepcopy

import torch
import yaml

from nequip.model import model_from_config
from nequip.scripts.deploy import _compile_for_deploy

conf = """
model:
  model_builders:
    - EnergyModel
    - ForceOutput
  BesselBasis_trainable: true
  PolynomialCutoff_p: 6
  chemical_symbol_to_type:
    H: 0
    O: 1
  num_types: 2
  type_names:
    - H
    - O
  avg_num_neighbors: null
  chemical_embedding_irreps_out: 16x0e
  conv_to_output_hidden_irreps_out: 32x0e
  feature_irreps_hidden: 32x0o + 32x0e + 16x1o + 16x1e
  invariant_layers: 2
  invariant_neurons: 64
  irreps_edge_sh: 0e + 1o
  nonlinearity_gates:
    e: ssp
    o: abs
  nonlinearity_scalars:
    e: ssp
    o: tanh
  nonlinearity_type: gate
  num_basis: 8
  num_layers: 2
  r_max: 2.5
  resnet: true
  use_sc: true
"""

config = yaml.safe_load(conf)
model = model_from_config(config['model'], initialize=True)
deepcopy(model)  # fails on the current master branch
```
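By contrast, a minimal sketch (not the actual nequip code) showing that the `torch.nn` module form of an activation survives both `deepcopy` and `pickle`:

```python
import pickle
from copy import deepcopy

import torch

# An activation stored as a torch.nn.Module instance is ordinary
# picklable state, so a model that holds it can be deep-copied
# and pickled without issue.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.SiLU(),  # module counterpart of torch.nn.functional.silu
)

clone = deepcopy(model)
restored = pickle.loads(pickle.dumps(model))
```

Both copies are independent modules that compute the same outputs as the original.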
Motivation and Context
A picklable model can be saved easily and used in a multiprocessing context.
How Has This Been Tested?
The example above now runs without errors, and the existing tests pass.
Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds or improves functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation improvement (updates to user guides, docstrings, or developer docs)
Checklist:
- [x] My code follows the code style of this project and has been formatted using `black`.
- [x] All new and existing tests passed.
- [ ] I have added tests that cover my changes (if relevant).
- [ ] The option documentation (`docs/options`) has been updated with new or changed options.
- [ ] I have updated `CHANGELOG.md`.
- [ ] I have updated the documentation (if relevant).
Thanks @felixmusil for the PR; that all looks good. Have you run explicit checks that `torch.nn.functional.x` and `torch.nn.X` give identical answers?
Since it seems like you're dealing with pretty low-level parts of the code, I just wanted to add that if you have any other questions, please feel free to reach out to us any time.
Best, Simon
Hi @felixmusil,
Thanks for the PR. Beyond Simon's point, while I'm happy to merge this, I want to be clear that we dropped the use of pickling for models / any commitment to support it as of 0.4.0.
The reason for this is that it carries a lot of internal noise, unnecessarily constrains development, and introduces a strong dependence of saved models on the internals of library code even when nothing fundamental has changed. (This is particularly relevant to e3nn, whose internals often change even when the interfaces are stable.)
The new supported approach is to rebuild the model from the configuration wherever you need it, then save the `state_dict()` of the model you want to send/save and reload it into the new model. This is how our trainer now saves model state; it has the benefit of being unaffected by internal library details, as well as being easy to open, inspect, and modify.
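A minimal sketch of that pattern, with a toy `torch.nn.Linear` standing in for a model built by `model_from_config` (an in-memory buffer stands in for a file on disk):

```python
import io

import torch

# Toy stand-in for a trained model; the same pattern applies to any
# torch.nn.Module, including one built from a nequip config.
model = torch.nn.Linear(4, 2)

# state_dict() is a plain mapping of parameter names to tensors,
# easy to save, inspect, and modify.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Elsewhere: rebuild the same architecture from the configuration,
# then load only the weights.
buffer.seek(0)
fresh = torch.nn.Linear(4, 2)
fresh.load_state_dict(torch.load(buffer))
```

After `load_state_dict`, the rebuilt model computes the same outputs as the original, with no model pickling involved.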
Actually, also, have you tested that this actually works once this fix is included? I remember there being other things broken with pickling as well...
(Another warning: the ability to pickle e3nn models is something I implemented in that library with a lot of hacks that are somewhat sensitive to what device you built the object on and what devices are available where you load it...)
Thanks for the quick reply.
Have you run explicit checks that `torch.nn.functional.x` and `torch.nn.X` give identical answers?
Running this snippet shows that the results are the same within machine precision.
```python
import torch

x = torch.linspace(-10, 10, 100000)
for func1, func2 in [
    (torch.nn.functional.silu, torch.nn.SiLU()),
    (torch.nn.functional.softplus, torch.nn.Softplus()),
    (torch.tanh, torch.nn.Tanh()),
]:
    y1 = func1(x)
    y2 = func2(x)
    print(torch.allclose(y1, y2))
```
Actually, also, have you tested that this actually works once this fix is included? I remember there being other things broken with pickling as well...
Yes, if you run the failing example above with this fix, it works.
I want to be clear that we dropped the use of pickling for models / any commitment to support it as of 0.4.0.
OK, I see. I understand your point about saving, and your new saving strategy is the way to go. However, pickling is essential for multiprocessing applications, e.g. multi-GPU training.
An alternative way to go about it would be to overload `__setstate__` and `__getstate__` at some strategic points and use the preferred serialization strategy there (see https://stackoverflow.com/questions/1939058/simple-example-of-use-of-setstate-and-getstate for an example).
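For instance, a hypothetical sketch of that idea (not proposed nequip internals), routing pickling through `state_dict()`:

```python
import pickle
from copy import deepcopy

import torch


class PicklableModel(torch.nn.Module):
    """Hypothetical sketch: pickle via constructor arguments plus
    state_dict(), so the pickle payload does not depend on internal
    module attributes."""

    def __init__(self, n_in=4, n_out=2):
        super().__init__()
        self.n_in, self.n_out = n_in, n_out
        self.net = torch.nn.Linear(n_in, n_out)

    def forward(self, x):
        return self.net(x)

    def __getstate__(self):
        # Serialize only what is needed to rebuild: config + tensor state.
        return {"n_in": self.n_in, "n_out": self.n_out,
                "weights": self.state_dict()}

    def __setstate__(self, state):
        # Rebuild from scratch, then restore the weights.
        self.__init__(state["n_in"], state["n_out"])
        self.load_state_dict(state["weights"])


model = PicklableModel()
restored = pickle.loads(pickle.dumps(model))
clone = deepcopy(model)  # deepcopy goes through the same hooks
```

Because unpickling reconstructs the module from its configuration, this scheme stays valid even when internal attribute layouts change between library versions.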
Running this snippet shows that the results are the same within machine precision.
Yes, if you run the failing example above with this fix, it works.
Great, thanks for checking.
However, pickling is essential for multiprocessing applications, e.g. multi-GPU training.
I'm not actually sure that this is true--- can you elaborate on your setup? We are beginning to look into multi-GPU training, and so far it seems to me that the most natural approach is for each process to build its own copy of the model and to then distribute rank 0's `model.state_dict()` (which is a simple `Dict[str, torch.Tensor]`) using the appropriate mechanism for your parallelism library. The other ranks can then load it with `model.load_state_dict(...)`.
This should also involve less IPC and IO than full pickled models, because then you don't have to serialize/de-serialize TorchScript blobs for generated code from e3nn. (The only way I was able to make e3nn's dynamically codegen'd modules pickle-compatible was to have them serialize their compiled, optimized TorchScript cores using a `__getstate__`/`__setstate__` approach like the one you mentioned; see here.)
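A single-process sketch of that pattern (purely illustrative: a real setup would send the dict through the parallelism library's broadcast mechanism rather than hand it over in memory):

```python
import torch


def build_model():
    # Stand-in for rebuilding the model from its config on every rank.
    return torch.nn.Linear(4, 2)


# Each "rank" builds its own randomly initialized copy of the model.
rank0_model = build_model()
rank1_model = build_model()

# Rank 0's state_dict() is a plain Dict[str, torch.Tensor], which is
# cheap to send compared to a fully pickled model.
weights = rank0_model.state_dict()

# The other ranks synchronize by loading it; no model pickling involved.
rank1_model.load_state_dict(weights)
```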
Hi @felixmusil just curious if the above worked for you or if this is still an issue :slightly_smiling_face:
Hi @felixmusil, it looks like you're making edits on this branch but the PR is still open.
We do not wrap the atoms in NequIP, since our neighborlists respect unwrapped positions, and wrapping them would not allow us to do simulations in which we want to keep track of unwrapped positions, e.g. computing the diffusivity.
Shall we close this PR?
Hi @felixmusil ,
I'm seeing a few more commits to this PR. Is this still something you are trying to do, or can we close it? (Was rebuilding the model in child processes sufficient for you?)
I'm also seeing that you made this "remove empty dim" commit. What is its purpose, and is it related to some issue with shapes you may have found?
Thanks!
Hi @Linux-cpp-lisp,
Sorry for disappearing on you. I reverted the unnecessary commits and updated the PR branch to the current `develop`, so it's finally ready to be merged, unless you have additional comments or requirements?
I'm not actually sure that this is true--- can you elaborate on your setup?
You are right; it's really tied to my using `pytorch_lightning` for training. It provides many nice features but also imposes some requirements, such as picklability for saving hyperparameter state.
We do not wrap the atoms in NequIP, since our neighborlists respect unwrapped positions, and wrapping them would not allow us to do simulations in which we want to keep track of unwrapped positions, e.g. computing the diffusivity.
I understand that and I removed this commit from the PR.
I'm also seeing that you made this "remove empty dim" commit. What is its purpose, and is it related to some issue with shapes you may have found?
For some reason, I have a corner case where an additional empty dimension (of size 1) was present which was not compatible with the rest of the computations.
@felixmusil sounds good, I'll take a look at this again soon here for merging.
Re multi-GPU support, take a look at this recent issue: https://github.com/mir-group/nequip/issues/210. Basically, unless you are really married to PyTorch Lightning, I would recommend looking into our `horovod` branch, which will be the "official" solution in the future.
For some reason, I have a corner case where an additional empty dimension (of size 1) was present which was not compatible with the rest of the computations.
Hm yeah, this sounds like it could be a bug on our part... are you able to reproduce this still on the latest NequIP `develop`, and if so, are you willing to share a reproducing example? Not a big deal, just want to be sure we aren't missing some bug.
Re multi-GPU support, take a look at this recent issue: https://github.com/mir-group/nequip/issues/210. Basically, unless you are really married to PyTorch Lightning, I would recommend looking into our horovod branch, which will be the "official" solution in the future.
Thanks for the suggestion, I will have a look at how you handled the problem. I have a few other models working with PyTorch Lightning though, so I would like to stick with it for a while longer.
Hm yeah this sounds like it could be a bug on our part... are you able to reproduce this still on the latest NequIP develop, and if so, are you willing to share a reproducing example? Not a big deal, just want to be sure we aren't missing some bug.
I will raise an issue/PR if I manage to reproduce it.
Hi @felixmusil ,
I came back to this again now after a while and thought I would commit it (I've added a commit off of the current `develop` with you as a coauthor; see the branch: https://github.com/mir-group/nequip/tree/pickle).
However, for reasons I'm not sure of (call stack/module stack depth?), this has a non-negligible negative effect on speed, so I will not be merging it. I'm closing this PR for that reason, but the `pickle` branch is there as a version of this change off the latest `develop` for those who absolutely need it.
(See above for a discussion of our supported `state_dict()` approach for saving models.)