deepmd-kit
[BUG] DPA2 Lammps on nopbc systems causes torchscript error
Bug summary
When using a trained and frozen DPA2 model to run LAMMPS on nopbc systems, the program immediately raises a TorchScript error. Notably, this issue does not occur with DPA1 and se_a models in PyTorch, and the DPA2 model functions correctly on pbc systems, even with one-dimensional pbc.
DeePMD-kit Version
3.0.0b3
Backend and its version
PyTorch v2.1.2
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
- Train and freeze a DPA2 model in `examples/water/dpa2`.
- Modify `p p p` to `f f f` in the LAMMPS input `in.lammps` and link the frozen model in `examples/water/lmp` (see the sketch after this list).
- Run `lmp -i in.lammps`.
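For reference, a minimal sketch of the same nopbc setup driven through the LAMMPS Python module instead of `lmp -i in.lammps`. This is an illustration only: it assumes LAMMPS is built as a shared library with the Python wrapper and the deepmd pair style, and that the data file `water.lmp` and model `frozen_model.pth` are the ones from `examples/water/lmp`.

```python
# Sketch of the nopbc run that triggers the error (assumptions noted above).
from lammps import lammps

lmp = lammps()
lmp.commands_string("""
units      metal
boundary   f f f                      # was 'p p p' in the original in.lammps
atom_style atomic
read_data  water.lmp
pair_style deepmd frozen_model.pth
pair_coeff * *
timestep   0.0005
run        1000                       # fails at setup with the TorchScript error below
""")
```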
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/deepmd/pt/model/model/transform_output.py", line 156, in forward_lower
vvi = split_vv1[_44]
svvi = split_svv1[_44]
_45 = _36(vvi, svvi, coord_ext, do_virial, do_atomic_virial, create_graph, )
~~~ <--- HERE
ffi, aviri, = _45
ffi0 = torch.unsqueeze(ffi, -2)
File "code/__torch__/deepmd/pt/model/model/transform_output.py", line 191, in task_deriv_one
faked_grad = torch.ones_like(energy)
lst = annotate(List[Optional[Tensor]], [faked_grad])
_52 = torch.autograd.grad([energy], [extended_coord], lst, True, create_graph)
~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_force = _52[0]
if torch.__isnot__(extended_force, None):
Traceback of TorchScript, original code (most recent call last):
File "/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/transform_output.py", line 138, in forward_lower
for vvi, svvi in zip(split_vv1, split_svv1):
# nf x nloc x 3, nf x nloc x 9
ffi, aviri = task_deriv_one(
~~~~~~~~~~~~~~ <--- HERE
vvi,
svvi,
File "/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/transform_output.py", line 80, in task_deriv_one
faked_grad = torch.ones_like(energy)
lst = torch.jit.annotate(List[Optional[torch.Tensor]], [faked_grad])
extended_force = torch.autograd.grad(
~~~~~~~~~~~~~~~~~~~ <--- HERE
[energy],
[extended_coord],
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
(/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run 1000
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
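The final RuntimeError is the generic message PyTorch raises when `max()` is reduced over a tensor with zero elements and no `dim` argument, which hints that an empty tensor (e.g. an empty ghost-atom mapping) reaches the force/virial derivative path. A minimal way to see the same message (an illustration, not the confirmed root cause):

```python
import torch

# Calling max() on an empty tensor without 'dim' reproduces the exact
# message from the log above.
try:
    torch.empty(0).max()
except RuntimeError as err:
    print(err)
    # max(): Expected reduction dim to be specified for input.numel() == 0.
    # Specify the reduction dim with the 'dim' argument.
```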
Steps to Reproduce
See above.
Further Information, Files, and Links
No response
So far this does not seem easy to resolve:
- LAMMPS using the PyTorch DPA1 and se_a models works for nopbc systems.
- `dp test` always works for nopbc systems (see the sketch below).
- LAMMPS using DPA2 still crashes, even with 0 repformer layers.
Maybe it's a bug with border_op in TorchScript on nopbc systems?
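For comparison, the check behind the "dp test always works for nopbc systems" observation can be reproduced through the Python interface. A minimal sketch; the frozen-model name, coordinates, and type indices are placeholders:

```python
import numpy as np
from deepmd.infer import DeepPot

# Evaluate the frozen DPA2 model on a single nopbc frame: cells=None selects
# the non-periodic code path. This succeeds, while the same model crashes
# when driven through the LAMMPS pair style.
dp = DeepPot("frozen_model.pth")
coords = np.array([[0.00, 0.00, 0.00,     # O
                    0.00, 0.00, 0.96,     # H
                    0.93, 0.00, -0.24]])  # H; shape (nframes, natoms * 3)
atom_types = [0, 1, 1]                    # indices into the model's type map
e, f, v = dp.eval(coords, None, atom_types)
print(e.shape, f.shape, v.shape)
```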
xref: #4092
#4220 indicates that a segfault is still thrown when running with MPI.
Fixed by #4237.