
[BUG] DPA2 LAMMPS on nopbc systems causes TorchScript error

Open · opened by iProzd · 3 comments

Bug summary

When using a trained and frozen DPA2 model to run LAMMPS on nopbc systems, the program immediately raises a TorchScript error. Notably, this issue does not occur with DPA1 and se_a models in PyTorch, and the DPA2 model functions correctly on pbc systems, even with one-dimensional pbc.
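
For reference, "nopbc" here means a fully non-periodic LAMMPS boundary; the three cases from the summary map onto the boundary command roughly as follows (illustrative lines, not taken from the example input):

  boundary p p p    # fully periodic: DPA2 works
  boundary p f f    # periodic in one dimension only: DPA2 still works
  boundary f f f    # nopbc: DPA2 raises the TorchScript error below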

DeePMD-kit Version

3.0.0b3

Backend and its version

PyTorch v2.1.2

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

  1. Train and freeze a DPA2 model in examples/water/dpa2.
  2. In examples/water/lmp, change p p p to f f f in the LAMMPS input in.lammps and link the frozen model.
  3. Run lmp -i in.lammps (a command sketch is given under Steps to Reproduce below).
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/transform_output.py", line 156, in forward_lower
    vvi = split_vv1[_44]
    svvi = split_svv1[_44]
    _45 = _36(vvi, svvi, coord_ext, do_virial, do_atomic_virial, create_graph, )
          ~~~ <--- HERE
    ffi, aviri, = _45
    ffi0 = torch.unsqueeze(ffi, -2)
  File "code/__torch__/deepmd/pt/model/model/transform_output.py", line 191, in task_deriv_one
  faked_grad = torch.ones_like(energy)
  lst = annotate(List[Optional[Tensor]], [faked_grad])
  _52 = torch.autograd.grad([energy], [extended_coord], lst, True, create_graph)
        ~~~~~~~~~~~~~~~~~~~ <--- HERE
  extended_force = _52[0]
  if torch.__isnot__(extended_force, None):

Traceback of TorchScript, original code (most recent call last):
  File "/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/transform_output.py", line 138, in forward_lower
    for vvi, svvi in zip(split_vv1, split_svv1):
        # nf x nloc x 3, nf x nloc x 9
        ffi, aviri = task_deriv_one(
                     ~~~~~~~~~~~~~~ <--- HERE
            vvi,
            svvi,
  File "/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/transform_output.py", line 80, in task_deriv_one
    faked_grad = torch.ones_like(energy)
    lst = torch.jit.annotate(List[Optional[torch.Tensor]], [faked_grad])
    extended_force = torch.autograd.grad(
                     ~~~~~~~~~~~~~~~~~~~ <--- HERE
        [energy],
        [extended_coord],
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
 (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run             1000
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
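
The final RuntimeError is PyTorch's generic message for a reduction max() taken with no dim argument over an empty tensor, which suggests that some tensor in the scripted forward_lower / task_deriv_one path ends up with zero elements (for instance a mapping or neighbor tensor, which may be empty when there are no ghost atoms under nopbc; this is a guess, not confirmed from the traceback). The same error class can be reproduced outside DeePMD-kit with a minimal sketch:

  import torch

  empty = torch.empty(0)   # stand-in for a zero-element tensor reaching a reduction
  try:
      empty.max()          # max() without a dim on an empty tensor
  except RuntimeError as err:
      print(err)           # "max(): Expected reduction dim to be specified for input.numel() == 0. ..."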

Steps to Reproduce

See above.
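
A minimal command sketch of those steps, assuming the standard dp --pt training and freezing workflow; the training input name and the model file name referenced by in.lammps are assumptions, adjust them to the actual files in the examples directory:

  cd examples/water/dpa2
  dp --pt train input_torch.json        # training input name assumed
  dp --pt freeze -o frozen_model.pth
  cd ../lmp
  ln -sf ../dpa2/frozen_model.pth .     # link under the name in.lammps expects
  # edit in.lammps: change the boundary setting from p p p to f f f (nopbc)
  lmp -i in.lammps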

Further Information, Files, and Links

No response

iProzd · Sep 27 '24

This does not seem easy to resolve so far.

  1. LAMMPS with the PyTorch DPA1 and se_a models works on nopbc systems.
  2. dp test always works on nopbc systems.
  3. LAMMPS with DPA2 still crashes, even with zero repformer layers.

Maybe it is a bug with border_op in TorchScript on nopbc systems?
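
One way to narrow this down is to evaluate the frozen model directly through the Python interface on a nopbc system (cell set to None), which exercises the same forward_lower/autograd path but bypasses LAMMPS and its border communication op. A rough sketch, assuming the standard deepmd.infer.DeepPot interface accepts the PyTorch .pth model and using a toy 3-atom water input:

  import numpy as np
  from deepmd.infer import DeepPot

  dp = DeepPot("frozen_model.pth")               # frozen DPA2 model
  coord = np.array([[0.0, 0.0, 0.0,
                     0.0, 0.0, 0.96,
                     0.93, 0.0, -0.24]])         # 1 frame, 3 atoms, flattened (Angstrom)
  atype = [0, 1, 1]                              # O, H, H per the example's type_map (assumed)
  e, f, v = dp.eval(coord, None, atype)          # cell=None -> nopbc
  print(e, f.shape)

If this direct evaluation succeeds while the LAMMPS run fails, that points at the LAMMPS-side communication path (border_op) rather than the model's TorchScript itself.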

iProzd · Sep 27 '24

xref: #4092

njzjz · Sep 27 '24

#4220 indicates that a segfault is still thrown when running with MPI.

njzjz · Oct 15 '24

Fixed by #4237.

njzjz · Oct 23 '24