
[BUG] DPA2 LAMMPS on nopbc systems causes TorchScript error

Open · opened by iProzd · 3 comments

Bug summary

When using a trained and frozen DPA2 model to run LAMMPS on nopbc systems, the program immediately raises a TorchScript error. Notably, this issue does not occur with DPA1 and se_a models in PyTorch, and the DPA2 model functions correctly on pbc systems, even with one-dimensional pbc.
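
For reference, "nopbc" here means a fully non-periodic LAMMPS boundary; the three cases from the summary map onto the boundary command roughly as follows (illustrative lines, not taken from the example input):

  boundary p p p    # fully periodic: DPA2 works
  boundary p f f    # periodic in one dimension only: DPA2 still works
  boundary f f f    # nopbc: DPA2 raises the TorchScript error below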

DeePMD-kit Version

3.0.0b3

Backend and its version

PyTorch v2.1.2

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

  1. Train and freeze a DPA2 model in examples/water/dpa2.
  2. In examples/water/lmp, change p p p to f f f in the LAMMPS input in.lammps and link the frozen model.
  3. Run lmp -i in.lammps (a command sketch is given under Steps to Reproduce below).
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/transform_output.py", line 156, in forward_lower
    vvi = split_vv1[_44]
    svvi = split_svv1[_44]
    _45 = _36(vvi, svvi, coord_ext, do_virial, do_atomic_virial, create_graph, )
          ~~~ <--- HERE
    ffi, aviri, = _45
    ffi0 = torch.unsqueeze(ffi, -2)
  File "code/__torch__/deepmd/pt/model/model/transform_output.py", line 191, in task_deriv_one
  faked_grad = torch.ones_like(energy)
  lst = annotate(List[Optional[Tensor]], [faked_grad])
  _52 = torch.autograd.grad([energy], [extended_coord], lst, True, create_graph)
        ~~~~~~~~~~~~~~~~~~~ <--- HERE
  extended_force = _52[0]
  if torch.__isnot__(extended_force, None):

Traceback of TorchScript, original code (most recent call last):
  File "/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/transform_output.py", line 138, in forward_lower
    for vvi, svvi in zip(split_vv1, split_svv1):
        # nf x nloc x 3, nf x nloc x 9
        ffi, aviri = task_deriv_one(
                     ~~~~~~~~~~~~~~ <--- HERE
            vvi,
            svvi,
  File "/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/transform_output.py", line 80, in task_deriv_one
    faked_grad = torch.ones_like(energy)
    lst = torch.jit.annotate(List[Optional[torch.Tensor]], [faked_grad])
    extended_force = torch.autograd.grad(
                     ~~~~~~~~~~~~~~~~~~~ <--- HERE
        [energy],
        [extended_coord],
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
 (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run             1000
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
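
The final RuntimeError is PyTorch's generic message for a reduction max() taken with no dim argument over an empty tensor, which suggests that some tensor in the scripted forward_lower / task_deriv_one path ends up with zero elements (for instance a mapping or neighbor tensor, which may be empty when there are no ghost atoms under nopbc; this is a guess, not confirmed from the traceback). The same error class can be reproduced outside DeePMD-kit with a minimal sketch:

  import torch

  empty = torch.empty(0)   # stand-in for a zero-element tensor reaching a reduction
  try:
      empty.max()          # max() without a dim on an empty tensor
  except RuntimeError as err:
      print(err)           # "max(): Expected reduction dim to be specified for input.numel() == 0. ..."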

Steps to Reproduce

See above.
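
A minimal command sketch of those steps, assuming the standard dp --pt training and freezing workflow; the training input name and the model file name referenced by in.lammps are assumptions, adjust them to the actual files in the examples directory:

  cd examples/water/dpa2
  dp --pt train input_torch.json        # training input name assumed
  dp --pt freeze -o frozen_model.pth
  cd ../lmp
  ln -sf ../dpa2/frozen_model.pth .     # link under the name in.lammps expects
  # edit in.lammps: change the boundary setting from p p p to f f f (nopbc)
  lmp -i in.lammps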

Further Information, Files, and Links

No response

iProzd · Sep 27 '24

This does not seem easy to resolve so far.

  1. LAMMPS with the PyTorch DPA1 and se_a models works on nopbc systems.
  2. dp test always works on nopbc systems.
  3. LAMMPS with DPA2 still crashes, even with zero repformer layers.

Maybe it is a bug with border_op in TorchScript on nopbc systems?
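
One way to narrow this down is to evaluate the frozen model directly through the Python interface on a nopbc system (cell set to None), which exercises the same forward_lower/autograd path but bypasses LAMMPS and its border communication op. A rough sketch, assuming the standard deepmd.infer.DeepPot interface accepts the PyTorch .pth model and using a toy 3-atom water input:

  import numpy as np
  from deepmd.infer import DeepPot

  dp = DeepPot("frozen_model.pth")               # frozen DPA2 model
  coord = np.array([[0.0, 0.0, 0.0,
                     0.0, 0.0, 0.96,
                     0.93, 0.0, -0.24]])         # 1 frame, 3 atoms, flattened (Angstrom)
  atype = [0, 1, 1]                              # O, H, H per the example's type_map (assumed)
  e, f, v = dp.eval(coord, None, atype)          # cell=None -> nopbc
  print(e, f.shape)

If this direct evaluation succeeds while the LAMMPS run fails, that points at the LAMMPS-side communication path (border_op) rather than the model's TorchScript itself.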

iProzd · Sep 27 '24

xref: #4092

njzjz · Sep 27 '24

#4220 indicates that a segfault is still thrown when running with MPI.

njzjz · Oct 15 '24

Fixed by #4237.

njzjz · Oct 23 '24