Bug summary

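I first trained the model graph.pth on a GPU on cluster A, then used that model to run LAMMPS on the CPU nodes of cluster B. Both clusters have the GPU build of DeePMD-kit 3.1.0, installed from the offline package. LAMMPS failed during the MD run. My MD run has two stages: the first equilibrates the system and the second applies shear. After many further tests, I found two kinds of failure. The first occurs partway through the first MD stage, with this error:
“ 477 14.671651 -1840.8705 -1840.4779 951095.46 1007.5899 957687.5 947820.98 947777.91 31.101126 2.3915695 13.546425 0 0 0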
478 14.848564 -1840.4616 -1840.0643 956489.71 1006.907 963265.36 953626.93 952576.83 31.093733 2.3909153 13.544167 0 0 0
ERROR on proc 34: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter. Traceback of TorchScript, serialized code (most recent call last): File "code/torch/deepmd/pt/model/model/ener_model.py", line 66, in forward_lower comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]: _6 = (self).need_sorted_nlist_for_lower() model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _6, ) ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE _7 = (self).get_fitting_net() model_predict = annotate(Dict[str, Tensor], {}) File "code/torch/deepmd/pt/model/model/ener_model.py", line 232, in forward_common_lower cc_ext, _40, fp, ap, input_prec, = _39 atomic_model = self.atomic_model atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, ) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE _41 = (self).atomic_output_def() training = self.training File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 53, in forward_common_atomic ext_atom_mask = (self).make_atom_mask(extended_atype, ) _3 = torch.where(ext_atom_mask, extended_atype, 0) ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, ) ~~~~~~~~~~~~~~~~~~~~ <--- HERE ret_dict0 = (self).apply_out_stat(ret_dict, atype, ) _4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc) File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 96, in forward_atomic pass descriptor = self.descriptor _16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, ) ~~~~~~~~~~~~~~~~~~~ <--- HERE descriptor0, rot_mat, g2, h2, sw, = _16 enable_eval_descriptor_hook = self.enable_eval_descriptor_hook File "code/torch/deepmd/pt/model/descriptor/dpa3.py", line 53, in forward node_ebd_inp = torch.slice(_2, 2) repflows = self.repflows _3 = 
(repflows).forward(nlist, extended_coord0, extended_atype, node_ebd_ext, mapping, comm_dict, ) ~~~~~~~~~~~~~~~~~ <--- HERE node_ebd, edge_ebd, h2, rot_mat, sw, = _3 concat_output_tebd = self.concat_output_tebd File "code/torch/deepmd/pt/model/descriptor/repflows.py", line 326, in forward _72 = torch.tensor(real_nloc, dtype=3, device=torch.device("cpu")) _73 = torch.tensor(torch.sub(real_nall, real_nloc), dtype=3, device=torch.device("cpu")) ret = ops.deepmd.border_op(_66, _67, _68, _69, _70, node_ebd0, _71, _72, _73) ~~~~~~~~~~~~~~~~~~~~ <--- HERE node_ebd_ext1 = torch.unsqueeze(ret[0], 0) if has_spin:

Traceback of TorchScript, original code (most recent call last): File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/ener_model.py", line 119, in forward_lower comm_dict: Optional[dict[str, torch.Tensor]] = None, ): model_ret = self.forward_common_lower( ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE extended_coord, extended_atype, File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/make_model.py", line 287, in forward_common_lower ) del extended_coord, fparam, aparam atomic_ret = self.atomic_model.forward_common_atomic( ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE cc_ext, extended_atype, File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 249, in forward_common_atomic

    ext_atom_mask = self.make_atom_mask(extended_atype)
    ret_dict = self.forward_atomic(
               ~~~~~~~~~~~~~~~~~~~ <--- HERE
        extended_coord,
        torch.where(ext_atom_mask, extended_atype, 0),

File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 238, in forward_atomic if self.do_grad_r() or self.do_grad_c(): extended_coord.requires_grad_(True) descriptor, rot_mat, g2, h2, sw = self.descriptor( ~~~~~~~~~~~~~~~ <--- HERE extended_coord, extended_atype, File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/dpa3.py", line 498, in forward node_ebd_inp = node_ebd_ext[:, :nloc, :] # repflows node_ebd, edge_ebd, h2, rot_mat, sw = self.repflows( ~~~~~~~~~~~~~ <--- HERE nlist, extended_coord, File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/repflows.py", line 599, in forward assert "recv_num" in comm_dict assert "communicator" in comm_dict ret = torch.ops.deepmd.border_op( ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE comm_dict["send_list"], comm_dict["send_proc"], RuntimeError: index out of range in self (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1749640039377/work/source/lmp/pair_deepmd.cpp:253) Last command: run 500”

(The thermo lines for steps 477 and 478 at the top of this log, together with the final "Last command: run 500", show that the run stopped partway through.) The second failure appears when, starting from the same setup, I increase the skin value of the neighbor command. The first MD stage then runs to completion, but the run fails just as the second MD stage is about to start, with this error:

“Setting up Verlet run ... Unit style : metal Current step : 0 Time step : 0.001 ERROR on proc 141: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last): File "code/torch/deepmd/pt/model/model/ener_model.py", line 66, in forward_lower comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]: _6 = (self).need_sorted_nlist_for_lower() model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _6, ) ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE _7 = (self).get_fitting_net() model_predict = annotate(Dict[str, Tensor], {}) File "code/torch/deepmd/pt/model/model/ener_model.py", line 232, in forward_common_lower cc_ext, _40, fp, ap, input_prec, = _39 atomic_model = self.atomic_model atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, ) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE _41 = (self).atomic_output_def() training = self.training File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 53, in forward_common_atomic ext_atom_mask = (self).make_atom_mask(extended_atype, ) _3 = torch.where(ext_atom_mask, extended_atype, 0) ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, ) ~~~~~~~~~~~~~~~~~~~~ <--- HERE ret_dict0 = (self).apply_out_stat(ret_dict, atype, ) _4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc) File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 96, in forward_atomic pass descriptor = self.descriptor _16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, ) ~~~~~~~~~~~~~~~~~~~ <--- HERE descriptor0, rot_mat, g2, h2, sw, = _16 enable_eval_descriptor_hook = self.enable_eval_descriptor_hook File "code/torch/deepmd/pt/model/descriptor/dpa3.py", line 53, in forward node_ebd_inp = torch.slice(_2, 2) repflows = self.repflows _3 = (repflows).forward(nlist, extended_coord0, extended_atype, node_ebd_ext, mapping, comm_dict, ) ~~~~~~~~~~~~~~~~~ <--- HERE node_ebd, edge_ebd, h2, rot_mat, sw, = _3 
concat_output_tebd = self.concat_output_tebd File "code/torch/deepmd/pt/model/descriptor/repflows.py", line 326, in forward _72 = torch.tensor(real_nloc, dtype=3, device=torch.device("cpu")) _73 = torch.tensor(torch.sub(real_nall, real_nloc), dtype=3, device=torch.device("cpu")) ret = ops.deepmd.border_op(_66, _67, _68, _69, _70, node_ebd0, _71, _72, _73) ~~~~~~~~~~~~~~~~~~~~ <--- HERE node_ebd_ext1 = torch.unsqueeze(ret[0], 0) if has_spin:

Traceback of TorchScript, original code (most recent call last): File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/ener_model.py", line 119, in forward_lower comm_dict: Optional[dict[str, torch.Tensor]] = None, ): model_ret = self.forward_common_lower( ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE extended_coord, extended_atype, File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/make_model.py", line 287, in forward_common_lower ) del extended_coord, fparam, aparam atomic_ret = self.atomic_model.forward_common_atomic( ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE cc_ext, extended_atype, File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 249, in forward_common_atomic

    ext_atom_mask = self.make_atom_mask(extended_atype)
    ret_dict = self.forward_atomic(
               ~~~~~~~~~~~~~~~~~~~ <--- HERE
        extended_coord,
        torch.where(ext_atom_mask, extended_atype, 0),

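File "/public/ERROR on proc 162: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.”

As mentioned above, I ran several tests, and all of them failed:

1. Descriptor: models trained with dpa3 and with se_e2_a, both using the PyTorch backend of the 3.1.0 GPU build, failed when running MD.
2. Training the model on a GPU on cluster A and then running LAMMPS on that same GPU also failed.
3. Running MD with the same model, using either the CPU build of DeePMD-kit 3.1.0 or the GPU build of LAMMPS, failed in both cases.

After many attempts, one LAMMPS run did succeed. I trained graph.pth with the PyTorch backend and the se_e2_a descriptor, and it failed when used for MD in LAMMPS. I then converted the model from the PyTorch backend to the TensorFlow backend with the command dp convert-backend graph.pth graph.pb and gave the converted model to LAMMPS, and this time the MD run completed. This suggests that .pth models have a problem when used for MD in LAMMPS. The dpa3 descriptor is clearly more accurate, though: on my dataset it is about twice as accurate as se_e2_a. But dpa3 only supports the PyTorch backend, so I cannot convert it to the TensorFlow backend and use it for MD in LAMMPS.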
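For anyone triaging logs like the ones above, a small script can pull out which MPI ranks reported the error and what the last completed thermo row was. The following is a minimal sketch, not part of the original report. The function names are hypothetical, the column order is copied from the thermo_style line of the input file in this report, and the rank count of 192 comes from the mpirun command in this report.

```python
import re

# Column order copied from the thermo_style line of the reported input file.
THERMO_COLUMNS = "step temp pe etotal press vol pxx pyy pzz lx ly lz xy xz yz".split()

# Matches the "ERROR on proc N:" prefix that LAMMPS prints for a failing rank.
ERROR_RANK = re.compile(r"ERROR on proc (\d+):")


def failing_ranks(log_text):
    """Return the sorted, de-duplicated MPI ranks that printed an error."""
    return sorted({int(rank) for rank in ERROR_RANK.findall(log_text)})


def last_thermo_row(log_text):
    """Return the last line that parses as a full numeric thermo row, as a dict."""
    last = None
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) != len(THERMO_COLUMNS):
            continue
        try:
            values = [float(field) for field in fields]
        except ValueError:
            continue  # not a numeric thermo line
        last = dict(zip(THERMO_COLUMNS, values))
    return last


def volume_per_rank(row, n_ranks):
    """Average box volume per MPI rank (cubic angstroms in metal units)."""
    return row["vol"] / n_ranks


# Excerpt of the first error log in this report (error message shortened).
sample = """\
477 14.671651 -1840.8705 -1840.4779 951095.46 1007.5899 957687.5 947820.98 947777.91 31.101126 2.3915695 13.546425 0 0 0
478 14.848564 -1840.4616 -1840.0643 956489.71 1006.907 963265.36 953626.93 952576.83 31.093733 2.3909153 13.544167 0 0 0
ERROR on proc 34: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error
"""

row = last_thermo_row(sample)
print("failing ranks:", failing_ranks(sample))
print("last completed step:", int(row["step"]))
print("box (lx, ly, lz):", row["lx"], row["ly"], row["lz"])
print("volume per rank for 192 ranks:", round(volume_per_rank(row, 192), 2))
```

With the numbers in this report, the box volume of about 1007 cubic angstroms spread over 192 ranks gives only about 5 cubic angstroms per rank on average. This does not establish the cause of the border_op failure, but rerunning with far fewer ranks may be an inexpensive way to check whether the decomposition matters.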

DeePMD-kit Version

3.1.0

Backend and its version

PyTorch v2.6.0-gUnknown

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

LAMMPS run command: mpirun -n 192 /data/home/liuchang/deepmd-kit/bin/lmp -i in.tin -l lmp.log

LAMMPS input file in.tin:

#######################initialization

units metal

dimension 3

boundary p p p

atom_style atomic

#######################system definition

read_data structure.lmp

#######################simulation settings

# set potential

pair_style deepmd graph.pb

pair_coeff * * C B

neighbor 3.5 nsq

neigh_modify every 1 delay 0 check yes

# thermo output

thermo 1

thermo_style custom step temp pe etotal press vol pxx pyy pzz lx ly lz xy xz yz

thermo_modify flush yes lost error line one

timestep 0.001

#initial relax

dump RE all custom 1 int.traj id type x y z

velocity all create 10.0 32546 dist gaussian mom yes rot yes

fix MD all npt temp 10 10 0.1 x 0 1000000 1 y 0 1000000 1 z 0 1000000 1 #xy 0.0 0.0 1.0 xz 0.0 0.0 1.0 yz 0.0 0.0 1.0 #iso 0 0 1.0

run 500

unfix MD

undump RE

#load

change_box all triclinic

reset_timestep 0

fix MD1 all npt temp 10 10 0.1 x 1000000 1000000 1 y 1000000 1000000 1 z 1000000 1000000 1

fix MD2 all deform 1 xz erate 0.002 units lattice remap x

dump dump all custom 50 load.traj id type x y z

#Relation between the 0.000005 in strain and the 0.005 in erate 0.005: the latter is 1000 times the former, because a picosecond is 1000 times a femtosecond

variable strain equal step*0.000002

variable p1 equal "v_strain"

variable px equal "-pxx/10000"

variable py equal "-pyy/10000"

variable pz equal "-pzz/10000"

variable psyz equal "-pyz/10000"

variable psxz equal "-pxz/10000"

variable psxy equal "-pxy/10000"

fix out1 all print 50 "${p1} ${px} ${py} ${pz} ${psyz} ${psxz} ${psxy}" file Stress.dat screen no

run 2000
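The factor 0.000002 in the strain variable can be checked against the timestep and the erate value used in the input file. The sketch below is illustrative only and its function names are hypothetical. It assumes that the strain accumulated per step at a constant engineering strain rate is the rate multiplied by the timestep, with the timestep in picoseconds (metal units).

```python
import math

TIMESTEP_PS = 0.001   # "timestep 0.001" in metal units, i.e. 1 fs
ERATE_PER_PS = 0.002  # "fix MD2 all deform 1 xz erate 0.002 ..."


def strain_per_step(erate, timestep):
    """Engineering strain accumulated in one MD step at a constant rate."""
    return erate * timestep


def strain_at(step, erate=ERATE_PER_PS, timestep=TIMESTEP_PS):
    """Accumulated strain after `step` steps, as in 'step*0.000002'."""
    return step * strain_per_step(erate, timestep)


per_step = strain_per_step(ERATE_PER_PS, TIMESTEP_PS)
print("strain per step:", per_step)
print("matches the 0.000002 factor:", math.isclose(per_step, 0.000002))
print("strain at the end of 'run 2000':", strain_at(2000))
```

Under that assumption, the factor of 1000 mentioned in the input file comment is the 0.001 ps timestep, and the 2000-step shear stage ends at a strain of about 0.004.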

Steps to Reproduce

DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.

Further Information, Files, and Links

No response

Dec 10 '25 02:12 myyelishu

Hi @myyelishu , could you:

Also provide the error log of the PyTorch se_a model during LAMMPS MD.
Try the newest version of DeePMD-kit (version 3.1.1): https://github.com/deepmodeling/deepmd-kit/releases/tag/v3.1.1 which fixed an issue that may be related (fix(cc): use insert_or_assign instead of insert, https://github.com/deepmodeling/deepmd-kit/pull/4844).

BTW, please use English in GitHub discussions.
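Because both clusters were installed offline, it may help to confirm which DeePMD-kit version each environment actually has before and after upgrading. The following is a minimal sketch using only the Python standard library. It assumes the Python distribution is registered under the name deepmd-kit, and the helper names are hypothetical. It also reports only the Python package version, which is not necessarily the version the lmp binary is linked against.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn a string such as '3.1.0' into a tuple (3, 1, 0) for comparison."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(group or 0) for group in match.groups())


def installed_version(dist_name="deepmd-kit"):
    """Return the installed version string, or None if no metadata is found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def needs_upgrade(current, minimum="3.1.1"):
    """True if `current` is older than the release suggested in this thread."""
    return parse_version(current) < parse_version(minimum)


found = installed_version()
if found is None:
    print("deepmd-kit metadata not found in this environment")
elif needs_upgrade(found):
    print(found, "is older than 3.1.1")
else:
    print(found, "is at least 3.1.1")
```

Running this in the environment used for training and in the one used for LAMMPS on each cluster shows whether they agree.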

Dec 10 '25 04:12 iProzd

First, both the se_a and dpa3 models trained with the PyTorch backend hit the two errors described above when running MD in LAMMPS, so the problem appears to lie with the PyTorch backend rather than with the descriptor. I will try version 3.1.1 next and report back here if it solves the problem. Thank you for the reminder.

Dec 10 '25 05:12 myyelishu

[BUG] _DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
