[BUG] DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Bug summary
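I first trained the model graph.pth on a GPU on cluster A, then used that model to run LAMMPS on the CPU nodes of cluster B. On both clusters, the GPU build of DeePMD-kit 3.1.0 was installed from the offline package. The LAMMPS MD run failed. My MD run has two stages: the first brings the system to equilibrium and the second applies shear. After many further tests, I found that the failures fall into two kinds. The first kind occurs partway through the first MD stage, with the following error message:
“477 14.671651 -1840.8705 -1840.4779 951095.46 1007.5899 957687.5 947820.98 947777.91 31.101126 2.3915695 13.546425 0 0 0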
478 14.848564 -1840.4616 -1840.0643 956489.71 1006.907 963265.36 953626.93 952576.83 31.093733 2.3909153 13.544167 0 0 0
ERROR on proc 34: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/ener_model.py", line 66, in forward_lower
comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
_6 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _6, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_7 = (self).get_fitting_net()
model_predict = annotate(Dict[str, Tensor], {})
File "code/torch/deepmd/pt/model/model/ener_model.py", line 232, in forward_common_lower
cc_ext, _40, fp, ap, input_prec, = _39
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_41 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 53, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 96, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
File "code/torch/deepmd/pt/model/descriptor/dpa3.py", line 53, in forward
node_ebd_inp = torch.slice(_2, 2)
repflows = self.repflows
_3 = (repflows).forward(nlist, extended_coord0, extended_atype, node_ebd_ext, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd, edge_ebd, h2, rot_mat, sw, = _3
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/repflows.py", line 326, in forward
_72 = torch.tensor(real_nloc, dtype=3, device=torch.device("cpu"))
_73 = torch.tensor(torch.sub(real_nall, real_nloc), dtype=3, device=torch.device("cpu"))
ret = ops.deepmd.border_op(_66, _67, _68, _69, _70, node_ebd0, _71, _72, _73)
~~~~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd_ext1 = torch.unsqueeze(ret[0], 0)
if has_spin:
Traceback of TorchScript, original code (most recent call last):
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/ener_model.py", line 119, in forward_lower
comm_dict: Optional[dict[str, torch.Tensor]] = None,
):
model_ret = self.forward_common_lower(
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/make_model.py", line 287, in forward_common_lower
)
del extended_coord, fparam, aparam
atomic_ret = self.atomic_model.forward_common_atomic(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
cc_ext,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 249, in forward_common_atomic
ext_atom_mask = self.make_atom_mask(extended_atype)
ret_dict = self.forward_atomic(
~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
torch.where(ext_atom_mask, extended_atype, 0),
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 238, in forward_atomic
if self.do_grad_r() or self.do_grad_c():
extended_coord.requires_grad_(True)
descriptor, rot_mat, g2, h2, sw = self.descriptor(
~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/dpa3.py", line 498, in forward
node_ebd_inp = node_ebd_ext[:, :nloc, :]
# repflows
node_ebd, edge_ebd, h2, rot_mat, sw = self.repflows(
~~~~~~~~~~~~~ <--- HERE
nlist,
extended_coord,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/repflows.py", line 599, in forward
assert "recv_num" in comm_dict
assert "communicator" in comm_dict
ret = torch.ops.deepmd.border_op(
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
comm_dict["send_list"],
comm_dict["send_proc"],
RuntimeError: index out of range in self (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1749640039377/work/source/lmp/pair_deepmd.cpp:253)
Last command: run 500”
(The step numbers 477 and 478 on the first two lines of this log, together with the final "run 500", show that the job terminated partway through the run.)
The second kind of error appears when, starting from the first case, I increase the skin value of the neighbor command. The first MD stage then runs to completion, but the job fails just as the second MD stage is about to start, with the following error message:
“Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.001
ERROR on proc 141: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/ener_model.py", line 66, in forward_lower
comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
_6 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _6, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_7 = (self).get_fitting_net()
model_predict = annotate(Dict[str, Tensor], {})
File "code/torch/deepmd/pt/model/model/ener_model.py", line 232, in forward_common_lower
cc_ext, _40, fp, ap, input_prec, = _39
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_41 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 53, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 96, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
File "code/torch/deepmd/pt/model/descriptor/dpa3.py", line 53, in forward
node_ebd_inp = torch.slice(_2, 2)
repflows = self.repflows
_3 = (repflows).forward(nlist, extended_coord0, extended_atype, node_ebd_ext, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd, edge_ebd, h2, rot_mat, sw, = _3
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/repflows.py", line 326, in forward
_72 = torch.tensor(real_nloc, dtype=3, device=torch.device("cpu"))
_73 = torch.tensor(torch.sub(real_nall, real_nloc), dtype=3, device=torch.device("cpu"))
ret = ops.deepmd.border_op(_66, _67, _68, _69, _70, node_ebd0, _71, _72, _73)
~~~~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd_ext1 = torch.unsqueeze(ret[0], 0)
if has_spin:
Traceback of TorchScript, original code (most recent call last):
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/ener_model.py", line 119, in forward_lower
comm_dict: Optional[dict[str, torch.Tensor]] = None,
):
model_ret = self.forward_common_lower(
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/make_model.py", line 287, in forward_common_lower
)
del extended_coord, fparam, aparam
atomic_ret = self.atomic_model.forward_common_atomic(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
cc_ext,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 249, in forward_common_atomic
ext_atom_mask = self.make_atom_mask(extended_atype)
ret_dict = self.forward_atomic(
~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
torch.where(ext_atom_mask, extended_atype, 0),
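File "/public/ERROR on proc 162: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.”
As mentioned above, I ran several tests, and all of them failed:
1. Descriptor: models trained with dpa3 and with se_e2_a, both using the GPU build of version 3.1.0 with the PyTorch backend, failed during MD.
2. Training the model on a GPU on cluster A and then running LAMMPS on the same GPU also failed.
3. Running MD with the same model, using the LAMMPS from either the CPU build or the GPU build of DeePMD-kit 3.1.0, failed in both cases.
After many attempts, one LAMMPS run did succeed. I trained graph.pth with the PyTorch backend and the se_e2_a descriptor, and graph.pth failed when used for MD in LAMMPS. I then converted the model from the PyTorch backend to the TensorFlow backend with the command dp convert-backend graph.pth graph.pb and ran MD in LAMMPS with the converted model, and this time the run completed. This suggests that .pth models have a problem during MD simulation in LAMMPS. The dpa3 descriptor is noticeably more accurate: on my dataset it is roughly twice as accurate as se_e2_a. However, dpa3 only supports the PyTorch backend, so I cannot convert it to the TensorFlow backend and use it for MD in LAMMPS.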
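The workaround described above (converting the se_e2_a model from the PyTorch backend to the TensorFlow backend) can be scripted. The following is only a minimal sketch: the dp convert-backend command and the file names graph.pth and graph.pb come from the report, while the helper names build_convert_command and convert_if_possible are hypothetical and introduced purely for illustration. As noted in the report, this route does not apply to dpa3 models.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src: str, dst: str) -> list[str]:
    """Return the argv for `dp convert-backend src dst` (command quoted in the report)."""
    # Hypothetical helper: it only assembles the command, it does not run it.
    return ["dp", "convert-backend", src, dst]


def convert_if_possible(src: str = "graph.pth", dst: str = "graph.pb") -> bool:
    """Run the conversion only if `dp` is on PATH and the source model exists.

    Returns True when the conversion command was executed, False when skipped.
    """
    if shutil.which("dp") is None or not Path(src).exists():
        return False
    # check=True raises CalledProcessError if `dp` exits with a non-zero status.
    subprocess.run(build_convert_command(src, dst), check=True)
    return True
```

Calling convert_if_possible() in the directory that holds graph.pth would then produce graph.pb for use in pair_style deepmd, assuming the installed dp executable supports both backends.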
DeePMD-kit Version
3.1.0
Backend and its version
PyTorch v2.6.0-gUnknown
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
LAMMPS run command:
mpirun -n 192 /data/home/liuchang/deepmd-kit/bin/lmp -i in.tin -l lmp.log
LAMMPS input file in.tin:
#######################initialization
units metal
dimension 3
boundary p p p
atom_style atomic
#######################system definition
read_data structure.lmp
#######################simulation settings
# set potential
pair_style deepmd graph.pb
pair_coeff * * C B
neighbor 3.5 nsq
neigh_modify every 1 delay 0 check yes
# thermo output
thermo 1
thermo_style custom step temp pe etotal press vol pxx pyy pzz lx ly lz xy xz yz
thermo_modify flush yes lost error line one
timestep 0.001
#initial relax
dump RE all custom 1 int.traj id type x y z
velocity all create 10.0 32546 dist gaussian mom yes rot yes
fix MD all npt temp 10 10 0.1 x 0 1000000 1 y 0 1000000 1 z 0 1000000 1 #xy 0.0 0.0 1.0 xz 0.0 0.0 1.0 yz 0.0 0.0 1.0 #iso 0 0 1.0
run 500
unfix MD
undump RE
#load change_box all triclinic
reset_timestep 0
fix MD1 all npt temp 10 10 0.1 x 1000000 1000000 1 y 1000000 1000000 1 z 1000000 1000000 1
fix MD2 all deform 1 xz erate 0.002 units lattice remap x
dump dump all custom 50 load.traj id type x y z
# The 0.000005 in strain and the 0.005 in "erate 0.005" are related: the latter is 1000 times the former, because a picosecond is 1000 times a femtosecond
variable strain equal step*0.000002
variable p1 equal "v_strain"
variable px equal "-pxx/10000"
variable py equal "-pyy/10000"
variable pz equal "-pzz/10000"
variable psyz equal "-pyz/10000"
variable psxz equal "-pxz/10000"
variable psxy equal "-pxy/10000"
fix out1 all print 50 "${p1} ${px} ${py} ${pz} ${psyz} ${psxz} ${psxy}" file Stress.dat screen no
run 2000
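The numeric factors in the second stage of the input script can be checked with a short calculation. This is a sketch under stated assumptions: in LAMMPS metal units time is in picoseconds and pressure is in bar, the erate keyword of fix deform is an engineering strain rate per unit time, and 1 GPa equals 10000 bar. The function names strain_at and stress_gpa are hypothetical and exist only for this illustration.

```python
import math

TIMESTEP_PS = 0.001    # `timestep 0.001` in the input script (metal units: ps)
ERATE_PER_PS = 0.002   # `fix MD2 all deform 1 xz erate 0.002 ...`
BAR_PER_GPA = 10000.0  # metal units report pressure in bar

# Engineering shear strain accumulated in one MD step.
STRAIN_PER_STEP = ERATE_PER_PS * TIMESTEP_PS


def strain_at(step: int) -> float:
    """Strain after `step` steps, mirroring `variable strain equal step*0.000002`."""
    return step * STRAIN_PER_STEP


def stress_gpa(pressure_bar: float) -> float:
    """Mirror of the script's `-pxx/10000`: flip the sign and convert bar to GPa."""
    return -pressure_bar / BAR_PER_GPA


if __name__ == "__main__":
    print("strain per step:", STRAIN_PER_STEP)
    print("strain after the 2000-step run:", strain_at(2000))
    print("stress for pxx = 951095.46 bar:", stress_gpa(951095.46), "GPa")
```

The per-step factor is the strain rate multiplied by the timestep, which is why the comment in the script describes a ratio of 1000 between the two numbers.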
Steps to Reproduce
DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Further Information, Files, and Links
No response
Hi @myyelishu, could you:
- Also provide the error log of the se_a model with the PyTorch backend during LAMMPS MD.
- Try the newest version of DeePMD-kit (version 3.1.1): https://github.com/deepmodeling/deepmd-kit/releases/tag/v3.1.1, which fixed a possibly related issue (fix(cc): use insert_or_assign instead of insert, https://github.com/deepmodeling/deepmd-kit/pull/4844).
BTW, please use English in GitHub discussions.
Both the se_a and dpa3 models trained with the PyTorch backend hit the two error cases described above during LAMMPS MD. The error therefore seems to be related to the PyTorch backend rather than to the descriptor. I will try version 3.1.1 next and, if it solves the problem, report back in this thread. Thank you for the reminder.