[BUG] Model converted from PT to TF backend could not run with TF
Bug summary
I am working on multi-task training with DeePMD-kit v3.0.0b0. After the freezing step I obtained a head that uses the se_a descriptor. I then ran dp --pt convert-backend frozen_model.pth frozen_model.pb (and also without --pt, with the same result) to get frozen_model.pb. However, the converted model cannot be used to run LAMMPS with either v2.2.9 or v3.0.0b0; both raise the following error:
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
INVALID_ARGUMENT: 2 root error(s) found.
(0) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
[[{{node Reshape_33}}]]
[[o_atom_energy/_37]]
(1) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
[[{{node Reshape_33}}]]
0 successful operations.
0 derived errors ignored.
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: TensorFlow Error: INVALID_ARGUMENT: 2 root error(s) found.
(0) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
[[{{node Reshape_33}}]]
[[o_atom_energy/_37]]
(1) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
[[{{node Reshape_33}}]]
0 successful operations.
0 derived errors ignored. (/public/groups/ai4ec/libs/conda/deepmd/3.0.0b0-cuda118/source/deepmd-kit/source/lmp/pair_deepmd.cpp:586)
Last command: run ${NSTEPS} upto
Something seems to go wrong during the model conversion; this looks like a bug.
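For reference, the reshape failure in the log is visible from the numbers alone: the incoming tensor size is not a multiple of the requested row width, so TensorFlow cannot perform the reshape. A quick arithmetic check (using only the numbers from the error message above):

```python
# Numbers taken from the TensorFlow error: the converted graph receives a
# tensor with 504000 values but tries to reshape it into rows of width 1608,
# which is only possible when 504000 % 1608 == 0.
tensor_size = 504000
row_width = 1608

remainder = tensor_size % row_width
print(remainder)  # 696 -> non-zero, so Reshape_33 must fail
```

This is consistent with a dimension being computed incorrectly somewhere in the converted graph rather than with a problem in the LAMMPS input itself.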
DeePMD-kit Version
DeePMD-kit v3.0.0b0
Backend and its version
PyTorch v2.0.0.post200, TensorFlow v2.14.0
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Running command:
dp --pt freeze -o frozen_model.pth --head ener
dp convert-backend frozen_model.pth frozen_model.pb
(the same error occurs when --pt is added to the convert-backend command).
The LAMMPS error log is attached below: slurm-2623892.txt
Steps to Reproduce
Please use the attached frozen_model.pth and the attached LAMMPS task to reproduce the bug.
Further Information, Files, and Links
No response
DescrptDPA1Compat has the wrong get_dim_out() when concat_output_tebd is true. cc @iProzd
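To illustrate the diagnosis above, here is a minimal sketch (hypothetical class and method names, not the real DeePMD-kit API) of how a get_dim_out() that ignores concat_output_tebd produces exactly this kind of reshape mismatch: the descriptor's actual output is wider than the reported dimension, so downstream reshapes use the wrong row width.

```python
# Hypothetical sketch of the reported bug: when the type-embedding output
# is concatenated onto the descriptor output (concat_output_tebd=True),
# get_dim_out() must include the embedding width; otherwise the size
# reported to the rest of the graph disagrees with the tensor it emits.
class DescriptorSketch:
    def __init__(self, neuron_out: int, tebd_dim: int, concat_output_tebd: bool):
        self.neuron_out = neuron_out            # width of the descriptor output
        self.tebd_dim = tebd_dim                # width of the type embedding
        self.concat_output_tebd = concat_output_tebd

    def get_dim_out_buggy(self) -> int:
        # Forgets the concatenated type embedding.
        return self.neuron_out

    def get_dim_out_fixed(self) -> int:
        # Accounts for the extra columns added by the concatenation.
        extra = self.tebd_dim if self.concat_output_tebd else 0
        return self.neuron_out + extra


d = DescriptorSketch(neuron_out=128, tebd_dim=8, concat_output_tebd=True)
print(d.get_dim_out_buggy(), d.get_dim_out_fixed())  # 128 136
```

With the buggy value, any reshape that assumes a row width of 128 fails on a tensor whose true row width is 136, which mirrors the "requires a multiple of ..." error in the log.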
Fixed in #4007.
Reopening: #4007 may not fix this issue; more validation is needed.
#4320 should fix the issue.