[BUG] Model converted from PT to TF backend could not run with TF
Bug summary
I am working on multi-task training with DeePMD-kit v3.0.0b0. After the freezing step I obtained a head that uses the se_a descriptor. I then ran dp --pt convert-backend frozen_model.pth frozen_model.pb (and also without --pt, with the same result) to get frozen_model.pb. However, the converted model cannot be used to run LAMMPS with either v2.2.9 or v3.0.0b0; both raise the following error:
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
INVALID_ARGUMENT: 2 root error(s) found.
(0) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
[[{{node Reshape_33}}]]
[[o_atom_energy/_37]]
(1) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
[[{{node Reshape_33}}]]
0 successful operations.
0 derived errors ignored.
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: TensorFlow Error: INVALID_ARGUMENT: 2 root error(s) found.
(0) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
[[{{node Reshape_33}}]]
[[o_atom_energy/_37]]
(1) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
[[{{node Reshape_33}}]]
0 successful operations.
0 derived errors ignored. (/public/groups/ai4ec/libs/conda/deepmd/3.0.0b0-cuda118/source/deepmd-kit/source/lmp/pair_deepmd.cpp:586)
Last command: run ${NSTEPS} upto
Something seems to go wrong during the model conversion; this looks like a bug.
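For reference, the reshape failure in the log is visible from the numbers alone: the incoming tensor size is not a multiple of the requested row width, so TensorFlow cannot perform the reshape. A quick arithmetic check (using only the numbers from the error message above):

```python
# Numbers taken from the TensorFlow error: the converted graph receives a
# tensor with 504000 values but tries to reshape it into rows of width 1608,
# which is only possible when 504000 % 1608 == 0.
tensor_size = 504000
row_width = 1608

remainder = tensor_size % row_width
print(remainder)  # 696 -> non-zero, so Reshape_33 must fail
```

This is consistent with a dimension being computed incorrectly somewhere in the converted graph rather than with a problem in the LAMMPS input itself.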
DeePMD-kit Version
DeePMD-kit v3.0.0b0
Backend and its version
PyTorch v2.0.0.post200, TensorFlow v2.14.0
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Running command:
dp --pt freeze -o frozen_model.pth --head ener
dp convert-backend frozen_model.pth frozen_model.pb
(the same error occurs when --pt is added to the convert-backend command).
The LAMMPS error log is attached below: slurm-2623892.txt
Steps to Reproduce
Please use the attached frozen_model.pth and the attached LAMMPS task to reproduce the bug.
Further Information, Files, and Links
No response
DescrptDPA1Compat has the wrong get_dim_out() when concat_output_tebd is true. cc @iProzd
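To illustrate the diagnosis above, here is a minimal sketch (hypothetical class and method names, not the real DeePMD-kit API) of how a get_dim_out() that ignores concat_output_tebd produces exactly this kind of reshape mismatch: the descriptor's actual output is wider than the reported dimension, so downstream reshapes use the wrong row width.

```python
# Hypothetical sketch of the reported bug: when the type-embedding output
# is concatenated onto the descriptor output (concat_output_tebd=True),
# get_dim_out() must include the embedding width; otherwise the size
# reported to the rest of the graph disagrees with the tensor it emits.
class DescriptorSketch:
    def __init__(self, neuron_out: int, tebd_dim: int, concat_output_tebd: bool):
        self.neuron_out = neuron_out            # width of the descriptor output
        self.tebd_dim = tebd_dim                # width of the type embedding
        self.concat_output_tebd = concat_output_tebd

    def get_dim_out_buggy(self) -> int:
        # Forgets the concatenated type embedding.
        return self.neuron_out

    def get_dim_out_fixed(self) -> int:
        # Accounts for the extra columns added by the concatenation.
        extra = self.tebd_dim if self.concat_output_tebd else 0
        return self.neuron_out + extra


d = DescriptorSketch(neuron_out=128, tebd_dim=8, concat_output_tebd=True)
print(d.get_dim_out_buggy(), d.get_dim_out_fixed())  # 128 136
```

With the buggy value, any reshape that assumes a row width of 128 fails on a tensor whose true row width is 136, which mirrors the "requires a multiple of ..." error in the log.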
Fixed in #4007.
Reopening: #4007 may not fix this issue; more validation is needed.
#4320 should fix the issue.