
`lvl_f_lo` and `Outputs-model_devi` show `nan` during dpgen2 run

Open · Andy6M opened this issue · 4 comments

Issue Description

Dear developers, I encountered an issue while running dpgen2: the output of `dpgen2 status` shows `lvl_f_lo` as `nan` from iteration 1 onward, and the `Outputs-model_devi` file contains `nan` in all force-deviation fields. Details below:


1. dpgen2 status Output

100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 13.33it/s]
#   stage  id_stg.    iter.      accu.      cand.      fail.   lvl_f_lo lvl_f_hi
# Stage    0  --------------------
        0        0        0     0.7646     0.1323     0.1032     0.2118   0.5000
        0        1        1     0.8519     0.1481     0.0000        nan   0.5000
        0        2        2     0.8519     0.1481     0.0000        nan   0.5000
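If the run uses an adaptive lower trust level, `lvl_f_lo` is estimated from the distribution of `max_devi_f` over the explored frames, so a single `nan` in `Outputs-model_devi` propagates straight into the reported level. A rough numpy illustration of that propagation (hypothetical values, not dpgen2's actual code):

```python
import numpy as np

# Hypothetical max_devi_f values collected from one exploration iteration;
# one frame's deviation came out as nan.
max_devi_f = np.array([0.13, 0.18, np.nan, 0.15])

# A plain percentile propagates the nan -- this is how a single bad frame
# can turn the whole lvl_f_lo estimate into nan.
lo_naive = np.percentile(max_devi_f, 25)      # nan
# nanpercentile ignores the invalid entries instead.
lo_clean = np.nanpercentile(max_devi_f, 25)   # 0.14

print(lo_naive, lo_clean)
```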

2. Checking Output Files

  • Outputs-traj: The file /iter-000001--run-lmp-group/Outputs-traj contains valid data. Sample content:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
184
ITEM: BOX BOUNDS xy xz yz pp pp pp
...
  • Outputs-log: The log file also shows normal output:
LAMMPS (29 Aug 2024)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
...

3. Outputs-model_devi Contains nan

However, in the file /iter-000001--run-lmp-group/Outputs-model_devi, the avg_devi_f, max_devi_f, and min_devi_f fields all contain nan:

#       step         max_devi_v         min_devi_v         avg_devi_v         max_devi_f         min_devi_f         avg_devi_f
           0       0.000000e+00      1.797693e+308               -nan               -nan               -nan               -nan
          50       0.000000e+00      1.797693e+308               -nan               -nan               -nan               -nan
         100       0.000000e+00      1.797693e+308               -nan               -nan               -nan               -nan
...
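The `1.797693e+308` in `min_devi_v` is `DBL_MAX`, i.e. a running minimum that was never updated, and the `-nan` entries suggest the per-model predictions themselves were invalid. A small stdlib-only sketch for scanning a model_devi file for such sentinel rows (assuming the whitespace-separated column layout shown above):

```python
import math

DBL_MAX_ISH = 1e300  # min_devi_v stuck near DBL_MAX means the running min was never updated

def find_bad_rows(path):
    """Return (step, column_index) pairs whose value is nan or ~DBL_MAX."""
    bad = []
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.split()
            step = int(fields[0])
            for i, tok in enumerate(fields[1:], start=1):
                v = float(tok)  # float() parses "nan" and "-nan" as well
                if math.isnan(v) or abs(v) > DBL_MAX_ISH:
                    bad.append((step, i))
    return bad
```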

Expected Behavior

The lvl_f_lo value and the fields in Outputs-model_devi should not contain nan.


Steps to Reproduce

  1. Run the dpgen2 workflow with the provided input files.
  2. Observe the dpgen2 status output and check the corresponding output files (Outputs-traj, Outputs-log, Outputs-model_devi).

Environment

  • dpgen2 version: 0.0.8.dev138+g2877e2f
  • DeepMD-kit version: 3.0.0b4
  • Platform: Bohrium
  • Hardware: Bohrium V100*1

Thank you!


Andy6M avatar Nov 21 '24 06:11 Andy6M

Could you please check whether the very first configuration of the trajectory is valid? The model deviation of the first configuration is 1.797693e+308, which is unusual.
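One way to sanity-check the first configuration is to parse the first frame of the LAMMPS text dump and look at the smallest interatomic distance; a near-zero value indicates overlapping atoms, which would blow up the model predictions. A stdlib-only sketch, assuming the `ITEM:` format shown above with `x y z` columns in the `ITEM: ATOMS` header (periodic images are ignored here, so treat the result as a rough check):

```python
import itertools
import math

def min_distance_first_frame(dump_path):
    """Parse the first frame of a LAMMPS text dump and return the smallest
    pairwise distance (periodic images ignored)."""
    coords = []
    with open(dump_path) as f:
        lines = iter(f)
        for line in lines:
            if line.startswith("ITEM: ATOMS"):
                cols = line.split()[2:]  # e.g. ['id', 'type', 'x', 'y', 'z']
                ix, iy, iz = (cols.index(c) for c in ("x", "y", "z"))
                for atom_line in lines:
                    if atom_line.startswith("ITEM:"):  # next frame begins
                        break
                    t = atom_line.split()
                    coords.append((float(t[ix]), float(t[iy]), float(t[iz])))
                break
    return min(math.dist(a, b) for a, b in itertools.combinations(coords, 2))
```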

wanghan-iapcm avatar Nov 22 '24 00:11 wanghan-iapcm

Thank you for your response.

I checked the first configuration in iter-000000--run-lmp-000000 and examined the model_devi output. The min_devi_v values from all 17 LAMMPS runs are in the range of 1e-3 to 1e-2, with no significantly unusual deviations. Below is the relevant data from the model_devi file:

#       step         max_devi_v         min_devi_v         avg_devi_v         max_devi_f         min_devi_f         avg_devi_f
           0       2.813274e-02       1.236430e-03       1.493753e-02       1.294953e-01       1.464195e-02       5.355157e-02
          50       2.965354e-02       1.368058e-03       1.590805e-02       1.759406e-01       1.495048e-02       5.780621e-02
         100       2.838238e-02       8.987198e-04       1.456293e-02       3.704968e-01       1.995545e-02       6.168902e-02
         150       2.898888e-02       5.551084e-04       1.451846e-02       1.926066e-01       1.942737e-02       6.380191e-02
         200       2.859164e-02       1.362606e-03       1.502724e-02       1.506470e-01       1.966401e-02       6.074587e-02
         250       2.638238e-02       1.067859e-03       1.498770e-02       1.662188e-01       1.530976e-02       6.100733e-02
         300       3.096493e-02       1.346518e-03       1.662259e-02       1.287707e-01       9.415163e-03       6.065727e-02
         350       3.007970e-02       1.036756e-03       1.538427e-02       1.285052e-01       1.564005e-02       5.893896e-02
         400       2.847960e-02       1.593137e-03       1.575716e-02       1.448052e-01       2.011331e-02       5.807221e-02
         450       2.936185e-02       9.937412e-04       1.520512e-02       2.149763e-01       1.711258e-02       5.693326e-02
         500       2.865296e-02       1.305567e-03       1.580815e-02       1.537033e-01       1.503539e-02       5.641226e-02
         550       2.928813e-02       1.338427e-03       1.549072e-02       1.700321e-01       1.741005e-02       5.824711e-02
         600       3.045114e-02       1.881260e-03       1.695844e-02       1.468976e-01       1.580279e-02       6.133078e-02
         650       3.099498e-02       1.979718e-03       1.680618e-02       1.421608e-01       2.160320e-02       6.181459e-02
         700       3.274744e-02       1.210968e-03       1.797129e-02       1.831628e-01       1.325974e-02       6.625343e-02
         750       3.213804e-02       9.621239e-04       1.657447e-02       1.631825e-01       1.223954e-02       6.317999e-02
         800       2.790166e-02       6.471979e-04       1.548984e-02       1.701019e-01       1.454535e-02       6.039472e-02
         850       3.102427e-02       6.628643e-04       1.640118e-02       1.452935e-01       1.031704e-02       5.756204e-02
         900       3.017256e-02       1.184238e-03       1.628572e-02       1.617844e-01       1.736325e-02       5.396886e-02
         950       2.824300e-02       2.645454e-03       1.514339e-02       1.435542e-01       1.841572e-02       5.954156e-02
        1000       3.003184e-02       2.018773e-03       1.537587e-02       1.661056e-01       1.105155e-02       5.869294e-02

These values seem consistent and do not show any unusual spikes or extreme outliers. Let me know if there’s anything else I should check or if you need additional information.

Thank you for your kindness, Dr Wang!

Andy6M avatar Nov 22 '24 10:11 Andy6M

Iteration 0 looks great, and the issue happens at iteration 1. Please check the quality of the model trained at iteration 1 and the initial configurations used in the iteration-1 MD simulations.
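For background, the per-atom force model deviation is the standard deviation of the force vector over the model ensemble, reduced to max/min/avg over atoms; a single model emitting `nan` forces (e.g. a diverged training run) therefore poisons every `devi_f` column for that frame, which matches the pattern in `Outputs-model_devi`. A minimal numpy sketch of the definition (hypothetical force arrays, not the DeepMD-kit implementation):

```python
import numpy as np

def force_model_devi(forces):
    """forces: array of shape (n_models, n_atoms, 3). The per-atom deviation
    is the norm of the componentwise std over the ensemble; max/min/avg
    then reduce over atoms."""
    dev = np.linalg.norm(np.std(forces, axis=0), axis=-1)  # shape (n_atoms,)
    return dev.max(), dev.min(), dev.mean()

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 8, 3))   # 4 models, 8 atoms
print(force_model_devi(good))       # all finite

# One nan from a single model poisons every reduced column:
bad = good.copy()
bad[0, 0, 0] = np.nan
print(force_model_devi(bad))        # (nan, nan, nan)
```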

wanghan-iapcm avatar Nov 23 '24 05:11 wanghan-iapcm

Thank you for your response.

I checked the configuration in iter-000001--run-lmp-000000 and examined the model_devi output. Below is the relevant data from the model_devi file:

#       step         max_devi_v         min_devi_v         avg_devi_v         max_devi_f         min_devi_f         avg_devi_f
           0       1.767406e-02       1.923665e-03       9.226556e-03       1.744407e-01       1.733599e-02       5.988462e-02
          50       1.710528e-02       8.148907e-04       9.708625e-03       1.470154e-01       2.247298e-02       6.133791e-02
         100       1.648531e-02       7.205604e-04       8.514945e-03       2.283003e-01       1.731160e-02       6.284691e-02
         150       1.401277e-02       6.956906e-04       7.461165e-03       1.893883e-01       2.181478e-02       6.037625e-02
         200       1.122296e-02       8.039403e-04       6.388654e-03       1.278451e-01       2.261003e-02       6.208718e-02
         250       1.223884e-02       1.111902e-03       7.030710e-03       1.396129e-01       1.874105e-02       5.982687e-02
         300       1.274248e-02       4.881283e-04       7.156683e-03       1.538968e-01       1.926643e-02       5.871147e-02
         350       1.267785e-02       1.059288e-03       6.869723e-03       1.578075e-01       1.529583e-02       6.009399e-02
         400       1.606176e-02       9.227430e-04       8.678422e-03       1.700951e-01       1.348010e-02       6.259148e-02
         450       1.375797e-02       1.208132e-03       7.645918e-03       1.515705e-01       1.599934e-02       6.139468e-02
         500       1.452559e-02       1.445252e-03       8.073384e-03       2.422813e-01       2.254674e-02       6.374373e-02
         550       1.625274e-02       7.398480e-04       8.476174e-03       2.088750e-01       1.873348e-02       6.650365e-02
         600       1.597114e-02       1.153074e-03       9.110399e-03       2.131987e-01       2.254325e-02       6.565988e-02
         650       1.296594e-02       1.260483e-03       7.439792e-03       1.482121e-01       1.982959e-02       6.556800e-02
         700       1.314767e-02       1.855598e-03       7.715936e-03       1.393991e-01       2.082488e-02       6.037559e-02
         750       1.315291e-02       6.362476e-04       7.361491e-03       1.366818e-01       1.670856e-02       6.332969e-02
         800       1.396638e-02       8.883420e-04       7.119505e-03       1.791644e-01       1.593993e-02       6.285636e-02
         850       1.385305e-02       4.007663e-04       7.296780e-03       1.463808e-01       1.248193e-02       6.266973e-02
         900       1.382617e-02       1.950630e-03       7.863833e-03       1.453877e-01       2.297074e-02       6.257215e-02
         950       1.706155e-02       1.537314e-03       9.033908e-03       1.546805e-01       1.645491e-02       5.942675e-02
        1000       1.764165e-02       6.392657e-04       9.215594e-03       1.208138e-01       1.399952e-02       5.731977e-02

Thank you for your kindness, Dr Wang!

Andy6M avatar Nov 26 '24 04:11 Andy6M