
`lvl_f_lo` and `Outputs-model_devi` show `nan` during dpgen2 run

Open · Andy6M opened this issue · 4 comments

Issue Description

Dear developers, I encountered an issue while running dpgen2: the output of `dpgen2 status` shows `lvl_f_lo` as `nan` from iteration 1 onward, and the `Outputs-model_devi` file contains `nan` in all force-deviation fields. Details below:


1. dpgen2 status Output

100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 13.33it/s]
#   stage  id_stg.    iter.      accu.      cand.      fail.   lvl_f_lo lvl_f_hi
# Stage    0  --------------------
        0        0        0     0.7646     0.1323     0.1032     0.2118   0.5000
        0        1        1     0.8519     0.1481     0.0000        nan   0.5000
        0        2        2     0.8519     0.1481     0.0000        nan   0.5000
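If the run uses an adaptive lower trust level, `lvl_f_lo` is estimated from the distribution of `max_devi_f` over the explored frames, so a single `nan` in `Outputs-model_devi` propagates straight into the reported level. A rough numpy illustration of that propagation (hypothetical values, not dpgen2's actual code):

```python
import numpy as np

# Hypothetical max_devi_f values collected from one exploration iteration;
# one frame's deviation came out as nan.
max_devi_f = np.array([0.13, 0.18, np.nan, 0.15])

# A plain percentile propagates the nan -- this is how a single bad frame
# can turn the whole lvl_f_lo estimate into nan.
lo_naive = np.percentile(max_devi_f, 25)      # nan
# nanpercentile ignores the invalid entries instead.
lo_clean = np.nanpercentile(max_devi_f, 25)   # 0.14

print(lo_naive, lo_clean)
```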

2. Checking Output Files

  • Outputs-traj: The file /iter-000001--run-lmp-group/Outputs-traj contains valid data. Sample content:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
184
ITEM: BOX BOUNDS xy xz yz pp pp pp
...
  • Outputs-log: The log file also shows normal output:
LAMMPS (29 Aug 2024)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
...

3. Outputs-model_devi Contains nan

However, in the file /iter-000001--run-lmp-group/Outputs-model_devi, the avg_devi_f, max_devi_f, and min_devi_f fields all contain nan:

#       step         max_devi_v         min_devi_v         avg_devi_v         max_devi_f         min_devi_f         avg_devi_f
           0       0.000000e+00      1.797693e+308               -nan               -nan               -nan               -nan
          50       0.000000e+00      1.797693e+308               -nan               -nan               -nan               -nan
         100       0.000000e+00      1.797693e+308               -nan               -nan               -nan               -nan
...
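The `1.797693e+308` in `min_devi_v` is `DBL_MAX`, i.e. a running minimum that was never updated, and the `-nan` entries suggest the per-model predictions themselves were invalid. A small stdlib-only sketch for scanning a model_devi file for such sentinel rows (assuming the whitespace-separated column layout shown above):

```python
import math

DBL_MAX_ISH = 1e300  # min_devi_v stuck near DBL_MAX means the running min was never updated

def find_bad_rows(path):
    """Return (step, column_index) pairs whose value is nan or ~DBL_MAX."""
    bad = []
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.split()
            step = int(fields[0])
            for i, tok in enumerate(fields[1:], start=1):
                v = float(tok)  # float() parses "nan" and "-nan" as well
                if math.isnan(v) or abs(v) > DBL_MAX_ISH:
                    bad.append((step, i))
    return bad
```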

Expected Behavior

The lvl_f_lo value and the fields in Outputs-model_devi should not contain nan.


Steps to Reproduce

  1. Run the dpgen2 workflow with the provided input files.
  2. Observe the dpgen2 status output and check the corresponding output files (Outputs-traj, Outputs-log, Outputs-model_devi).

Environment

  • dpgen2 version: 0.0.8.dev138+g2877e2f
  • DeepMD-kit version: 3.0.0b4
  • Platform: Bohrium
  • Hardware: Bohrium V100*1

Thank you!


Andy6M avatar Nov 21 '24 06:11 Andy6M

Could you please check whether the very first configuration of the trajectory is valid? The model deviation of the first configuration is 1.797693e+308, which is unusual.
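One way to sanity-check the first configuration is to parse the first frame of the LAMMPS text dump and look at the smallest interatomic distance; a near-zero value indicates overlapping atoms, which would blow up the model predictions. A stdlib-only sketch, assuming the `ITEM:` format shown above with `x y z` columns in the `ITEM: ATOMS` header (periodic images are ignored here, so treat the result as a rough check):

```python
import itertools
import math

def min_distance_first_frame(dump_path):
    """Parse the first frame of a LAMMPS text dump and return the smallest
    pairwise distance (periodic images ignored)."""
    coords = []
    with open(dump_path) as f:
        lines = iter(f)
        for line in lines:
            if line.startswith("ITEM: ATOMS"):
                cols = line.split()[2:]  # e.g. ['id', 'type', 'x', 'y', 'z']
                ix, iy, iz = (cols.index(c) for c in ("x", "y", "z"))
                for atom_line in lines:
                    if atom_line.startswith("ITEM:"):  # next frame begins
                        break
                    t = atom_line.split()
                    coords.append((float(t[ix]), float(t[iy]), float(t[iz])))
                break
    return min(math.dist(a, b) for a, b in itertools.combinations(coords, 2))
```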

wanghan-iapcm avatar Nov 22 '24 00:11 wanghan-iapcm

Thank you for your response.

I checked the first configuration in iter-000000--run-lmp-000000 and examined the model_devi output. The min_devi_v values from all 17 LAMMPS runs are in the range of 1e-3 to 1e-2, with no significantly unusual deviations. Below is the relevant data from the model_devi file:

#       step         max_devi_v         min_devi_v         avg_devi_v         max_devi_f         min_devi_f         avg_devi_f
           0       2.813274e-02       1.236430e-03       1.493753e-02       1.294953e-01       1.464195e-02       5.355157e-02
          50       2.965354e-02       1.368058e-03       1.590805e-02       1.759406e-01       1.495048e-02       5.780621e-02
         100       2.838238e-02       8.987198e-04       1.456293e-02       3.704968e-01       1.995545e-02       6.168902e-02
         150       2.898888e-02       5.551084e-04       1.451846e-02       1.926066e-01       1.942737e-02       6.380191e-02
         200       2.859164e-02       1.362606e-03       1.502724e-02       1.506470e-01       1.966401e-02       6.074587e-02
         250       2.638238e-02       1.067859e-03       1.498770e-02       1.662188e-01       1.530976e-02       6.100733e-02
         300       3.096493e-02       1.346518e-03       1.662259e-02       1.287707e-01       9.415163e-03       6.065727e-02
         350       3.007970e-02       1.036756e-03       1.538427e-02       1.285052e-01       1.564005e-02       5.893896e-02
         400       2.847960e-02       1.593137e-03       1.575716e-02       1.448052e-01       2.011331e-02       5.807221e-02
         450       2.936185e-02       9.937412e-04       1.520512e-02       2.149763e-01       1.711258e-02       5.693326e-02
         500       2.865296e-02       1.305567e-03       1.580815e-02       1.537033e-01       1.503539e-02       5.641226e-02
         550       2.928813e-02       1.338427e-03       1.549072e-02       1.700321e-01       1.741005e-02       5.824711e-02
         600       3.045114e-02       1.881260e-03       1.695844e-02       1.468976e-01       1.580279e-02       6.133078e-02
         650       3.099498e-02       1.979718e-03       1.680618e-02       1.421608e-01       2.160320e-02       6.181459e-02
         700       3.274744e-02       1.210968e-03       1.797129e-02       1.831628e-01       1.325974e-02       6.625343e-02
         750       3.213804e-02       9.621239e-04       1.657447e-02       1.631825e-01       1.223954e-02       6.317999e-02
         800       2.790166e-02       6.471979e-04       1.548984e-02       1.701019e-01       1.454535e-02       6.039472e-02
         850       3.102427e-02       6.628643e-04       1.640118e-02       1.452935e-01       1.031704e-02       5.756204e-02
         900       3.017256e-02       1.184238e-03       1.628572e-02       1.617844e-01       1.736325e-02       5.396886e-02
         950       2.824300e-02       2.645454e-03       1.514339e-02       1.435542e-01       1.841572e-02       5.954156e-02
        1000       3.003184e-02       2.018773e-03       1.537587e-02       1.661056e-01       1.105155e-02       5.869294e-02

These values seem consistent and do not show any unusual spikes or extreme outliers. Let me know if there’s anything else I should check or if you need additional information.

Thank you for your kindness, Dr Wang!

Andy6M avatar Nov 22 '24 10:11 Andy6M

Iteration 0 looks great, and the issue happens at iteration 1. Please check the quality of the model trained at iteration 1 and the initial configurations used in the iteration-1 MD simulations.
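For background, the per-atom force model deviation is the standard deviation of the force vector over the model ensemble, reduced to max/min/avg over atoms; a single model emitting `nan` forces (e.g. a diverged training run) therefore poisons every `devi_f` column for that frame, which matches the pattern in `Outputs-model_devi`. A minimal numpy sketch of the definition (hypothetical force arrays, not the DeepMD-kit implementation):

```python
import numpy as np

def force_model_devi(forces):
    """forces: array of shape (n_models, n_atoms, 3). The per-atom deviation
    is the norm of the componentwise std over the ensemble; max/min/avg
    then reduce over atoms."""
    dev = np.linalg.norm(np.std(forces, axis=0), axis=-1)  # shape (n_atoms,)
    return dev.max(), dev.min(), dev.mean()

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 8, 3))   # 4 models, 8 atoms
print(force_model_devi(good))       # all finite

# One nan from a single model poisons every reduced column:
bad = good.copy()
bad[0, 0, 0] = np.nan
print(force_model_devi(bad))        # (nan, nan, nan)
```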

wanghan-iapcm avatar Nov 23 '24 05:11 wanghan-iapcm

Thank you for your response.

I checked the configuration in iter-000001--run-lmp-000000 and examined the model_devi output. Below is the relevant data from the model_devi file:

#       step         max_devi_v         min_devi_v         avg_devi_v         max_devi_f         min_devi_f         avg_devi_f
           0       1.767406e-02       1.923665e-03       9.226556e-03       1.744407e-01       1.733599e-02       5.988462e-02
          50       1.710528e-02       8.148907e-04       9.708625e-03       1.470154e-01       2.247298e-02       6.133791e-02
         100       1.648531e-02       7.205604e-04       8.514945e-03       2.283003e-01       1.731160e-02       6.284691e-02
         150       1.401277e-02       6.956906e-04       7.461165e-03       1.893883e-01       2.181478e-02       6.037625e-02
         200       1.122296e-02       8.039403e-04       6.388654e-03       1.278451e-01       2.261003e-02       6.208718e-02
         250       1.223884e-02       1.111902e-03       7.030710e-03       1.396129e-01       1.874105e-02       5.982687e-02
         300       1.274248e-02       4.881283e-04       7.156683e-03       1.538968e-01       1.926643e-02       5.871147e-02
         350       1.267785e-02       1.059288e-03       6.869723e-03       1.578075e-01       1.529583e-02       6.009399e-02
         400       1.606176e-02       9.227430e-04       8.678422e-03       1.700951e-01       1.348010e-02       6.259148e-02
         450       1.375797e-02       1.208132e-03       7.645918e-03       1.515705e-01       1.599934e-02       6.139468e-02
         500       1.452559e-02       1.445252e-03       8.073384e-03       2.422813e-01       2.254674e-02       6.374373e-02
         550       1.625274e-02       7.398480e-04       8.476174e-03       2.088750e-01       1.873348e-02       6.650365e-02
         600       1.597114e-02       1.153074e-03       9.110399e-03       2.131987e-01       2.254325e-02       6.565988e-02
         650       1.296594e-02       1.260483e-03       7.439792e-03       1.482121e-01       1.982959e-02       6.556800e-02
         700       1.314767e-02       1.855598e-03       7.715936e-03       1.393991e-01       2.082488e-02       6.037559e-02
         750       1.315291e-02       6.362476e-04       7.361491e-03       1.366818e-01       1.670856e-02       6.332969e-02
         800       1.396638e-02       8.883420e-04       7.119505e-03       1.791644e-01       1.593993e-02       6.285636e-02
         850       1.385305e-02       4.007663e-04       7.296780e-03       1.463808e-01       1.248193e-02       6.266973e-02
         900       1.382617e-02       1.950630e-03       7.863833e-03       1.453877e-01       2.297074e-02       6.257215e-02
         950       1.706155e-02       1.537314e-03       9.033908e-03       1.546805e-01       1.645491e-02       5.942675e-02
        1000       1.764165e-02       6.392657e-04       9.215594e-03       1.208138e-01       1.399952e-02       5.731977e-02

Thank you for your kindness, Dr Wang!

Andy6M avatar Nov 26 '24 04:11 Andy6M