`lvl_f_lo` and `Outputs-model_devi` show `nan` during dpgen2 run
Issue Description
Dear developers, I encountered an issue while running dpgen2: the output of `dpgen2 status` shows `lvl_f_lo` as `nan`, and the `Outputs-model_devi` file contains `nan` values for all relevant fields. Details below:
1. `dpgen2 status` Output
```
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 13.33it/s]
# stage id_stg. iter. accu. cand. fail. lvl_f_lo lvl_f_hi
# Stage 0 --------------------
0 0 0 0.7646 0.1323 0.1032 0.2118 0.5000
0 1 1 0.8519 0.1481 0.0000 nan 0.5000
0 2 2 0.8519 0.1481 0.0000 nan 0.5000
```
2. Checking Output Files
- `Outputs-traj`: the file `/iter-000001--run-lmp-group/Outputs-traj` contains valid data. Sample content:
```
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
184
ITEM: BOX BOUNDS xy xz yz pp pp pp
...
```
- `Outputs-log`: the log file also shows normal output:
```
LAMMPS (29 Aug 2024)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
...
```
3. `Outputs-model_devi` Contains `nan`
However, in the file `/iter-000001--run-lmp-group/Outputs-model_devi`, the `avg_devi_f`, `max_devi_f`, and `min_devi_f` fields all contain `nan`:
```
# step max_devi_v min_devi_v avg_devi_v max_devi_f min_devi_f avg_devi_f
0 0.000000e+00 1.797693e+308 -nan -nan -nan -nan
50 0.000000e+00 1.797693e+308 -nan -nan -nan -nan
100 0.000000e+00 1.797693e+308 -nan -nan -nan -nan
...
```
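For quick triage, a small script like the following can flag rows containing `nan` or the `1.797693e+308` sentinel (which is essentially `DBL_MAX` and suggests a min-reduction that never saw a finite value). This is a sketch that assumes the standard seven-column `model_devi` layout shown above:

```python
import math

DBL_MAX = 1.797693e+308  # sentinel value seen in the min_devi_v column


def bad_rows(lines):
    """Return (step, column_index) pairs whose value is nan or ~DBL_MAX.

    Column indices follow the model_devi header order after `step`:
    0 = max_devi_v, 1 = min_devi_v, ..., 5 = avg_devi_f.
    """
    bad = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split()
        step = int(fields[0])
        values = [float(x) for x in fields[1:]]  # "-nan" parses as nan
        for i, v in enumerate(values):
            if math.isnan(v) or v >= DBL_MAX:
                bad.append((step, i))
    return bad


# Example on the output shown above:
sample = [
    "# step max_devi_v min_devi_v avg_devi_v max_devi_f min_devi_f avg_devi_f",
    "0 0.000000e+00 1.797693e+308 -nan -nan -nan -nan",
]
print(bad_rows(sample))  # -> [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5)]
```

Running this over each `Outputs-model_devi` file shows which columns go bad and from which step onward.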
Expected Behavior
The `lvl_f_lo` value and the fields in `Outputs-model_devi` should not contain `nan`.
Steps to Reproduce
- Run the dpgen2 workflow with the provided input files.
- Observe the `dpgen2 status` output and check the corresponding output files (`Outputs-traj`, `Outputs-log`, `Outputs-model_devi`).
Environment
- dpgen2 version: 0.0.8.dev138+g2877e2f
- DeepMD-kit version: 3.0.0b4
- Operating platform: Bohrium
- Hardware: Bohrium V100*1
Thank you!
Could you please check whether the very first configuration of the trajectory is a valid configuration? The model deviation of the first conf is `1.797693e+308`, which is unusual.
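One way to sanity-check that first frame is a rough screen like the one below: it takes the x/y/z coordinates parsed from the first `ITEM: ATOMS` block of the dump and checks that they are all finite and that no two atoms sit on top of each other. This is a minimal sketch; it ignores periodic images and the box, so treat the distance check only as a first filter:

```python
import itertools
import math


def check_frame(coords, min_dist=0.5):
    """Rough validity screen for one frame: all coordinates finite,
    and no two atoms closer than `min_dist` (Angstrom), ignoring PBC."""
    if any(not math.isfinite(c) for xyz in coords for c in xyz):
        return "non-finite coordinate found"
    for (i, a), (j, b) in itertools.combinations(enumerate(coords), 2):
        d = math.dist(a, b)
        if d < min_dist:
            return f"atoms {i} and {j} overlap (d = {d:.3f} A)"
    return "ok"


# e.g. feed it the coordinate columns of the first frame:
print(check_frame([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]))  # -> ok
print(check_frame([(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]))  # reports an overlap
```

Overlapping or non-finite atoms in the initial configuration would explain extreme or `nan` model deviations from the very first step.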
Thank you for your response.
I checked the first configuration in `iter-000000--run-lmp-000000` and examined the `model_devi` output. The `min_devi_v` values from all 17 LAMMPS runs fall in the range 1e-3 to 1e-2, with no significantly unusual deviations. Below is the relevant data from the `model_devi` file:
```
# step max_devi_v min_devi_v avg_devi_v max_devi_f min_devi_f avg_devi_f
0 2.813274e-02 1.236430e-03 1.493753e-02 1.294953e-01 1.464195e-02 5.355157e-02
50 2.965354e-02 1.368058e-03 1.590805e-02 1.759406e-01 1.495048e-02 5.780621e-02
100 2.838238e-02 8.987198e-04 1.456293e-02 3.704968e-01 1.995545e-02 6.168902e-02
150 2.898888e-02 5.551084e-04 1.451846e-02 1.926066e-01 1.942737e-02 6.380191e-02
200 2.859164e-02 1.362606e-03 1.502724e-02 1.506470e-01 1.966401e-02 6.074587e-02
250 2.638238e-02 1.067859e-03 1.498770e-02 1.662188e-01 1.530976e-02 6.100733e-02
300 3.096493e-02 1.346518e-03 1.662259e-02 1.287707e-01 9.415163e-03 6.065727e-02
350 3.007970e-02 1.036756e-03 1.538427e-02 1.285052e-01 1.564005e-02 5.893896e-02
400 2.847960e-02 1.593137e-03 1.575716e-02 1.448052e-01 2.011331e-02 5.807221e-02
450 2.936185e-02 9.937412e-04 1.520512e-02 2.149763e-01 1.711258e-02 5.693326e-02
500 2.865296e-02 1.305567e-03 1.580815e-02 1.537033e-01 1.503539e-02 5.641226e-02
550 2.928813e-02 1.338427e-03 1.549072e-02 1.700321e-01 1.741005e-02 5.824711e-02
600 3.045114e-02 1.881260e-03 1.695844e-02 1.468976e-01 1.580279e-02 6.133078e-02
650 3.099498e-02 1.979718e-03 1.680618e-02 1.421608e-01 2.160320e-02 6.181459e-02
700 3.274744e-02 1.210968e-03 1.797129e-02 1.831628e-01 1.325974e-02 6.625343e-02
750 3.213804e-02 9.621239e-04 1.657447e-02 1.631825e-01 1.223954e-02 6.317999e-02
800 2.790166e-02 6.471979e-04 1.548984e-02 1.701019e-01 1.454535e-02 6.039472e-02
850 3.102427e-02 6.628643e-04 1.640118e-02 1.452935e-01 1.031704e-02 5.756204e-02
900 3.017256e-02 1.184238e-03 1.628572e-02 1.617844e-01 1.736325e-02 5.396886e-02
950 2.824300e-02 2.645454e-03 1.514339e-02 1.435542e-01 1.841572e-02 5.954156e-02
1000 3.003184e-02 2.018773e-03 1.537587e-02 1.661056e-01 1.105155e-02 5.869294e-02
```
These values seem consistent and do not show any unusual spikes or extreme outliers. Let me know if there’s anything else I should check or if you need additional information.
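To check all 17 runs at once rather than one file at a time, a short sketch like the following could summarize each task's `min_devi_v` column. The glob pattern and file name are assumptions about the dpgen2 working-directory layout, so adjust them to the actual paths:

```python
import glob
import math


def column_stats(lines, col):
    """Min/max of one model_devi column (0 = max_devi_v, ..., 5 = avg_devi_f)."""
    vals = [float(line.split()[col + 1])
            for line in lines
            if line.strip() and not line.startswith("#")]
    return min(vals), max(vals)


# hypothetical layout: one Outputs-model_devi file per LAMMPS task directory
for path in sorted(glob.glob("iter-000000--run-lmp-*/Outputs-model_devi")):
    with open(path) as fh:
        lo, hi = column_stats(fh, col=1)  # min_devi_v column
    flag = "  <-- suspicious" if (math.isnan(hi) or hi >= 1e2) else ""
    print(f"{path}: min_devi_v in [{lo:.3e}, {hi:.3e}]{flag}")
```

Any task whose range includes `nan` or the `1.797693e+308` sentinel would stand out immediately.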
Thank you for your kindness, Dr Wang!
Iteration 0 looks good, and the issue happens at iteration 1. Please check the quality of the model trained at iteration 1 and the initial configurations used in the iteration-1 MD simulations.
Thank you for your response.
I checked the configuration in `iter-000001--run-lmp-000000` and examined the `model_devi` output. Below is the relevant data from the `model_devi` file:
```
# step max_devi_v min_devi_v avg_devi_v max_devi_f min_devi_f avg_devi_f
0 1.767406e-02 1.923665e-03 9.226556e-03 1.744407e-01 1.733599e-02 5.988462e-02
50 1.710528e-02 8.148907e-04 9.708625e-03 1.470154e-01 2.247298e-02 6.133791e-02
100 1.648531e-02 7.205604e-04 8.514945e-03 2.283003e-01 1.731160e-02 6.284691e-02
150 1.401277e-02 6.956906e-04 7.461165e-03 1.893883e-01 2.181478e-02 6.037625e-02
200 1.122296e-02 8.039403e-04 6.388654e-03 1.278451e-01 2.261003e-02 6.208718e-02
250 1.223884e-02 1.111902e-03 7.030710e-03 1.396129e-01 1.874105e-02 5.982687e-02
300 1.274248e-02 4.881283e-04 7.156683e-03 1.538968e-01 1.926643e-02 5.871147e-02
350 1.267785e-02 1.059288e-03 6.869723e-03 1.578075e-01 1.529583e-02 6.009399e-02
400 1.606176e-02 9.227430e-04 8.678422e-03 1.700951e-01 1.348010e-02 6.259148e-02
450 1.375797e-02 1.208132e-03 7.645918e-03 1.515705e-01 1.599934e-02 6.139468e-02
500 1.452559e-02 1.445252e-03 8.073384e-03 2.422813e-01 2.254674e-02 6.374373e-02
550 1.625274e-02 7.398480e-04 8.476174e-03 2.088750e-01 1.873348e-02 6.650365e-02
600 1.597114e-02 1.153074e-03 9.110399e-03 2.131987e-01 2.254325e-02 6.565988e-02
650 1.296594e-02 1.260483e-03 7.439792e-03 1.482121e-01 1.982959e-02 6.556800e-02
700 1.314767e-02 1.855598e-03 7.715936e-03 1.393991e-01 2.082488e-02 6.037559e-02
750 1.315291e-02 6.362476e-04 7.361491e-03 1.366818e-01 1.670856e-02 6.332969e-02
800 1.396638e-02 8.883420e-04 7.119505e-03 1.791644e-01 1.593993e-02 6.285636e-02
850 1.385305e-02 4.007663e-04 7.296780e-03 1.463808e-01 1.248193e-02 6.266973e-02
900 1.382617e-02 1.950630e-03 7.863833e-03 1.453877e-01 2.297074e-02 6.257215e-02
950 1.706155e-02 1.537314e-03 9.033908e-03 1.546805e-01 1.645491e-02 5.942675e-02
1000 1.764165e-02 6.392657e-04 9.215594e-03 1.208138e-01 1.399952e-02 5.731977e-02
```
Thank you for your kindness, Dr Wang!