deepks-kit

RuntimeError: No system available during deepks model training.

Open • Shangguanying992 opened this issue 1 year ago • 1 comment

While running the deepks model, I encountered the following error during the iteration process. Contents of err.iter:

    #data_train/group.00 no system.raw, infer meta from data
    #data_train/group.00 reset batch size to 0
    #ignore empty dataset: data_train/group.00
    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "/home/ljcgroup/.conda/envs/deepks/lib/python3.11/site-packages/deepks/model/train.py", line 303, in <module>
        cli()
      File "/home/ljcgroup/.conda/envs/deepks/lib/python3.11/site-packages/deepks/main.py", line 71, in train_cli
        main(**argdict)
      File "/home/ljcgroup/.conda/envs/deepks/lib/python3.11/site-packages/deepks/model/train.py", line 270, in main
        g_reader = GroupReader(train_paths, **data_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/ljcgroup/.conda/envs/deepks/lib/python3.11/site-packages/deepks/model/reader.py", line 207, in __init__
        raise RuntimeError("No system is available")
    RuntimeError: No system is available

Upon inspecting the iter.00/00.scf/log.data file, I noticed that none of the systems converged, so no data was available for training. The content of log.data is as follows:

    Training:
      Convergence:
        0 / 200 = 0.00000
      Energy:
        ME: 4.768164229534069
        MAE: 8.715617244519633
        MARE: 7.972705559173027
      Force:
        MAE: 1.0896364501030267
    Testing:
      Convergence:
        0 / 200 = 0.00000
      Energy:
        ME: 4.768164229534069
        MAE: 8.715617244519633
        MARE: 7.972705559173027
      Force:
        MAE: 1.0896364501030267

I have verified that the system configurations are reasonable. Another user has also encountered the same problem. Any guidance on resolving this issue would be appreciated.

Environment Information: The water_single example is functioning correctly. Thank you for your assistance!

Shangguanying992, Jan 23 '24 13:01

The main problem is that no configuration converged. I would try training with a smaller learning rate and fewer steps to see whether the convergence rate can exceed 0.9. If the rate is too low, it is hard to learn anything from the unconverged data.

y1xiaoc, Jan 24 '24 16:01
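
To make the 0.9 convergence check suggested above easy to repeat after each retry, here is a minimal sketch that pulls the converged / total counts out of the text of a log.data file. It is not part of deepks-kit: the helper names `convergence_rates` and `enough_converged` are hypothetical, and the parsing assumes the "Convergence: N / M = rate" layout seen in the log quoted in this issue.

```python
import re

# Assumption: each section header ("Training:" or "Testing:") is followed,
# possibly across line breaks, by "Convergence: <converged> / <total> = <rate>",
# as in the log.data content quoted in this issue.
_CONVERGENCE = re.compile(r"(Training|Testing):\s*Convergence:\s*(\d+)\s*/\s*(\d+)")


def convergence_rates(log_text):
    """Return {section name: converged fraction} parsed from log.data text."""
    rates = {}
    for section, n_conv, n_total in _CONVERGENCE.findall(log_text):
        total = int(n_total)
        # Guard against an empty dataset reporting "0 / 0".
        rates[section] = int(n_conv) / total if total else 0.0
    return rates


def enough_converged(log_text, threshold=0.9):
    """True only if every section found reaches the given convergence rate."""
    rates = convergence_rates(log_text)
    return bool(rates) and all(rate >= threshold for rate in rates.values())


# Sample text mirroring the convergence lines reported in this issue.
sample = """Training:
  Convergence:
    0 / 200 = 0.00000
Testing:
  Convergence:
    0 / 200 = 0.00000
"""

print(convergence_rates(sample))
print(enough_converged(sample))
```

To use it on a real run, read the file first, for example `Path("iter.00/00.scf/log.data").read_text()` with `pathlib.Path`, and pass the resulting string to `enough_converged`.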