
Loss of Convergence in Iterative Loop (iter.00) Despite Initial Convergence Rates

Open · Shangguanying992 opened this issue 11 months ago · 9 comments

While using DeePKS, I observed an anomaly in the convergence rates between the initial training (iter.init) and the subsequent iterative loop (iter.00).

During the initial training (iter.init), the convergence rates were satisfactory: 0.77 on the training set and 0.88 on the test set. However, upon entering the iterative loop (iter.00), the convergence rates dropped to 0, making further computation impossible. Another user has reported this issue as well. [Screenshot: 2024-03-06 145959]

Additional context: DeePKS-kit with ABACUS 3.5.3. The DeePKS water_single example runs correctly on my machine.

Thank you so much for your help in addressing this issue.

Shangguanying992 avatar Mar 06 '24 07:03 Shangguanying992

Thanks for the question. This happens occasionally when the first iteration is trained so "hard" that it overfits the data. I would suggest reducing the number of training epochs for the first iteration.

y1xiaoc avatar Mar 07 '24 20:03 y1xiaoc

I tried changing the value of n_epoch from 500 to 10 and then 100, but the situation does not seem to change. Are there any other solutions?

Shangguanying992 avatar Mar 11 '24 11:03 Shangguanying992

To clarify, you may want to reduce the training length for the init training (iter.init), which is controlled by the parameters in init_train, like here in the example. I think the one you are changing is for the following iterations.
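For example, something like this minimal sketch could lower the init-train epoch count; it assumes the water_single-style nesting init_train → train_args → n_epoch (with train_input controlling the later iterations), so adjust the keys if your params.yaml is organized differently:

```python
# Sketch: reduce the init-train epoch count in params.yaml.
# Assumes the water_single-style nesting init_train -> train_args -> n_epoch;
# adjust the keys if your params.yaml is laid out differently.
import yaml  # PyYAML

with open("params.yaml") as f:
    params = yaml.safe_load(f)

# init_train controls the initial training (iter.init);
# train_input (or its equivalent in your file) controls iter.00, iter.01, ...
params["init_train"]["train_args"]["n_epoch"] = 100   # reduced from the example value
params["init_train"]["train_args"]["start_lr"] = 1e-4  # keep or lower as needed

with open("params.yaml", "w") as f:
    yaml.safe_dump(params, f, sort_keys=False)
```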

y1xiaoc avatar Mar 12 '24 02:03 y1xiaoc

I changed it in params.yaml, as in the attached file. Did I understand correctly? Or do I need to add a file named args.yaml? (attached: params.yaml.zip)

Shangguanying992 avatar Mar 12 '24 07:03 Shangguanying992

You may need to use the args.yaml. You can check the iter.init folder to see if the training has been redone with reduced epochs.
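As a quick sanity check (just a sketch, not part of deepks-kit), something like this could count how many training records the init training actually wrote; the path iter.init/01.train/log.train is an assumption based on a typical folder layout and may differ in your run:

```python
# Sketch: verify that iter.init was actually retrained with fewer epochs
# by counting the records in its training log.
# The path below is an assumption (typical layout); adjust to your run.
from pathlib import Path

log = Path("iter.init/01.train/log.train")
if log.exists():
    # Skip blank and comment/header lines; each remaining line is one logged training record.
    records = [ln for ln in log.read_text().splitlines()
               if ln.strip() and not ln.lstrip().startswith("#")]
    print(f"{log}: {len(records)} logged training records")
else:
    print(f"{log} not found -- check the folder layout of your run")
```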

y1xiaoc avatar Mar 14 '24 00:03 y1xiaoc

I'm sure that the training has been redone with the reduced number of epochs, but the convergence rate is still 0. In this case, is there a recommended n_epoch value? [Screenshots: 2024-03-15 162701, 2024-03-15 164426]

Shangguanying992 avatar Mar 15 '24 08:03 Shangguanying992

You may try symmetrizing the descriptors by modifying the init_train block in params.yaml, as suggested here (see the last part of that params.yaml).

ouqi0711 avatar Mar 17 '24 02:03 ouqi0711

I have tried modifying params.yaml as suggested, but it still has the same problem. I changed n_epoch and start_lr.

Shangguanying992 avatar Mar 17 '24 14:03 Shangguanying992