qmcpack icon indicating copy to clipboard operation
qmcpack copied to clipboard

Fix occasional optimizer failure.

Open markdewing opened this issue 1 year ago • 5 comments

Sometimes no eigenvalue is found from the mapping (filter and sort), which leads the code to select the first eigenvalue returned from the solver, which often sends the optimizer parameters in a very undesirable direction.

The results of this situation look something like this qmca output:

                            LocalEnergy               Variance           ratio 
opt1_j2_7_6  series 0  -74.870455 +/- 0.024794   5.933186 +/- 0.535541   0.0792 
opt1_j2_7_6  series 1  -75.001842 +/- 0.037365   7.256994 +/- 2.932719   0.0968 
opt1_j2_7_6  series 2  -74.929538 +/- 0.016134   4.463142 +/- 0.154065   0.0596 
opt1_j2_7_6  series 3  -38087874.983748 +/- 13522761.916354   52176250689832632320.000000 +/- 20073852128999710720.000000   1369891355506.0911 
opt1_j2_7_6  series 4  -35506733.280610 +/- 11037598.694781   51117489304034312192.000000 +/- 17345565406866282496.000000   1439656216753.3770 
opt1_j2_7_6  series 5  -9073700.394155 +/- 3700257.273053   11376395351848120320.000000 +/- 5387829541237716992.000000   1253776833889.7781 

The workaround is to detect the situation where no acceptable eigenvalue is found, and run a second filter which accepts eigenvalues above the current VMC energy ( E_0 = H(0,0) ). It cuts off at E_0 + 100, to mirror the lower cutoff.

The solution could be simpler, but the fix was chosen to keep the existing eigenvalue selection unchanged as much as possible.

Some notes on the details of eigenvalue selection:

Why 2.0? I don't know. It was introduced in 4b5b0a3f81ef25412eee1350ad579eef62a8f3f3 (2011). Prior to that change, the code searched for the eigenvalue closest to E_0 that falls between E_0 and E_0 - 1e10. After that change, the range was narrowed to E_0 and E_0 - 100, and the code searched for the eigenvalue closest to E_0 - 2.0. My guess is that a slightly lower eigenvalue might lead to faster convergence if there are a number of eigenvalues close to E_0 (??) but that really ought to be tested.

Why E_0 - 100 for the cutoff? I also don't know. My guess is it's to cut off obviously spurious eigenvalues which could arise from noise. Since this is a fixed value, it could cause problems for systems with very large total energies.

What type(s) of changes does this code introduce?

Delete the items that do not apply

  • Bugfix

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

desktop

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is simply a reminder of what we are going to look for before merging your code.

  • Yes. This PR is up to date with current the current state of 'develop'
  • Yes. Code added or changed in the PR has been clang-formatted
  • No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • No. Documentation has been added (if appropriate)

markdewing avatar Feb 09 '24 06:02 markdewing

Test this please

prckent avatar Feb 09 '24 23:02 prckent

Very interesting! Thanks Mark. Can you describe the situation that triggered this?

prckent avatar Feb 09 '24 23:02 prckent

The context is testing for #4901. The cusp correction depends on a discrete parameter (locations where the second derivative of the correction matches the second derivative the basis function). I had set up some scripts to scan across that parameter, and this problem with the optimizer would show up in a few percent of the cases.

I had seen it before with some orbital rotation optimization, but hadn't tracked down the issue (It doesn't require orbital rotation to be involved - the problem could show up just optimizing J2 and J1, though it was more frequent when also optimizing J3)

markdewing avatar Feb 12 '24 16:02 markdewing

It doesn't seem the case that E_0 = H(0,0) but E_0 = (H * Inv)[0,0] where inv is the inverse of overlap matrix.

ye-luo avatar Feb 12 '24 18:02 ye-luo

For the overlap matrix, the first row and column are the identity matrix, so the inverse also ends up as the identity matrix in the first row and column, and (H * Inv)[0, 0] ends up being the same as H[0,0].

markdewing avatar Feb 12 '24 20:02 markdewing

Test this please

prckent avatar Apr 29 '24 17:04 prckent

Test this please

prckent avatar Apr 29 '24 20:04 prckent