psi4 icon indicating copy to clipboard operation
psi4 copied to clipboard

MBIS_VOLUME_RATIOS with num_frozen_docc set

Open tallakahath opened this issue 3 years ago • 17 comments

I have situations where I want to manually specify the number of frozen core orbitals using num_frozen_docc, do some work, then calculate some MBIS properties including the volume ratios. The latter causes psi4 to automatically calculate the free-atom volumes in the same method as the last-run calculation (in my case, wb97m-d3bj) and inherits all the settings... including num_frozen_docc. I can insert a set num_frozen_docc 0; set freeze_core true right before the oeprop(wfn, 'MULLIKEN_CHARGES') call and make things go away, but what worries me is just how things fail.

Well, sometimes things are in fact perfectly fine. Other times, psi4 segfaults. The behavior depends on just how large num_frozen_docc is -- I think if it ends up exceeding some other quality dependent on the calculation, things go belly-up. Here's the relevant line from a coredump.

#0  0x00002b4f362ef958 in psi::scf::HF::compute_fcpi (this=0x223c06c0) at [snip]/psi4/psi4/src/psi4/libscf_solver/hf.cc:830
830             for (int i = 0; i < nfzc; ++i) frzcpi_[pairs[i].second]++;

And here's a minimum working example of the issue that MBIS_VOLUME_RATIOS runs into, as a standalone calc:

set num_frozen_docc 20
set basis def2-tzvppd
set reference uks
molecule mol {
0 2
H 0 0 0
units angstrom
}
E, wf = energy('wb97m-d3bj',return_wfn=True,)

If you flip num_frozen_docc to 18, it works. (19 works for me on some machines, not on others -- eek!). In all cases I'm invoking psi4 the same way: psi4 -n1 -i run.in --memory 500MB.

I'm building off of 1.6.X, using icc/2020.2-108-02c7; I don't think I have anything "special" set in my compile environment besides buildtype=RelWithDebInfo. I didn't mess with any optimization flags, for sure.

tallakahath avatar Jul 12 '22 19:07 tallakahath

Thanks for the detailed report. I can reproduce the segfault, though for the benefit of other developers, I'll point out that this can be reproduced with scf instead of wb97m-d3bj. Sophisticated DFT functionality is not the issue here. This looks like a missing option validation at runtime.

I'll add that my advisor is also unhappy with the way Psi4 assumes options for precisely situations like these, but I have a few other things to patch before I touch options passing.

JonathonMisiewicz avatar Jul 12 '22 20:07 JonathonMisiewicz

would a psi4.core.clean_options() between the two work sessions be helpful in the meantime?

a more targeted but confusing way would be with the "revoke" commands https://github.com/psi4/psi4/blob/master/tests/pywrap-checkrun-convcrit/input.dat#L53-L55 but not recommended.

loriab avatar Jul 12 '22 22:07 loriab

Independent of the options issues -- should something be patched/changed such that if the user/a program/etc passes a nonsense num_frozen_docc, the program gives a sensible error rather than segfaulting?

tallakahath avatar Jul 23 '22 17:07 tallakahath

Independent of the options issues -- should something be patched/changed such that if the user/a program/etc passes a nonsense num_frozen_docc, the program gives a sensible error rather than segfaulting?

Yes. Lori's comment was "trick you can use to prevent accidentally stumbling into this again," not a fix.

scf::HF::compute_fcpi probably needs a validation check...

JonathonMisiewicz avatar Jul 23 '22 20:07 JonathonMisiewicz

I've turned up some much nastier behavior then a segfault -- incorrect QM...

Take the attached input file run.txt and note that set num_frozen_docc 6 at the top there should potentially cause issues for the water molecule in the dimer. The alkane is perfectly OK with this, however, but should num_frozen_docc even be coming into play in a HF-level calc? I thought not, isn't it just for correlated calculations?

Anyway, in the run, the first SAPT0 step runs just fine -- really, the HF Is what we're looking at, and we get sensible numbers for the interaction:

  Total HF                         -0.10898644 [mEh]      -0.06839003 [kcal/mol]      -0.28614386 [kJ/mol]
  Total SAPT0                      -0.27387960 [mEh]      -0.17186205 [kcal/mol]      -0.71907080 [kJ/mol]

And the underlying HF energies are:

    Total Energy =                       -272.3598217308123139
    Total Energy =                       -196.3248586337326458
    Total Energy =                        -76.0348541106373688

But something goes Very Wrong in the second calculation. The water has moved, sure, but it's not done anything dramatic, really. It's still a water. The alkane hasn't moved at all. And yet...

  Total HF                       -510.71908624 [mEh]    -320.48106505 [kcal/mol]   -1340.89277619 [kJ/mol]
  Total SAPT0                    -512.45472131 [mEh]    -321.57019251 [kcal/mol]   -1345.44968545 [kJ/mol]

That's no good. Where did we go wrong?

In the water monomer energy, in fact:

    Total Energy =                       -272.3599706292600899
    Total Energy =                       -196.3248586336976587
    Total Energy =                        -75.5243929093220743

Note that the water there seems to be 1 hartree too "unstable".

If we instead take the second geometry and run it stand-alone, with the incorrect num_frozen_docc, we get a perfectly cromulent result:

  Total HF                         -0.25751717 [mEh]      -0.16159446 [kcal/mol]      -0.67611123 [kJ/mol]
  Total SAPT0                      -0.62788646 [mEh]      -0.39400470 [kcal/mol]      -1.64851567 [kJ/mol]

So something strange is going on here -- I'd expect either the incorrect num_frozen_docc to break everything, or to break nothing (or to break SAPT0 only, but not the HF calc, at least!). Given the segfault at sufficiently bad num_frozen_docc, and the out-of-bounds array reach, I'm guessing memory corruption?

tallakahath avatar Aug 06 '22 02:08 tallakahath

I see your results, too. What's the overall intent of the num_frozen_docc here? Are you wanting plain frozen core or is this preparation for something more complex? Note that we did address frozen core in sapt fairly recently, https://github.com/psi4/psi4/pull/2271, and that solution in itself is making use of num_frozen_docc. When I switch your first line to set freeze_core true, water is reasonable again.

  > grep -e "Final E" lizsapt2.out 
  @DF-RHF Final Energy:  -272.35982173084625
  @DF-RHF Final Energy:  -196.32485863378869
  @DF-RHF Final Energy:   -76.03485411063548
  @DF-RHF Final Energy:  -272.35997062909757
  @DF-RHF Final Energy:  -196.32485863377360
  @DF-RHF Final Energy:   -76.03485447822362

In SAPT, frozen-core will influence a delta-MP2 correction (irrelevant here) and possibly (I'm not sure if there are conditions) dispersion https://github.com/psi4/psi4/blob/master/tests/sapt10/input.dat#L59-L63 since those are mp2-like terms. You're right that HF energies themselves should be indifferent. Without experimenting much, I venture that the presence of the sapt fc correction and all the wfn passing w/i sapt is causing the bad QC.

Possibly num_frozen_docc should be disabled for SAPT, if newly fixed freeze_core=True works for you. Or else num_frozen_docc needs a separate fix.

loriab avatar Aug 06 '22 04:08 loriab

The overall intent is that we usually run psi4 from an outside workflow that runs exactly one single-point energy per infile (or a scan of geometrically-related single point energies, e.g., a dimer being pulled apart). Usually, each of these calculations represents ONLY a monomer, or the dimer. We have a lookup table of how many orbitals we want to freeze in post-HF calculations, and the most visible way to apply that setting was to just set num_frozen_docc.

For example, if we wanted to calculate the interaction energy of a water-alkane dimer at the MP2 level, we'd have three calculations:

  • A in the AB basis with num_frozen_docc set accordingly for A
  • B in the AB basis with num_frozen_docc set accordingly for B
  • AB in the AB basis with num_frozen_docc set accordingly for AB

This scheme worked very well, until it came time to run SAPT0 calcs. As I admittedly hadn't thought through the effect of an incorrect n_frozen_docc on the monomer calcs done within the SAPT0 dimer calc (and presumed them to be at the HF level anyway), I didn't think to check for correct behavior.

The only place where our frozen-core lookup table disagrees with psi4's is for some more exotic elements (transition metals), so for now my solution is to just use freeze_core true for all SAPT0 calcs and go about my life. And when I thought the bad setting just caused segfaults, I was fine to assume that all calculations that ran to completion were obviously OK. Now seeing that there's some shade of undefined behavior leading to memory corruption possibly going on, I'd like to understand what is going wrong so I can figure out what's likely impacted... cases where the energy is obviously and egregiously wrong (like above) are easy, but I'm worried more about subtle incorrectness.

As for the general fix -- if num_frozen_docc is disabled for SAPT0, then it becomes impossible for a user to impose their own beliefs about frozen orbitals in tricky cases like transition metals, which seems... bad. Since it does impact the dispersion portion of the calc, it needs to be user-mutable. This setting shouldn't be actually impacting the HF energies, and fixing that strange interaction seems like most of the battle. I'm less sure of what to do in the MP2 case, since it does seem relevant that the user should be able to specify their own core policy as required...

At the risk of further complication, would it make sense to implement either:

  1. The ability for num_frozen_docc to take in a tuple the same length of the number of fragments in the active Molecule (such that correct behavior can be inferred in cases like SAPT where both monomer and dimer calcs are run in a global context)
  2. The ability to define, via a list(?), a custom policy on a per-element basis that can be fed to freeze_core?

Both of these would allow both for the user to handle cases like custom frozen thresholds in multi-calc situations and keep things consistent. At the same time, these are... possibly a lot of work. I'm prototyping a solution of type (2) but I'm not sure if it'll be too clunky for others to use -- assuming I haven't messed up horribly, I'll propose the patch later today.

tallakahath avatar Aug 06 '22 17:08 tallakahath

I've written a patch to allow for a custom frozen policy that is a bit clunky but safer than num_frozen_docc (because it applies per-atom rules, so works properly for cases like SAPT, MBIS_VOLUME_RATIOS, etc) -- once I can be sure this builds in vanilla psi4 (trying now) I will re-run my test jobs and submit the patch.

Still not sure what to do about scf::HF::compute_fcpi because I'm still a newbie to the code base (and this patch does NOT fix that issue).

tallakahath avatar Aug 08 '22 17:08 tallakahath

Just to make sure I understand the issues here: The first one is clear. A segfault occurs if num_frozen_docc is "too large." What exactly "too large" means is unclear, but more frozen docc pairs than electron pairs is sufficient. The second one is less clear. Obviously, the QC variables controlling the HF energy are getting grabbed incorrectly, but are these two geometries supposed to be different?

JonathonMisiewicz avatar Aug 09 '22 20:08 JonathonMisiewicz

These two geometries are different, but they involve the same monomers, and I'm saving some time on the second set of HF calcs by starting from the first set of orbitals (or so I'd hope!). I think the monomers might be exactly the same, even (just one is moved in-space).

Mainly, the (incorrect) setting of num_frozen_docc causes a fuse to blow with HF in the second calculation, which is Odd for a number of reasons.

tallakahath avatar Aug 09 '22 21:08 tallakahath

Okay, so there are three different issues here:

  1. A segfault occurs if num_frozen_docc is "too large." What exactly "too large" means is unclear, but more frozen docc pairs than electron pairs is sufficient.
  2. Under certain conditions, num_frozen_docc causes very bad HF energies in the SAPT printout. The mechanism for this is unclear, but the individual HF computations themselves seem fine.
  3. The handling of frozen core orbitals is not flexible enough for your purposes. #2667 fixes this issue but not the other two.

Do I have all that right? Any other issues I've missed?

JonathonMisiewicz avatar Aug 09 '22 21:08 JonathonMisiewicz

No, it looks like in (2) you're actually getting a bad HF calculation -- even in the HF calculation part, the numbers are Wrong (for the lone water in this case).

(1) and (3) look fine in terms of Bad Behavior I've seen. (and with (3)/freeze_core=True when my element sets are simple I've pretty much moved away from the issue anyway)

tallakahath avatar Aug 09 '22 21:08 tallakahath

Based on the output file, nalpha and nbeta are obviously wrong (up from 5 to 6 for this computation), presumably due to num_frozen_docc handling. I doubt this second issue is related to the num_frozen_docc segfault.

  Charge       = 0
  Multiplicity = 1
  Electrons    = 10
  Nalpha       = 5
  Nbeta        = 5

...

  ==> Pre-Iterations <==

  SCF Guess: Orbitals guess was supplied from a previous computation.

   -------------------------------------------------------
    Irrep   Nso     Nmo     Nalpha   Nbeta   Ndocc  Nsocc
   -------------------------------------------------------
     A         28      28       6       6       6       0
   -------------------------------------------------------
    Total      28      28       6       6       6       0
   -------------------------------------------------------

I can look into what's contaminating the occupation count for (2), but #2619 needs to come in first.

Is there anything else that's wrong in my summary?

JonathonMisiewicz avatar Aug 09 '22 21:08 JonathonMisiewicz

Looks good to me! Thanks for looking into this!

tallakahath avatar Aug 09 '22 22:08 tallakahath

Issue (2) is because frzcpi() is wrong during the first set of calculations. At wavefunction save time, it's 0 for the supersystem, 6 for monomer A, and 6 for monomer B. Why that is requires further investigation.

JonathonMisiewicz avatar Aug 10 '22 21:08 JonathonMisiewicz

What's happening on (2) is as follows:

  • For the supersystem computation, #2271 means that the supersystem frozen core is the sum of monomer A and monomer B frozen core. Both of those are set to zero right now, which is bad, but not the direct cause of the issue.
  • For both monomers, num_frozen_docc 6 means Psi thinks there are 6 frozen orbitals when those monomer computations run. That doesn't affect energies, but that does contaminate the wavefunction.
  • All three wavefunctions are saved
  • After reading the monomer B wavefunction, Psi asks monomer B for its occupied orbitals
  • When computing its occupied orbitals, monomer B realizes it has 6 frozen orbitals, so it must be at least 6 and therefore returns 6 occupied orbitals
  • With a garbage number of occupied orbitals, Psi computes a garbage SCF energy

The primary issue here is that for SAPT supersystem computations, Psi doesn't split num_frozen_docc into Monomer A frozen docc and Monomer B frozen docc. (nbody may well have the same problem.)

Idea 1: For reasonable frozen cores, we could plausibly do the supersystem computation, assume the core orbitals are localized on monomers, see which monomers the core orbitals are localized on, and use that to work out the docc per subsystem. This is not distributed and will fail for large frozen cores, where the localization assumption fails.

Idea 2: num_frozen_docc simply should not be used for supersystem computations. We need a different keyword that has the user specify this for each elementary system. just like they do charges and spin multiplicities.

JonathonMisiewicz avatar Aug 11 '22 11:08 JonathonMisiewicz

On the one hand, for cases like SAPT/auto-CP/etc, I'd be fine with just disabling num_frozen_docc -- it's clearly inappropriate in any supersystem calculation where the user knows at the energy call that it's going to run sub-systems.

But going upthread back to the original issue, there's use-cases like MBIS_VOLUME_RATIOS that are an add-on to what would otherwise be a valid standalone calculation to use num_frozen_docc with (a monomer calc), and there's still the question of what should happen there. You can't just guess based on calc name, because you don't know if the user is going to call oeprop down-stream. Disabling num_frozen_docc for any calculation where this could happen would effectively ban the keyword.

A reasonableness check, or a good guess, could be inserted before any calculation done on a new mol if num_frozen_docc is set in the global scope -- with a warning. Or simply error out if a new mol is calculated without num_frozen_docc having been updated (not sure how reasonable that is?).

This is but one user's thoughts, of course.

tallakahath avatar Aug 11 '22 15:08 tallakahath