QCSchema icon indicating copy to clipboard operation
QCSchema copied to clipboard

Detailed molecular basis set specification (output of a calculation, not input)

Open tovrstra opened this issue 7 years ago • 38 comments

Rough idea

The current schema does not yet have standardized method for specifying a Gaussian (or Slater) AO basis that was used in a calculation for a given molecule. Just to be clear: I'm looking for a way to fully specify a basis set in the output of a calculation, after the calculation has been carried out. This is different from specifying a basis set in the input for a calculation, because in the output you want to know how to interpret quantities expressed in an AO basis. For that you need more information, namely:

  • Order of the basis functions within one shell.
  • Normalization and sign conventions for the basis functions.
  • The center for each shell of basis functions.

Because basis functions are not necessarily centered on nuclei, an array with centers must be included as well.

I think we can mostly JSON-ize the molden format (see [GTO], [STO] and [MO] sections of http://www.cmbi.ru.nl/molden/molden_format.html) plus some improvements:

  • Make the ordering of the basis functions within one shell (and pure versus Cartesian) explicit as follows:

    {"conventions": {
      "d": ["xx", "yy", "zz", "xy", "xz", "yz"],
      "f": ["f0c", "f1c", "f1s", "f2c", "f2s", "-f3s", "-f3c"]
    }}
    

    Cartesian functions always match the x*y*z* regular expression. For pure functions, the string consists of four parts: sign (optional), angular momentum (letter code), magnetic quantum number (absolute value), s or c for sine- or cosine-like. The optional sign is needed because some codes (e.g. ORCA) have unusual sign conventions, different from most other QC codes. The line for the f-functions in the example is how ORCA orders basis functions (and flips signs) in AO arrays in a Molden file. (This is different from what the Molden program actually expects.) To fix the meaning of the strings for the pure functions completely, we should write out the mathematical form in the JSON schema.

  • Allow very high angular momenta, exhausting the whole alphabet (except j, see Wikipedia, more authoritative reference always welcome): ["s", "p", "d", "f", "g", "h", "i", "k", "l", "m", "n", "o", "q", "r", "t", "u", "v", "w", "x", "y", "z", "a", "b", "c", "e"].

  • Assume contractions are normalized already, i.e. do not assume that the program reading the contraction coefficients will fix the normalization for you. Instead, the program writiing out the contractions should take care of that. (The Molden format assumes contractions need to be normalized upon reading, which is not suitable for all cases.)

  • Support (very) generalized contractions, i.e. also compatible with basis sets from CP2K. See https://github.com/cp2k/cp2k/blob/master/cp2k/data/BASIS_MOLOPT (It does not get more general than that.) For every shell we should have a list of angular momenta, e.g. ["s", "p"], ["s", "s"] or ["s", "s", "s", "p", "p", "d"]. The last example is something you could encounter in a molopt basis set.

  • Keep a list of centers for the shells, which can be different from atomic positions, e.g. when using ghost atoms or doing other funny things.

  • Include pseudopotential specification? (I'm not an expert.)

  • Details for STO basis functions need to be worked out. (I'm not an expert on that either.)

Example for NH3

{"orbital_basis": {
  "type": "gto",
  "conventions": {
    "p": ["x", "y", "z"],
    "d": ["d0c", "d1c", "d1s", "d2c", "d2s"],
  },
  "shells": [
    [0, 8, ["s", "s"],
     [0.9046E+4, 0.1357E+4, 0.3093E+3, 0.8773E+2, 0.2856E+2, 0.1021E+2, 0.3838E+1, 0.7466],
     [0.6996174134E-3, 0.538605463E-2, 0.2739102119E-1, 0.103150592, 0.2785706633, 0.4482948495, 0.2780859284, 0.1543156123E-1],
     [-0.304990096E-3, -0.2408026379E-2, -0.1194444873E-1, -0.489259929E-1, -0.1344727247, -0.3151125777, -0.2428578325, 0.1094382207E+1],
    ],
    [0, 1, ["s"], [0.2248], [1.0]],
    [0, 1, ["s"], [0.6124E-1], [1.0]],
    [0, 3, ["p"],
     [0.1355E+2, 0.2917E+1, 0.7973],
     [0.5890567677E-1, 0.3204611067, 0.7530420618]],
    [0, 1, ["p"], [0.2185], [1.0]],
    [0, 1, ["p"], [0.5611E-1], [1.0]],
    [0, 1, ["d"], [0.817], [1.0]],
    [0, 1, ["d"], [0.23], [1.0]],
    [1, 3, ["s"],
     [0.1301E+2, 0.1962E+1, 0.4446],
     [0.3349872639E-1, 0.2348008012, 0.8136829579]],
    [1, 1, ["s"], [0.122], [1.0]],
    [1, 1, ["s"], [0.2974E-1], [1.0]],
    [1, 1, ["p"], [0.727], [1.0]],
    [1, 1, ["p"], [0.141], [1.0]],
    [2, 3, ["s"],
     [0.1301E+2, 0.1962E+1, 0.4446],
     [0.3349872639E-1, 0.2348008012, 0.8136829579]],
    [2, 1, ["s"], [0.122], [1.0]],
    [2, 1, ["s"], [0.2974E-1], [1.0]],
    [2, 1, ["p"], [0.727], [1.0]],
    [2, 1, ["p"], [0.141], [1.0]],
    [3, 3, ["s"],
     [0.1301E+2, 0.1962E+1, 0.4446],
     [0.3349872639E-1, 0.2348008012, 0.8136829579]],
    [3, 1, ["s"], [0.122], [1.0]],
    [3, 1, ["s"], [0.2974E-1], [1.0]],
    [3, 1, ["p"], [0.727], [1.0]],
    [3, 1, ["p"], [0.141], [1.0]],
  ],
  "centers": [
    [-0.0140883131, 0.0845903925, 0.1037711513]
    [1.4952113836, 0.0214187375, 0.0445603623]
    [-0.5919457779, -1.6621666211, 0.5350312215]
    [-0.7075176488, 0.4654193413, -2.0214243076]
  ]
}}

Details

The shells field is a list, where each item represents one generalized contraction stored as a list with the following items:

  • center index (counting from zero?)
  • the number of primitives
  • the angular momenta for the generalized contraction
  • the Gaussian exponents
  • one or more lists of contraction coefficients. (More are present in case of generalized contractions.)

To do

  • Units
  • Support different normalization conventions. Furthermore, Cartesian functions have different normalization constants within one shell, which is causing some ambiguity.

tovrstra avatar Aug 18 '17 08:08 tovrstra

I've updated the text above. The changes were mostly related to pure functions.

tovrstra avatar Aug 18 '17 11:08 tovrstra

Allow very high angular momenta: s, p, d, f, g, h, i, j, k, l, m, n, o. (Not sure what should be done after o?)

From Wikipedia:

The letters after the f sub-shell just follow letter f in alphabetical order except the letter j and those already used.

so o, q, r, t, u, v, w, x, y, z, ...

berquist avatar Aug 18 '17 12:08 berquist

@berquist Thanks! I'll fix this (and other remarks later) in the top post.

tovrstra avatar Aug 18 '17 12:08 tovrstra

@twindus @wadejong Do you guys have a link to the EMSL BSE XML spec?

dgasmith avatar Aug 18 '17 12:08 dgasmith

I think I found it: https://bse.pnl.gov/bse/docs/schemas/gaussian.xsd I'll take a look. I think this paper explains the xml spec to some extent: http://pubs.acs.org/doi/pdf/10.1021/ci600510j Are there other references?

There is also https://bse.pnl.gov/bse/docs/schemas/ecp.xsd for effective core potentials.

tovrstra avatar Aug 18 '17 12:08 tovrstra

Core potentials as well?

gkc1000 avatar Aug 18 '17 13:08 gkc1000

@gkc1000 Core potentials are not as urgent because they are not needed to visualize orbitals or to compute densities. They could be useful for more specialized post-processing, e.g. follow-up calculations with other levels of theory

tovrstra avatar Aug 18 '17 13:08 tovrstra

The EMSL BSE XML has core potentials AFAIK. If its not in the document its in another.

dgasmith avatar Aug 18 '17 13:08 dgasmith

There are discussions that have started regarding the BSE, it's XML, the need for extending the format, and possibly having JSON. Just FYI.

On Aug 18, 2017, at 6:20 AM, Daniel Smith [email protected] wrote:

The EMSL BSE XML has core potentials AFAIK. If its not in the document its in another.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

wadejong avatar Aug 18 '17 13:08 wadejong

@wadejong : are these discussion done on an online forum, so we can follow up on them?

tovrstra avatar Aug 18 '17 13:08 tovrstra

Not at this point. We had a first preliminary discussion yesterday related to long term support for BSE and talking about simplification of the architecture. I discussed this schema effort too, seeing if we can connect the efforts and unify things. I will suggest an online discussion forum for this.

On Aug 18, 2017, at 6:33 AM, Toon Verstraelen [email protected] wrote:

@wadejong : are these discussion done on an online forum, so we can follow up on them?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

wadejong avatar Aug 18 '17 13:08 wadejong

Thanks for the info. I looked at the EMSL schema for basis sets. These are my main thoughts. I hope it can be of use:

  • the EMSL scheme It serves a different purpose, namely to describe a basis set for a list of elements in the periodic table. It is not meant to describe molecular basis sets. Still there is a lot of overlap, so we could have some useful exchange of ideas.

  • The support for generalized contractions is rather limited in the EMSL XML scheme. The very general contractions as in CP2K are apparently not supported.

  • The EMSL scheme does not say anything about ordering of basis functions within one shell because it is not relevant in that context.

  • The EMSL scheme does not enforce any kind of normalization conventions for the contractions. (We ran into examples of unnormalized contractions at some point: theochem/horton#39.) We really need this to correctly interpret orbitals and other data. Enforcing normalization conventions is not very important for the EMSL scheme.

As it is, the EMSL scheme is not a one-stop solution here for two reasons: some features are missing and it is not meant to be used for describing molecular basis sets.

tovrstra avatar Aug 18 '17 13:08 tovrstra

In our extended Chemical JSON format currently implemented in a side version of NWChem we do have a complete format for Gaussian basis sets. Will drop it in when I get into the office.

I do like some of the ideas. Especially dealing with the generalized contractions. My format simply unpacks them one shell at a time, which leads to some replication. Using matrix formulations might be a good approach.

On Aug 18, 2017, at 1:42 AM, Toon Verstraelen [email protected] wrote:

Rough idea

The current schema does not yet have standardized method for specifying a Gaussian (or Slater) AO basis for a given molecule in full detail. This is needed when interpreting atomic orbitals or density matrices in the output. A simple string like "3-21G" is not enough because different codes use different conventions needed for the interpretation and because some basis set strings have not such a well-defined meaning.

I think we can mostly JSON-ize the molden format (see [GTO], [STO] and [MO] sections of http://www.cmbi.ru.nl/molden/molden_format.html) plus some improvements:

Make the ordering of the basis functions within one shell (and pure versus Cartesian) explicit as follows:

{"conventions": { "d": ["xx", "yy", "zz", "xy", "xz", "yz"], "f": ["-3", "-2", "-1", "0", "+1", "+2", "+3"] }} Allow very high angular momenta: s, p, d, f, g, h, i, j, k, l, m, n, o. (Not sure what should be done after o?)

Assume contractions are normalized already with some convention, i.e. do not assume that the program reading the contraction coefficients will fix the normalization for you. Instead, the program writiing out the contractions should take care. (Different programs use different conventions, which is information that you do not want to discard. The Molden format assumes contractions need to be normalized upon reading, which is not suitable for all cases.)

Support (very) generalized contractions, i.e. also compatible with basis sets from CP2K. See https://github.com/cp2k/cp2k/blob/master/cp2k/data/BASIS_MOLOPT (It does not get more general than that.) For every shell we should have a list of angular momenta, e.g. ["s", "p"], ["s", "s"] or ["s", "s", "s", "p", "p", "d"]. The last example is something you could encounter in a molopt basis set.

Keep a list of centers for the shells, which can be different from atomic positions, e.g. when using ghost atoms or doing other funny things.

Include pseudopotential specification? (I'm not an expert.)

Details for STO basis functions need to be worked out. (I'm not an expert on that either.)

Example for NH3

{"basis": { "type": "gto", "conventions": { "p": ["x", "y", "z"], "d": ["0", "+1", "-1", "+2", "-2"], }, "shells": [ [0, 8, ["s", "s"], [0.9046E+4, 0.1357E+4, 0.3093E+3, 0.8773E+2, 0.2856E+2, 0.1021E+2, 0.3838E+1, 0.7466], [0.6996174134E-3, 0.538605463E-2, 0.2739102119E-1, 0.103150592, 0.2785706633, 0.4482948495, 0.2780859284, 0.1543156123E-1], [-0.304990096E-3, -0.2408026379E-2, -0.1194444873E-1, -0.489259929E-1, -0.1344727247, -0.3151125777, -0.2428578325, 0.1094382207E+1], ], [0, 1, ["s"], [0.2248], [1.0]], [0, 1, ["s"], [0.6124E-1], [1.0]], [0, 3, ["p"], [0.1355E+2, 0.2917E+1, 0.7973], [0.5890567677E-1, 0.3204611067, 0.7530420618]], [0, 1, ["p"], [0.2185], [1.0]], [0, 1, ["p"], [0.5611E-1], [1.0]], [0, 1, ["d"], [0.817], [1.0]], [0, 1, ["d"], [0.23], [1.0]], [1, 3, ["s"], [0.1301E+2, 0.1962E+1, 0.4446], [0.3349872639E-1, 0.2348008012, 0.8136829579]], [1, 1, ["s"], [0.122], [1.0]], [1, 1, ["s"], [0.2974E-1], [1.0]], [1, 1, ["p"], [0.727], [1.0]], [1, 1, ["p"], [0.141], [1.0]], [2, 3, ["s"], [0.1301E+2, 0.1962E+1, 0.4446], [0.3349872639E-1, 0.2348008012, 0.8136829579]], [2, 1, ["s"], [0.122], [1.0]], [2, 1, ["s"], [0.2974E-1], [1.0]], [2, 1, ["p"], [0.727], [1.0]], [2, 1, ["p"], [0.141], [1.0]], [3, 3, ["s"], [0.1301E+2, 0.1962E+1, 0.4446], [0.3349872639E-1, 0.2348008012, 0.8136829579]], [3, 1, ["s"], [0.122], [1.0]], [3, 1, ["s"], [0.2974E-1], [1.0]], [3, 1, ["p"], [0.727], [1.0]], [3, 1, ["p"], [0.141], [1.0]], ], "centers": [ [-0.0140883131, 0.0845903925, 0.1037711513] [1.4952113836, 0.0214187375, 0.0445603623] [-0.5919457779, -1.6621666211, 0.5350312215] [-0.7075176488, 0.4654193413, -2.0214243076] ] }} Details

The shells field is a list, where each item represents one generalized contraction stored as a list with the following items:

center index (counting from zero?) the number of primitives the angular momenta for the generalized contraction the Gaussian exponents one or more lists of contraction coefficients. (More are present in case of generalized contractions.) To do

units — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

wadejong avatar Aug 18 '17 13:08 wadejong

@wadejong Could you post the Gaussian basis JSON format please? No rush. Thanks!

tovrstra avatar Aug 22 '17 15:08 tovrstra

Today.

Sent from my iPhone

On Aug 22, 2017, at 8:35 AM, Toon Verstraelen [email protected] wrote:

@wadejong Could you post the Gaussian basis JSON format please? No rush. Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

wadejong avatar Aug 22 '17 15:08 wadejong

I added a directory Basis_Sets with an example Gaussian basis JSON file and also an example that has an ECP in it to the QC_JSON repository.

On Tue, Aug 22, 2017 at 8:35 AM, Toon Verstraelen [email protected] wrote:

@wadejong https://github.com/wadejong Could you post the Gaussian basis JSON format please? No rush. Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MolSSI/QC_JSON_Schema/issues/12#issuecomment-324064948, or mute the thread https://github.com/notifications/unsubscribe-auth/AGa9cs-lOecMbsuiAWTC5lHyyRyZ6pXfks5savVGgaJpZM4O7Q7s .

wadejong avatar Aug 23 '17 00:08 wadejong

@wadejong Thanks!

It seems that these JSON files have a purpose comparable to the EMSL XML scheme, i.e. describing basis functions for atoms. This is typically the input one would give to a QC code, which is somewhat different from the problem I'm trying to address, how to represent the molecular basis set in the output of a QC calculation. (See comment above on XML scheme.)

After going through the https://github.com/MolSSI/QC_JSON_Schema/commit/ce3230167f4cc1276f1ff1128ff9c17b9759cdea, I wondered if it is preferable to work with dictionaries to represent the shells? I'm the idea above, I've just used a list in which the order of the elements fixes their meaning. That is not as self-explaining but it makes the JSON more compact.

tovrstra avatar Aug 23 '17 04:08 tovrstra

@tovrstra I mihgt be missing sth, but didn't we decide to represent shells as a list of namedtuple('Shell', ['icenter', 'iatom', 'angmoms', 'exponents', 'contractions'])?

FarnazH avatar Apr 07 '19 16:04 FarnazH

I have been working on this in Chemical JSON, and use a number of related arrays to arrange the data in linear arrays following the large array convention we decided to adopt. We have a basisSet object, and that has arrays for coefficients, primitivesPerShell, shellToAtomMap, shellTypes. There is an orbitals object too that has energies, moCoefficients (for alpha and beta optionally), occupations, and then we have an object for vibrations too. The example file has a perhaps too large molecule but a working implementation.

We have been able to use this along with coordinate sets and multiple MO coefficient arrays to animate a series of calculations where multiple calculations were performed on a system as it bonded.

cryos avatar Apr 08 '19 11:04 cryos

@cryos Thanks for the information! I have a few small questions:

Is there a detailed description of this JSON format, e.g. also including normalization conventions of primitives and contracted basis functions?

Would you have a smaller example too? (E.g. for testing purposes.)

Which implementations can already write Chemical JSON or read it for visualization and post-processing?

How does Chemical JSON handle generalized contractions?

tovrstra avatar Apr 08 '19 12:04 tovrstra

Coming back to this, we added a basis input specification from @bennybp which is a mimic of the data structure used in the new Basis Set Exchange. As a note this has been added here. This covers all use cases extant in the BSE and should provide good coverage of what is out there (@bennybp may have a few items to add).

I believe the only item that does not exist is ordering. For ordering I am leaning a bit towards simply providing ordering arrays as per CCLib. I believe the only other viable option would be the CCA standard?

dgasmith avatar Apr 23 '19 13:04 dgasmith

Either of these options seems reasonable. There was a lot of support for just doing CCA at the recent ACS symposium - anecdotal, but....

twindus avatar Apr 23 '19 14:04 twindus

@dgasmith Could you post a link to the CCA standard? Thanks.

tovrstra avatar Apr 23 '19 14:04 tovrstra

The CCA paper can be found here. Same ordering as FCHK/Psi4 here. I think there may be some oddities in FCHK higher order cartesian IIRC.

dgasmith avatar Apr 23 '19 14:04 dgasmith

Thanks. I'll come back to it later.

tovrstra avatar Apr 23 '19 15:04 tovrstra

At the very least we could recommend the CCA ordering even if we support others. It would make things simpler in many cases.

dgasmith avatar Apr 23 '19 15:04 dgasmith

I would not mind fixing the ordering of basis functions and related conventions in QCSchema, as long as it is all clearly defined. Something like "the same as in program ABC" is not good enough because most other programs never explicitly define their conventions. Here are some examples of potential sources of confusion from the codes mentioned above (not meant as criticism, just to illustrate):

  • The CCA paper only seems to mention the order of the Cartesian primitives, not pure ones, unless I missed something.

  • The FCHK format uses a different ordering than what PSI4 uses internally (for both pure and Cartesian functions), so I'm not sure what FCHK/PSI4 ordering means.

  • Libint can be configured to work with different orderings.

  • ...

Can we include in the QCSchema documentation or definition the following information?

  • Standard ordering of Cartesian and pure basis functions within one shell, with equations to rule out any ambiguities. (For example, ORCA uses rather odd sign conventions for pure functions for m>=3. Such aspects should be fixed by convention.)

  • Standard normalization of primitives, with equations.

  • How to handle unnormalized contractions? (As far as I remember, not all basis sets on EMSL have normalized contractions. Most programs renormalize them when setting up the basis set.)

tovrstra avatar Apr 24 '19 08:04 tovrstra

@twindus Do you have the CCA order for spherical basis sets? I think its 0, -/+ 1, ...

On that particular topic there is the fun question of what to do with p orbitals as the spherical transform does not effect the orbitals, just the order. Most programs do not bother with the transform.

I am leaning a bit back towards saying that the AO quantities must be in CCA order. It puts slightly more onus on the generation of the schema data, but makes everything downstream much easier to handle. We can supply translation routines in QCElemental/QCEngine that can handle this.


  • We can generate pure/cartesian orderings no problem. @tovrstra, can you expand a bit on the ORCA sign convention issue?
  • Do you have the normalization equations sitting around? I poked through libint's docs, but didn't find anything.
  • Does unnormalized contractions hurt us? The program should write out the basis and orbitals in whatever normalization was used internally and if they do/do not convert it should be encoded I believe.

dgasmith avatar May 07 '19 12:05 dgasmith

Regarding ORCA, it could be specifically an issue in their routines to write out Molden files. When we read in Molden files with apparently incorrectly normalized orbitals, we run a series of checks to detect common deviations from the original file format. Orca has a sign change in pure AO functions from m>=3 on, which could (or not) reflect their internal conventions. Our settings for reading ORCA Molden files can be found here: https://github.com/theochem/iodata/blob/master/iodata/formats/molden.py#L393

tovrstra avatar May 07 '19 14:05 tovrstra

From discussion with @wadejong and @cryos this week at Berkeley I think we are leaning towards having purely the CCA standard within the schema. This puts some additional effort upfront, but ultimately makes things far simpler.

Looking at the CCA standard (https://onlinelibrary.wiley.com/doi/epdf/10.1002/jcc.20815) the spherical ordering is L, ..., 0, ..., -L. However, the libint documentation appears to describe the CCA ordering as -L, ..., 0, ..., L.

@evaleev @twindus Is there a misunderstanding here?

dgasmith avatar Jun 04 '19 19:06 dgasmith