QCSchema Request wavefunction data returns

A key component of the schema that we have not discussed much so far is the return of orbitals, densities, and eigenvalues for visualization and for passing data between programs. I would like to push the discussion of the return types and storage of these quantities off to a separate topic (there should be one soon discussing ordering and the like).

These "wavefunction" returns would be limited to quantities the size of the basis set or larger. Browsing around, it seems the following quantities are useful to return:

Orbitals
Densities
Eigenvalues
Overlap matrix
Is there anything else crucial for a first pass?

Would a proposed structure like the following work?

{
    "return_wavefunction_data": {
        "orbitals": True,
        "density": True,
        "eigenvalues": True,
        "overlap": True
    }
}

with a similar return structure.
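
To make the proposal concrete, here is a minimal Python sketch of how a program might honor these flags. Only the `return_wavefunction_data` key and its four flags come from the proposal above; the function name `select_wavefunction_data` and the toy two-basis-function data are hypothetical and purely illustrative.

```python
def select_wavefunction_data(request, computed):
    """Return only the quantities whose flag is True in the request.

    `computed` maps quantity names to data the program already has.
    (Hypothetical helper, not part of any schema.)
    """
    flags = request.get("return_wavefunction_data", {})
    return {
        name: computed[name]
        for name, wanted in flags.items()
        if wanted and name in computed
    }


# Toy data for a two-function basis (values are made up for illustration).
computed = {
    "orbitals": [[0.7, 0.7], [0.7, -0.7]],
    "density": [[0.98, 0.98], [0.98, 0.98]],
    "eigenvalues": [-0.58, 0.67],
    "overlap": [[1.0, 0.66], [0.66, 1.0]],
}

request = {
    "return_wavefunction_data": {
        "orbitals": True,
        "density": False,
        "eigenvalues": True,
        "overlap": False,
    }
}

result = select_wavefunction_data(request, computed)
print(sorted(result))
```

A request without the `return_wavefunction_data` key returns nothing extra in this sketch, which keeps the large arrays strictly opt-in.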

Questions:

Should these go in the keywords argument or present a new top level option?
Is it sufficient to return the AO matrices only for now and consider spatial symmetry at a later date?
Should the output live in the current properties field, which is currently restricted to single numbers and small arrays?
Output keys should be able to handle alpha/beta perhaps: orbitals_alpha?
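
As one possible reading of the alpha/beta question, the sketch below derives output key names from a quantity name. Only `orbitals_alpha` appears in the question above; the `_beta` suffix, the `restricted` flag, and the helper name `spin_keys` are assumptions made for illustration.

```python
def spin_keys(quantity, restricted):
    """Build output key names for one quantity (hypothetical helper).

    Assumption: a restricted calculation shares one set of data, so a
    single unsuffixed key is enough; an unrestricted one gets separate
    alpha and beta keys.
    """
    if restricted:
        return [quantity]
    return [f"{quantity}_alpha", f"{quantity}_beta"]


print(spin_keys("orbitals", restricted=True))
print(spin_keys("orbitals", restricted=False))
```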

Mar 30 '18 16:03 dgasmith

I think here we really should think about which data gets stored in the JSON file, and which data we may want to provide a link path to, for example a binary HDF5 file or something similar that we have a standardized format for.

When it comes to densities and overlap matrices, how much do we want to rely on on-the-fly computation by the visualization software vs. providing all the data so that no computation is needed? Which density are we looking at: a density matrix or a density on a grid?

Mar 30 '18 17:03 wadejong

I think here we really should think about which data gets stored in the JSON file, and which data we may want to provide a link path to, for example a binary HDF5 file or something similar that we have a standardized format for.

Why not provide the facilities to do both things for all data?

When it comes to densities and overlap matrices, how much do we want to rely on on-the-fly computation by the visualization software vs. providing all the data so that no computation is needed? Which density are we looking at: a density matrix or a density on a grid?

IMO it should be possible to provide arbitrary data on a grid, with annotations and conventions for describing what it is. Same for matrices - think spin density matrix, etc.
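
A rough sketch of supporting both inline data and a link path for the same field might look like the following. The `external_path` key and the `resolve` helper are hypothetical, and a sidecar JSON file stands in for the HDF5 file mentioned above so that the example needs only the standard library.

```python
import json
import os
import tempfile


def resolve(value):
    """Return the data for a field stored inline or by reference.

    Assumed convention (not part of any schema): a dict with an
    "external_path" key points to a sidecar file; anything else is
    treated as inline data.
    """
    if isinstance(value, dict) and "external_path" in value:
        with open(value["external_path"]) as handle:
            return json.load(handle)
    return value


overlap = [[1.0, 0.66], [0.66, 1.0]]  # made-up values

with tempfile.TemporaryDirectory() as tmp:
    # Write the matrix to a sidecar file (a stand-in for an HDF5 dataset).
    path = os.path.join(tmp, "overlap.json")
    with open(path, "w") as handle:
        json.dump(overlap, handle)

    inline_record = {"overlap": overlap}
    linked_record = {"overlap": {"external_path": path}}

    # Both records resolve to the same matrix.
    same = resolve(inline_record["overlap"]) == resolve(linked_record["overlap"])

print(same)
```

A consumer calls `resolve` on every field and never needs to know which storage mode the producer chose.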

Mar 30 '18 17:03 langner

Agreed on both. Outputting matrices is relatively easy; any data on a grid requires an extra compute step that is not a standard task.

Mar 30 '18 17:03 wadejong

For the first pass I would encourage simple exports that would reproduce CClib's quantities for example. We can get fairly bogged down in making something complex that is not easily exportable by current codes while something like densities and orbitals are quite generally valuable even in list form.

Mar 30 '18 17:03 dgasmith

For the first pass I would encourage simple exports that would reproduce CClib's quantities for example. We can get fairly bogged down in making something complex that is not easily exportable by current codes while something like densities and orbitals are quite generally valuable even in list form.

Maybe. On the other hand, the benefit of creating a schema from scratch is that we can discuss such things and settle on something sufficiently generic. I worry that having a simple field for specific matrices will disincentivize creating a more generic solution down the road. We could just punt on this entirely to make the first pass simple and minimal.

Mar 30 '18 18:03 langner

If possible, I would like to strike a balance: generate something that can produce results now, but that is extensible and flexible enough to support more specific things in the future.

For example, the key could be:

"orbitals": True,

However, in later versions this could be extended to the following:

"orbitals": {
    "representation": "grid",
    "points" : [0, 0, 0, 0, 0, 1...],
    "sum_a_b": True,
    ....
}

Mar 30 '18 19:03 dgasmith

If possible, I would like to strike a balance: generate something that can produce results now, but that is extensible and flexible enough to support more specific things in the future.

So we don't plan to be backwards compatible?

Mar 30 '18 21:03 langner

The above can be viewed as different inputs, and both are valid/digestible in most languages. In this case, a simple True would return the orbitals in what we would guess to be the most-used format (I would hazard a simple matrix without symmetry), while more complex options can be expanded upon in the future.
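
One way a consuming program could treat the boolean and the object forms uniformly is sketched below. The helper name `normalize_orbitals_option` and the default representation name `"matrix"` are assumptions made for illustration; only the `orbitals`, `representation`, and `sum_a_b` keys appear in the examples above.

```python
def normalize_orbitals_option(value):
    """Expand the `orbitals` request into a full options dict.

    Assumption: "matrix" names the simple matrix-without-symmetry
    default; it is not a value defined by any schema.
    """
    default = {"representation": "matrix"}
    if value is True:
        return default
    if value is False or value is None:
        return None  # nothing requested
    if isinstance(value, dict):
        # Explicit options override the defaults.
        return {**default, **value}
    raise TypeError(f"unsupported orbitals option: {value!r}")


print(normalize_orbitals_option(True))
print(normalize_orbitals_option({"representation": "grid", "sum_a_b": True}))
```

With this pattern the type check lives in one place, and the rest of the program only ever sees the expanded dict form.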

If you have a good idea for how to be both expressive and provide simple options, please do suggest it. It's also worth considering that downstream QM programs must actually implement this, which is more likely to happen if the first spec is straightforward.

Mar 30 '18 22:03 dgasmith

Don't forget that one of the most voted-on points was "aim for novice programmer". Related to https://github.com/MolSSI/QC_JSON_Schema/issues/39, I think introducing complexity that requires checking value type (is orbitals a bool, a compound object like the example above, an HDF5 file reference, ...) is much cleaner than key proliferation.

For the first pass I would encourage simple exports that would reproduce CClib's quantities for example. We can get fairly bogged down in making something complex that is not easily exportable by current codes while something like densities and orbitals are quite generally valuable even in list form.

So we don't plan to be backwards compatible?

I thought the conclusion we came to was that as long as the schema version was appropriately marked, this would be ok.

If that's the case, then the transition is made easier: the first implementation only works with the simple representation, which can be phased out with the second implementation. The problem is that cclib's quantities are too simple and inflexible for what I'll call the "average grad student workflow", and the internal representation now requires a potentially backward-incompatible update.
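
A version-marked transition of this kind could be handled by a reader along the following lines. This is only a sketch: the `schema_version` key, the integer version numbers, and the shape of the version 2 object (with `representation` and `data` keys) are all assumptions, not anything agreed in this thread.

```python
def read_orbitals(record):
    """Interpret the `orbitals` field according to the record's version.

    Hypothetical reader; the key names and version numbers are assumed.
    """
    version = record["schema_version"]
    value = record["orbitals"]
    if version == 1:
        # Version 1 (assumed): a plain nested list holding the matrix.
        return {"representation": "matrix", "data": value}
    if version == 2:
        # Version 2 (assumed): an object that names its representation.
        return {"representation": value["representation"], "data": value["data"]}
    raise ValueError(f"unknown schema_version: {version}")


old = {"schema_version": 1, "orbitals": [[0.7, 0.7], [0.7, -0.7]]}
new = {
    "schema_version": 2,
    "orbitals": {"representation": "matrix", "data": [[0.7, 0.7], [0.7, -0.7]]},
}

# Both versions normalize to the same internal form.
print(read_orbitals(old) == read_orbitals(new))
```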

Mar 30 '18 23:03 berquist

The problem is that cclib's quantities are too simple and inflexible for what I'll call the "average grad student workflow", and the internal representation now requires a potentially backward-incompatible update.

FWIW, we can be backwards incompatible in cclib if need be as well, as long as we plan ahead and bump up to 2.x at the right time.

I suppose it's better to err on the side of simplicity.

Mar 30 '18 23:03 langner

I second (or third) racing ahead and implementing, with backwards compatibility disregarded at this pre-1.0 stage, especially as only so much can be planned before implementation.

Apr 12 '18 01:04 loriab