QCSchema replace schema with autogen from qcelemental

Description

In response to the discussion at #68 on maintainability, synchronization, accuracy, and community involvement, this is the initial commit refactoring schema storage here at QCSchema as autogenerated output from QCElemental. Overview here.

After I fill in questions and topics for discussion below, I'll be pinging the main participants of #68. I don't expect this to be a quick PR, but in the end it should close #68. (qcsk := QCSchema; qcel := QCElemental)

Questions

[ ] This is not a translation of qcsk into a qcel rendering but a reconciliation of the two representations. Changes are minor but present, so look it over and make sure nothing alarming. Because of the change from python to json in the dev/ dir, diffs are useless, but the things to compare are all the deleted qcschema/dev/*.py files to the new qcschema/dev/QCSchema.schema file.
[ ] The description fields for all the schema have been merged btwn qcsk and qcel and moderately edited for coherence. Suggested edits and elaborations and guidance welcome.
[ ] integer vs. number+multipleof issue. integer isn't part of JSON Schema Core, but it is part of JSON Schema Validation (https://pydantic-docs.helpmanual.io/usage/schema/#json-schema-types), from which we've also been using pattern, anyof, and enum. Ok to keep using integer then (which is pydantic's normal export for ints) rather than persuading pydantic to export number + multipleOf?
[ ] The separate (and sometimes redundant) files in qcschema/data/vdev/ are what I consider the core qcsk schema. Other schema from qcel haven't been exported to here. Even within the core schema are sections that might be considered associated with the qcel implementation. At a minimum, the fields where these live need to remain in qcsk. Should the subschema be adopted or removed?
- Molecule.Identifiers
- AtomicInput.Protocols
[ ] While it's a given that arrays shall be stored flat and units shall be atomic, I've added fields (transparent to validation) to hold shaped dimension and units info separate from the description. This follows from what was started in Wfn. Is this a satisfactory way to handle these matters?
[ ] List provenance. Once upon a time, I thought the provenance field should be either a single entry (of creator/version/routine/etc.) or a list of entries so that one could trace the layers of software that touched an instance. I think I got the list provenance added to this repo. since then, I was talked out of list provenance, ripped it out of psi4.Molecule, and haven't missed it. On the other hand, Horton folks like it. Presently in this repo, output.provenance is single or list and molecule.provenance and all others are single only. How do we want it to be?
[ ] Required fields.
- Two major use cases for required fields list
  - Initialization For a human writing JSON or the constructor for a qcsk implementation or for efficient web transport of non-default data, there is a minimum set of fields that can be used to construct a schema instance.
  - Storage For software wanting to use qcsk records or a programmer writing an implementation, there is a set of fields that should be present and already validated and defaulted so that no more post-JSON validator logic is required.
- Points
  - While many models require practically all fields (e.g., center data needs angular momenta, exponents, and coefficients), for some there's a big gap between required, deducible (by implementation, not by JSON validation), and optional fields. An example is Molecule, which I think of having the following breakdown:
    - required: symbols, geometry
    - deducible: atomic_numbers, masses, mass_numbers, real, fix_com, fix_orientation, labels, fragments, fragment_multiplicity, fragment_charges, molecular_charge, molecular_multiplicity, provenance, extras ({})
    - optional: fix_symmetry, connectivity, comment, name
  - Both previously (hand-written) and newly (autogen), the required fields is the initialization set (required). But it can be useful for implementation writers to know the storage set (required+deducible). It could even be useful to be able to call JSON validation on the storage set. Does this seem important, and how should we indicate these sets of fields?

Status

[ ] Ready to go

Sep 03 '20 14:09 loriab

This pull request fixes 1 alert when merging b6136480e9aa2bcefd3b95419a96f1b38bd248df into 09cb381e77704e809fb4d8cee4be47a7e17d6673 - view on LGTM.com

fixed alerts:

1 for Duplicate key in dict literal

Sep 03 '20 14:09 lgtm-com[bot]

This pull request fixes 1 alert when merging d7b31eb2c52af296576c44a1791cf73911a82f5a into 09cb381e77704e809fb4d8cee4be47a7e17d6673 - view on LGTM.com

fixed alerts:

1 for Duplicate key in dict literal

Sep 03 '20 15:09 lgtm-com[bot]

pinging @ghutchis @cryos @mattwelborn @dgasmith @bennybp @awvwgk @wilhadams as persons involved in #68 discussion who may have opinions. I'll leave this open at least another week. If discussion doesn't develop by then, I'll seek ordinary MolSSI merge approval.

Thanks for any comments.

Sep 14 '20 04:09 loriab

For me I would go back to concerns we had when we started discussing QCSchema - not tying it to one particular code/implementation. This feels a lot like that is what is happening, and that is how most of the other standards have evolved including Chemical JSON that I developed years before MolSSI was founded. Maybe it is unavoidable, and it is something we should simply accept while aiming to establish a flexible standard we can use. Not trying to be negative, just to summarize thoughts.

I would like to develop import/export of appropriate data to/from Chemical JSON making sure we can round trip datra, and build up some examples of simple files, and then increasing complexity that are version controlled. E.g. molecule with/without bonds, I am not sure what the status of more complex output is (equivalent of FCHK, or something I can generate orbitals/cubes from). I lost track a little in the midst of pandemic and changing jobs...

Sep 14 '20 14:09 cryos

In general, I'm all in favor of automation, but I'd agree with @cryos that we started with a community workshop, and the end result of this merge seems to imply "anything that happens to QCElemental becomes the schema."

Personally, I'd still rather see the reverse - that the schema lives here, with regular discussion about extension / changes, and auto-generated bits get submitted to QCElemental as a pull request.

-OR-

I don't understand why QCElement can't use a prefix / extension rather than force a change to the schema. What I remember of the in-person discussion was the need for extensions to the schema for particular programs, and they would get standardized. My analogy was with vendor-specific CSS prefixes.

molecule.identifiers has been discussed and is much needed in the core (e.g., tracking a molecule through a workflow, retaining SMILES bond orders, etc.). Most programs permit a title or comment.
atomicinput.protocols seems more specific to QCElemental

I know @berquist was working on an implementation for cclib: https://github.com/cclib/cclib/pull/798

Sep 14 '20 17:09 ghutchis

Thanks for the comments and concerns. While I'm personally fond of qcel, be assured that I don't want it to encroach on the independence of qcsk, and my primary motives are still of maintenance and engagement, https://github.com/MolSSI/QCSchema/issues/68#issuecomment-578202954.

I worry QCSchema the repository will stagnate and be a source of confusion to new implementors unless there's a project (preferably multiple projects) regularly running their "qcschema" instances through some sort of JSON validator tied to this repo. qcel hasn't been, and I don't know of other projects that might have been. (I suspect not, b/c there was a ( vs [ syntax error preventing this repo even being written to JSON.) That is, I think I agree with @cryos's thoughts about research-driven standards.

Personally, I'd still rather see the reverse - that the schema lives here, with regular discussion about extension / changes, and auto-generated bits get submitted to QCElemental as a pull request.

Regarding qcel dominating the development of qcsk, I hoped the approach in this PR could alleviate the worst concerns in #68. The reverse you describe would be best. However since pydantic classes can generate JSON schema, but JSON schema can't generate pydantic classes, that forces the dependency order. This PR does incorporate an important aspect of the reverse idea in that qcsk repo can block changes in qcel repo that alter QCSchema. Only after proposed changes are merged here, presumably after discussion, do schema-altering PRs to qcel become mergeable. (Well, not at the moment b/c this is just the initial delta schema rearrangement PR so that future diffs are meaningful.) So I think the loss-of-independence risk is greater for qcel than for qcsk. It'd be helpful to separate concerns about this future update process and concerns about the rearrangement and changes in this initial delta PR.

atomicinput.protocols seems more specific to QCElemental

Broadly, it's the user's opportunity to control the data layout. If the user (person or database) knows they don't care about the ASCII output, prevent it from being included. If the user (person or geometry optimizer) knows they don't care about the Fock matrix, prevent it from being included. Mostly of concern for minimizing internet transfer size. Definitely used by QCFractal, but I think it's of general interest.

I don't understand why QCElement can't use a prefix / extension rather than force a change to the schema. What I remember of the in-person discussion was the need for extensions to the schema for particular programs, and they would get standardized. My analogy was with vendor-specific CSS prefixes.

I may not be understanding your suggestion. I agree on the general progression from unformalized extension -> formalized and adopted. What the PR does for most big schemas (provenance is the exception that comes to mind) is (1) always include a field extras (hash type with string keys and anything values) that can be used freely for development fields like psi4:qcvars and (2) batten down on additionalProperties and forbid any other top-level fields not in the formal schema. It makes pass-through easier b/c everything unexpectable is behind one field. Is this rearrangement too abrupt or otherwise hard to work within?

I would like to develop import/export of appropriate data to/from Chemical JSON making sure we can round trip datra, and build up some examples of simple files, and then increasing complexity that are version controlled. E.g. molecule with/without bonds, I am not sure what the status of more complex output is (equivalent of FCHK, or something I can generate orbitals/cubes from).

Also in the course of this PR series, over at qcel I've added a bit into testing where the usual python tests drop JSON instances, then these are later run through JSON validation against the qcel-export qcsk. Those are generated, not stored, but can form another body of examples.

Sep 14 '20 20:09 loriab

Just a couple of comments before reading the actual changes.

Regarding the initial PR text:

While it's a given that arrays shall be stored flat and units shall be atomic, I've added fields (transparent to validation) to hold shaped dimension and units info separate from the description. This follows from what was started in Wfn. Is this a satisfactory way to handle these matters?

This is similar to datasets and metadata in HDF5 and will map nicely to that format.

Presently in this repo, output.provenance is single or list and molecule.provenance and all others are single only. How do we want it to be?

I think this is less important if there is the ability to trace in some other way the previous calculations that were performed to create a single input, such as performing a frequency calculation after a separate geometry optimization. I know that we discussed such calculations, plus finite difference, SAPT, and so on as being separate in the eyes of logical QCSchema blocks, but we will need a way of connecting them. List provenance is one way (albeit lossy) of helping with this. If there is provenance (or extensions) on the level of granularity one up from fields (for example, SCF energy) so that program modules can register their own provenance (SCF-specific info from this package, CC-specific info from another package). This doesn't answer the question, but I haven't read the PR yet.

Regarding later comments:

I worry QCSchema the repository will stagnate and be a source of confusion to new implementors unless there's a project (preferably multiple projects) regularly running their "qcschema" instances through some sort of JSON validator tied to this repo.

Because I haven't been watching this repo, only QCSchema, I didn't know how deeply a part of qcel the schema was, having only hooked into the periodic table and physical constants. I know parts of Psi4 were migrated here, but didn't check the contents. So I am/was confused from the other direction.

If the user (person or geometry optimizer) knows they don't care about the Fock matrix, prevent it from being included. Mostly of concern for minimizing internet transfer size. Definitely used by QCFractal, but I think it's of general interest.

This will definitely be useful for source programs themselves (engines?); for translators it's less clear, but its inclusion doesn't hurt.

For the prefix/non-standard extensions,

Is this rearrangement too abrupt or otherwise hard to work within?

I think your example is actually what we agreed on at the meeting.

Ultimately, we need to have more substantiative examples than what's in https://github.com/MolSSI/QCSchema/tree/09cb381e77704e809fb4d8cee4be47a7e17d6673/tests/simple, unless these fully cover the current schema, excluding the geometry optimization differences. If the generated qcel examples fit the bill then that's good, otherwise maybe something from QCFractal?

Sep 15 '20 04:09 berquist

This PR does incorporate an important aspect of the reverse idea in that qcsk repo can block changes in qcel repo that alter QCSchema. Only after proposed changes are merged here, presumably after discussion, do schema-altering PRs to qcel become mergeable.

This doesn't change the fact that in order for a change to be made to the schema, it must be made to the Pydantic models in QCElemental. Thus, QCSchema is inherently dependent on the implementation in QCElemental. One specific concern I have is that QCSchema seems to be using JSON draft 04 only because Pydantic does. If the choice of JSON specification is made for that reason, then not only is QCSchema dependent on QCElemental, but also on Pydantic's choice of specification to support.

Pydantic is working off draft07, I think. That draft04 was forcibly added by me to imitate qcsk. https://github.com/MolSSI/QCElemental/pull/237/files#diff-7a9af1eb4f18dc820783bf210d2cfc95R197 https://github.com/samuelcolvin/pydantic/issues/1478

Sep 18 '20 06:09 loriab

A couple of thoughts from my point of view (cclib dev):

I think QCSchema/QCElemental/QCEngine are all really cool - a lot of excellent work being done here! I've been following the projects, but not closely.
I too feel it would be healthier to have a schema not tied to a specific proejct/organization, or at least tied in equal measure to several projects/organizations.
I actually like the idea of having the source of truth in Python or some other more robust representation - and exporting JSON or whatnot from that. I never quite understood why the standard needs to be specific to JSON/XML or another serialized form.
Perhaps a fork of QCSchema/QCElemental that sets a common standard as a consensus of several orgs (MolSSI, OpenChemistry, others) is something to try?
Communities are great, but capricious. If QCEngine fails, will QCElemental have a life of it's own. Would it be used by someone else to develop QCEngine2 from scratch?
I'm somehow reminded of Quixote (https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-3-38) - I'm not sure how successful that was and whether it's still alive at all, but I wonder if there are takeaways about what worked well and what could be done better this time around.

Sep 22 '20 17:09 langner

While it is undoubtedly best for the schema's generality to not originate from QCElemental, a significant push of this PR is that QCElemental ended up receiving much more attention, community effort, and uptake than the QCSchema repo as the library stack underpins a variety of efforts (OpenFF, Psi4 ecosystem, QCArchive, etc). A practical issue is if the schema implementation of Elemental is moved to this repo or the schema is expanded here, will it realistically influence the uptake of the project for those commenting here?

It became reasonably clear that we ended up with the schema + 1 problem without an implementation driving requirements and uptake. But then the driving project requirements became imbalanced against community requirements since there wasn't much push for the schema itself. Perhaps this is the inevitable issue with schema without more explicit resources focused on promoting the schema itself than was allocated for this project.

It would be terrific if MolSSI could weigh in on future directions.

Sep 23 '20 00:09 dgasmith

I am not aware of any continued effort with Quixote @langner, there were some great ideas and there was a lot of energy around approaches being explored at the time. I think the QCSchema will likely end up being supported to the extent that many other formats are to interoperate with the projects that use it, our original goals may not be realized if it is driven by one or two dominant projects. It was valuable to gather community feedback, and to explore how to better document/validate JSON formats.

This is one area we had hoped that MolSSI could help, it may see wider uptake. It would be great to see a community schema developed even if this is not the one, several of us have our pet schemas if not and the format translation in the form of cclib that can add one more format to their repertoire and be easily reused in other projects.

Sep 23 '20 13:09 cryos