QCSchema Keeping QCSchema in sync with QCElemental

QCSchema lags behind what's actually implemented AND DOCUMENTED in QCElemental. The QCSchema docs are where people look for info, so this creates a misleading impression about what QCSchema is/does in practice.

Some possible solutions:

QCSchema/docs is always what you get from running .schema()/autodoc on the myriad models in QCElemental
QCElemental models represent schema in development, and from time to time we concretize its state into release versions of QCSchema.
QCElemental models are not allowed to change until QCSchema changes.

My favorite option is 1. I'm okay with 2. I think 3 is a bad idea.

@dgasmith @bennybp

Jan 23 '20 15:01 mattwelborn

+1 option 1

extras field is effectively the "scratch space"/"in development" role

Jan 23 '20 18:01 loriab

As an external user, I would strongly push for choice 1.

Promoting option 1, IMHO sends a strong signal that this schema is only appropriate for QCElemental, and it's no longer intended as a cross-program interchange.

Option 3 sends the signal that the community input is driving the schema, but I can understand that's frustrating for QCelemental.

So I would strongly push for choice 2.

Jan 23 '20 22:01 ghutchis

Promoting option 1, IMHO sends a strong signal that this schema is only appropriate for QCElemental, and it's no longer intended as a cross-program interchange.

That's certainly not what I'd want signaled. Rather, I like option 1 for its single-source-of-truth property. Do you think calling qcel .schema() the docs/source with qcel as a reference implementation and doing the part of option 2 of periodically concretizing and publishing to QCSc versions would assuage that impression?

PRs suggesting changes to the data layout of qcel.models could be better marked with a "Schema" label and held open for comment to make sure changes don't sneak up on the community.

Jan 23 '20 23:01 loriab

I feel pretty strongly about this.

This is an independent repo with a set of existing issues, PR, etc. for the schema. IMHO, QCElemental should be a client of this repo (i.e., it depends on the schema). The single source of truth should be here.

Frankly, I'm unlikely to use QCElement because every project already has similar functionality (cclib, Open Babel, Avogadro, etc.). Until I saw this issue, I frankly didn't know it existed.

This repo was created for the express purpose (and declared by its description) of being the home of the QC schema.

Consider also that https://molssi-qc-schema.readthedocs.io/en/latest/auto_topology.html is not a subset of the QCElemental docs.

Please don't make the schema subservient to another project. If you want to use QCElemental to work on development versions, that's fine. But the standard version should be declared here and IMHO, all issues, discussion relevant to the schema should be in this repo, not via some tag in a different repo.

Jan 24 '20 01:01 ghutchis

I think a clear distinction here is that developing in QCElemental with option 1/2 doesn't necessitate using QCElemental as we will be exporting JSON-Schema to a repo (likely here). QCElemental is used to compose the schema, provide a Python-based structure for those who want it, or generate JSON-Schema like we do here. In the above scenarios I would much prefer 2) over 1) to for the comments listed above.

In general, we are finding QCElemental as a driving force for cross-program interchange because there is an implementation behind it that does something useful. A concrete schema implementation that has the bells and whistles needed forms a cornerstone of an ecosystem. We can now compute a dozen programs using QCSchema with QCEngine and I know of ~40M results computed with QCFractal using the Schema. There are now active learning programs, force field fitting toolkits, visualization tools, etc all built on top of the Schema and data repos like Fractal.

For example it would be great if Avogadro picked up QCEngine (which uses CCLib) instead of writing their own quantum chemistry compute routines. You probably need a few more programs with orbitals built in, but thats easily done. Come talk to us.

Jan 24 '20 14:01 dgasmith

I think if we are to drive adoption then option 3 is the best, with 2 being a more pragmatic second best. I don't think we are going to drop directly using QM codes in Avogadro, but adding something like QCEngine as another option would work assuming it offers the basic capabilities we need. I see the most value in codes adopting QC JSON natively, but there is a lot of value in something like QCEngine doing the translation.

As you know Chemical JSON was developed using something akin to option 1 from the start. I think it has utility, but it is something different to what we set out to achieve with QC JSON. It would be a shame to see the goal of forming a community schema recede, but we knew from the start it was tough to do. MolSSI is the organization with the funding and time to put in to efforts like this.

Jan 24 '20 14:01 cryos

@dgasmith - while I agree that you don't have to use QCElemental under option one, that choice implies that the 'ground truth' for the schema falls to that project, not to this repo. I'm repeatedly telling you that sends the wrong message to the community.

If you want to get more community feedback, I'm happy to ping others from the meeting to come here and comment.

Doing something useful with the schema is incredibly important, as is continual development.

I have a pretty big pool of calculations myself (~20M) and Harvard Clean Energy is.. 150M? That shouldn't factor into where and how the schema is standardized.

IMHO option 2 sounds like something you'll accept, and I think that's the pragmatic way forward.

Jan 24 '20 16:01 ghutchis

My primary goals are to:

Not write schema docs & types twice, where one gets autotranslated into proper JSON and basic sanity checked with pydantic and the other sits here with perhaps my "number"/"float"-type mistakes uncaught.
Collect issues and discussion on schema where the opinions are. That seemed to have dwindled in this repo, but if people are here, so let the discussion be also.

So option 2 is fine by me.

Jan 24 '20 16:01 loriab

I see the most value in codes adopting QC JSON natively, but there is a lot of value in something like QCEngine doing the translation.

I completely agree, but this is a very slow process as noted. Just to echo: QCEngine exists to help move the process along and provide demonstrable use cases.

I'm repeatedly telling you that sends the wrong message to the community.

I get that, but I am confused of why this is the case. A move to Elemental seems more like "MolSSI is building tools for a community built schema, let us build out a Python ecosystem for communication that can also be used in any language." Which seems like a reasonable message otherwise we are saying "here is a schema that has no uses or community".

I have a pretty big pool of calculations myself (~20M) and Harvard Clean Energy is.. 150M? That shouldn't factor into where and how the schema is standardized.

Definitely, which is why the number was a pretty minor part of a paragraph. I was attempting to paint a picture that schema is beginning to be accepted by some of the community by having useful libraries that support it.

Collect issues and discussion on schema where the opinions are. That seemed to have dwindled in this repo, but if people are here, so let the discussion be also.

I echo this quite strongly, Elemental has a strong user base that we just haven't been able to build here. If we can kick this repo along let us do it, but we so far haven't been successful at that.

I should note here that we have largely been taking strategy 3 so far where we add first to this repo and then to QCElemental. I think the only place where this isn't the case and we take strategy 2 is a geometry optimization input/output schema where we currently acknowledge that this repo is the single source of truth and we may have to change.

Jan 24 '20 18:01 dgasmith

Below is a proposal that tries to strike a balance between @loriab's points about correctness and de-duplication, and @ghutchis's points about community engagement, a single source of truth, and the QCSchema existing separate from QCElemental. It sort of falls between options 2 and 3.

Community input

1: The MolSSI/QCSchema repository MUST continue to exist. 1.1: Input from the community about QCSchema MAY be filed as issues on MolSSI/QCSchema. 1.1.1: These issues MUST be linked as issues in MolSSI/QCElemental. 1.1.2: The maintainers of MolSSI/QCSchema MAY encourage discussion to be moved to MolSSI/QCElemental if it is more model-related. 1.2: Input from the community about QCSchema MAY be filed as issues on MolSSI/QCElemental. 1.2.1: These issues MUST be linked as issues in MolSSI/QCSchema if they affect data layout/schema. 1.2.2: These issues MAY be linked as issues in MolSSI/QCSchema if they affect implementation details. 1.2.3: The maintainers of MolSSI/QCElemental MAY encourage discussion to be moved to MolSSI/QCSchema if it is more schema-related.

Releases

2: MolSSI/QCSchema MUST have versioned releases.

QCElemental models

3: Models in MolSSI/QCElemental MUST be marked as to whether they are part of QCSchema. 3.1: Models marked as part of QCSchema implemented in MolSSI/QCElemental/master are the single source of truth. The actual JSONSchema files in MolSSI/QCSchema/master MUST be generated from these models. 3.1.1: Accepted changes to the QCSchema (see section 1) MUST be implemented in MolSSI/QCElemental, and MolSSI/QCSchema MUST be subsequently updated. 3.1.2: PRs that change MolSSI/QCSchema MUST originate in MolSSI/QCElemental. 3.1.2.1: These PRs MUST be cross-linked as issues in MolSSI/QCSchema.

Documentation

3.1.2: The documentation of QCSchema in MolSSI/QCSchema MUST be auto-generated from the QCSchema models in MolSSI/QCElemental. 4: The default documentation on RTD MUST correspond to the latest released version of MolSSI/QCSchema. 4.1: The RTD corresponding to MolSSI/QCSchema/master MAY also be made available.

Items 3.1 and 3.1.1 are intended to establish a single source of truth while keeping MolSSI/QCSchema and MolSSI/QCElemental up to date and synchronized. Ideally, MolSSI/QCSchema would be the single source of truth, but models are a superset of the schema, and the thing that actually gets tested. As a result, we are likely to end up with correct models and incorrect schema.

Jan 24 '20 18:01 mattwelborn

So now we do line-item voting, Security Council veto rules?

Just kidding. Matt's proposal sounds good to me. Only a couple items I want to make sure we're on the same page about:

1.2.1 QCEl MUST cross link to QCSc on when QCEl issues/PRs arise affecting data layout and MAY cross link on those affecting implementation details. So I'd expect https://github.com/MolSSI/QCElemental/issues/155 to need an issue opened here but probably not https://github.com/MolSSI/QCElemental/issues/169. For PRs, I can probably put in a linter that checks if PRs will affect the rendered schema, so a cross post can be made. Does this sound right or is it wanted that implementation issues/PRs cross-link, too?
Difference between schema stages.
- preliminary schema, those being pre-implementation like https://github.com/MolSSI/QCSchema/issues/67, whose discussion and any PRs (about which we can be more relaxed about strict JSON formatting) belong here in QCSchema
- operational schema, those that are adopted by QCSc and implemented in QCEl, can have issues in either repo (section 1 cross-link guidelines apply) but PRs should be solely in QCEl.
- qcel schema, those like the OptimizationResult that are useful for QCEl and dependent projects but won't or haven't yet been broached to QCSc.

Jan 25 '20 19:01 loriab

+1 I think this seems like a pragmatic approach.

Jan 25 '20 20:01 ghutchis

@loriab - If you can add a linter or Github action to check if PRs affect the schema, that would be a great solution.

Jan 25 '20 20:01 ghutchis

Sounds like a reasonable and pragmatic approach to me too. I personally never thought this would be easy, but I am passionate about seeing it move forward and gain wider adoption as features are added.

Jan 25 '20 20:01 cryos

Apologies for accidentally closing this - mouse slippage...

Jan 25 '20 20:01 cryos

I am not ignoring this on purpose. I do want to weigh in on this soon, however I am currently traveling and teaching (and there is a lot here to absorb). I will try to come up with some coherent thoughts tonight.

Jan 25 '20 20:01 bennybp

Thanks for this issue! Similar to @ghutchis I didn't got the connection between QCElemental and QCSchema in the first place. It would be really helpful to have the list from above, or whatever the final form will be, as a contributing guideline for this repository or in some visable place. I for one didn't consider contributing to QCSchema so far because I didn't really knew how.

Jan 27 '20 20:01 awvwgk

So now we do line-item voting, Security Council veto rules?

Just kidding. Matt's proposal sounds good to me. Only a couple items I want to make sure we're on the same page about:

1.2.1 QCEl MUST cross link to QCSc on when QCEl issues/PRs arise affecting data layout and MAY cross link on those affecting implementation details. So I'd expect MolSSI/QCElemental#155 to need an issue opened here but probably not MolSSI/QCElemental#169. For PRs, I can probably put in a linter that checks if PRs will affect the rendered schema, so a cross post can be made. Does this sound right or is it wanted that implementation issues/PRs cross-link, too?

Difference between schema stages.

preliminary schema, those being pre-implementation like #67, whose discussion and any PRs (about which we can be more relaxed about strict JSON formatting) belong here in QCSchema

operational schema, those that are adopted by QCSc and implemented in QCEl, can have issues in either repo (section 1 cross-link guidelines apply) but PRs should be solely in QCEl.

qcel schema, those like the OptimizationResult that are useful for QCEl and dependent projects but won't or haven't yet been broached to QCSc.

Yes to all of this. I have updated my earlier comment accordingly.

We should look into making a bot that creates cross-linked issues. For example, if an issue is made on MolSSI/QCSchema titled "Foo", then a corresponding issue gets made on MolSSI/QCElemental titled "QCSchema issue #123: Foo". This issue's body would then just be a link to MolSSI/QCSchema#123. On the QCEl side, we would do the same thing, but only for issues that have the "Schema" tag.

Jan 28 '20 15:01 mattwelborn

We should look into making a bot that creates cross-linked issues. For example, if an issue is made on MolSSI/QCSchema titled "Foo", then a corresponding issue gets made on MolSSI/QCElemental titled "QCSchema issue #123: Foo". This issue's body would then just be a link to MolSSI/QCSchema#123. On the QCEl side, we would do the same thing, but only for issues that have the "Schema" tag.

I wasn't going to bring it up until it was proof-of-principle, but since Matt mentioned it, I'm working on https://github.com/MolSSI/QCElemental/pull/204.

The notion is that there's a GHA that runs on every PR of qcel and generates the json schema files from a list of models (provenance, molecule, etc.). Then it runs a diff against qcsk (QCSchema) master and if there's a change, creates a branch (working up to this point) and PRs that to qcsk and returns red X to qcel. The PR here can be a place for discussion, any changes can be added to the qcel PR, and when auto PR is merged here, then qcel will show green check. This is my first foray into GHA, so I or GHA may still fail, but it's going easily so far. How does the plan sound?

A side issue is that the QCSchema repo is presently a python module. This is for historical reasons of testing and documentation. Since the former is handled at qcel and the latter can be reworked, what does everyone think of the qcsk repo being JSON files instead, like https://github.com/loriab/QCSchema/tree/py2json/qcschema_json ?

Jan 28 '20 18:01 loriab

Adding here a link to a PR I created a while back that seeks to standardize the QCSchema models to a common interface. I'm in favor of implementing the models only once, and exporting that json schema from the pydantic models. What I propose in this (draft) PR would be a suggestion for what v2 of the schema could look like: https://github.com/MolSSI/QCElemental/pull/287

Open to comments/suggestions. Thanks!

Jun 10 '22 16:06 coltonbh