OPTIMADE icon indicating copy to clipboard operation
OPTIMADE copied to clipboard

SMILES data type

Open merkys opened this issue 1 year ago • 7 comments

In #368 SMILES property for structures was proposed. PR #392 was proposed to introduce string-valued SMILES property, however, there were opinions that internal semantics of SMILES strings have to be respected, what does not fit nicely with "plain string" data type. Thus the current PR introduces both smiles data type and an associated property.

Fixes #368.

merkys avatar Dec 07 '22 07:12 merkys

The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

vaitkus avatar Dec 22 '22 17:12 vaitkus

The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

Fair point. However, I am not sure JSON Schema supports custom formats, I need to have a better look at it.

merkys avatar Dec 30 '22 07:12 merkys

The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

Fair point. However, I am not sure JSON Schema supports custom formats, I need to have a better look at it.

It does, see: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-validation-01#section-7.2

Lots of impenetrable waffle in the spec about it though -- not clear to me how we would announce what "format": "smiles" actually means within JSON Schema yet. We could at least include a regex for structural validation.

ml-evs avatar Dec 30 '22 12:12 ml-evs

The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

Fair point. However, I am not sure JSON Schema supports custom formats, I need to have a better look at it.

It does, see: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-validation-01#section-7.2

Lots of impenetrable waffle in the spec about it though -- not clear to me how we would announce what "format": "smiles" actually means within JSON Schema yet. We could at least include a regex for structural validation.

Reading your link:


Implementations MAY support custom format attributes. Save for agreement between parties, schema authors SHALL NOT expect a peer implementation to support such custom format attributes. An implementation MUST NOT fail to collect unknown formats as annotations. When the Format-Assertion vocabulary is specified, implementations MUST fail upon encountering unknown formats. Vocabularies do not support specifically declaring different value sets for keywords. Due to this limitation, and the historically uneven implementation of this keyword, it is RECOMMENDED to define additional keywords in a custom vocabulary rather than additional format attributes if interoperability is desired.


So, if I read this correctly, they are saying: "But, don't use this, use your own field instead", e.g., x-optimade-type?

The third (IMO very inelegant) option is to specify the format as "regex" + the best JSON Schema compatible regex we can come up with for SMILES strings. Implementations would then have to recognize specifically that regex and reverse-map it into the knowledge that the field is of OPTIMADE SMILES-type. However, in that case I think x-optimade-type is about 1e10 times better as a solution.

A happy side note: it appears the JSON Schema regex situation is more clear now than when we looked into it last time. They now define a "least common denominator subset" regex standard.

https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-01#name-regular-expressions

To me, this seems to finally provide a solution to the regular expression dilemma discussions (#42 #160). We should now simply follow their lead and standardize this both for our filter language and for format: regex in property definitions (which presently are not allowed at all). In contrast to JSON Schema, our property definitions should say that the regex MUST only use the subset.

rartino avatar Dec 30 '22 21:12 rartino

So, if I read this correctly, they are saying: "But, don't use this, use your own field instead", e.g., x-optimade-type?

The third (IMO very inelegant) option is to specify the format as "regex" + the best JSON Schema compatible regex we can come up with for SMILES strings. Implementations would then have to recognize specifically that regex and reverse-map it into the knowledge that the field is of OPTIMADE SMILES-type. However, in that case I think x-optimade-type is about 1e10 times better as a solution.

x-optimade-type sounds reasonable to me. Should I open a separate PR to add x-optimade-type to the specification or should I lump it together with this one?

merkys avatar Jan 17 '23 08:01 merkys

x-optimade-type sounds reasonable to me. Should I open a separate PR to add x-optimade-type to the specification or should I lump it together with this one?

Lets take it in the open PR on property definitions; I have another thing that should be adjusted in the specification there as well, moving the identifier to "$ID". If you want to hurry things along, feel free to do a PR against my branch for that PR, or don't and I'll add x-optimade-type as I add '$ID'.

rartino avatar Jan 17 '23 08:01 rartino

Lets take it in the open PR on property definitions; I have another thing that should be adjusted in the specification there as well, moving the identifier to "$ID". If you want to hurry things along, feel free to do a PR against my branch for that PR, or don't and I'll add x-optimade-type as I add '$ID'.

I am OK to wait. Thanks.

merkys avatar Jan 17 '23 09:01 merkys