OPTIMADE icon indicating copy to clipboard operation
OPTIMADE copied to clipboard

InChIKey property

Open merkys opened this issue 1 year ago • 11 comments

This PR adds InChIKey property for structure entries. InChIKey is a chemical structure descriptor alternative to SMILES (proposed in #368). InChIKey is said to avoid the the ambiguity the SMILES possesses, moreover, it does not have any internal structure (essentially being a "chemical checksum"), thus there should be no issues with its comparisons.

Pinging people who have expressed their interest for comments: @eimrek @utf @Austin243

merkys avatar Jun 08 '23 07:06 merkys

Workshop: after the above workshop comment has been considered/handled this should be merged.

We have discussed if "complicated derived non-unique descriptors" really belongs inside the structures entry. However, before we have a decision/design on where they should go, we should accept fields that are useful and desired into the structures endpoint for now.

A similar regard holds for the desire to allow application area/subconsortia prefixes. Until we have a model for that, we merge fields without them.

rartino avatar Jun 09 '23 07:06 rartino

@merkys @ml-evs What do you think - should we still add this to "core OPTIMADE" as decided in the web meet. Or, now that we are hopefully somewhat close to merging #473, should all of these 'chemically oriented fields" currently waiting in PRs by @merkys instead go in its own namespace?

These PRs are affected: #466 #465 #436 #398 - and the question is if they should be labeled 'status/blocked' (blocking on #473) or 'status/waiting-for-update' (since they all currently are sitting with comments to address before merging)

Edit: possibly also #400, #396, or would those go in a "bio" prefix?

rartino avatar Jan 10 '24 09:01 rartino

@merkys @ml-evs What do you think - should we still add this to "core OPTIMADE" as decided in the web meet. Or, now that we are hopefully somewhat close to merging #473, should all of these 'chemically oriented fields" currently waiting in PRs by @merkys instead go in its own namespace?

These PRs are affected: #466 #465 #436 #398 - and the question is if they should be labeled 'status/blocked' (blocking on #473) or 'status/waiting-for-update' (since they all currently are sitting with comments to address before merging)

Edit: possibly also #400, #396, or would those go in a "bio" prefix?

On purely technical/scientific terms I think this field would be perfect to seed a cheminformatics namespace (along with SMARTS/SMILES etc -- would have to figure out how to allow filter_smarts as a custom URL param too...). However on a purely development practice, I worry that we don't have the level of engagement or scale to spread ourselves so thinly across these various namespaces, in which case just loading up the core OPTIMADE namespace is maybe preferable. Happy to discuss!

ml-evs avatar Jan 10 '24 22:01 ml-evs

@ml-evs maybe we can use this example to try out the infrastructure and see where we hit snags? I'm not sure we absolutely need engagement at this point, we already have 4-6 PR:s for properties to place in such a namespace-provider standard, which we can do under a "v0.1" to mark that things are highly experimental. If people show up who want to manage this, we'll happily hand off the repo to them.

I've created a couple of new repos for this:

  • https://github.com/Materials-Consortia/definition-provider-template : GitHub template repo to create new definition-provider repos
  • https://github.com/Materials-Consortia/namespace-cheminformatics : live repo for the cheminformatics prefix where we can try to integrate @merkys cheminformatics definitions and see how far we get.

rartino avatar Jan 11 '24 00:01 rartino

Great! Thanks for this @rartino -- I definitely stalled in my attempts to do the same thing. I will try to migrate https://github.com/Materials-Consortia/optimade-stability-namespace in the same direction.

ml-evs avatar Jan 11 '24 14:01 ml-evs

@merkys @ml-evs What do you think - should we still add this to "core OPTIMADE" as decided in the web meet. Or, now that we are hopefully somewhat close to merging #473, should all of these 'chemically oriented fields" currently waiting in PRs by @merkys instead go in its own namespace?

These PRs are affected: #466 #465 #436 #398 - and the question is if they should be labeled 'status/blocked' (blocking on #473) or 'status/waiting-for-update' (since they all currently are sitting with comments to address before merging)

I agree that these properties #466, #436, #398 could go to their own namespace. I will try to describe them in https://github.com/Materials-Consortia/namespace-cheminformatics. Not sure about #465 though, to me it seems generic enough to appear in the core specification, but again, I might be speaking from cheminformatics PoV.

Edit: possibly also #400, #396, or would those go in a "bio" prefix?

Properties introduced in #400 make sense to macromolecules only, but hierarchical grouping introduced in #396 might have an even wider usage than in cheminformatics. Thus I would not rush them under namespace-cheminformatics umbrella for the time being.

merkys avatar Jan 15 '24 10:01 merkys

InChIKey has been implemented in https://github.com/Materials-Consortia/namespace-cheminformatics in https://github.com/Materials-Consortia/namespace-cheminformatics/pull/1.

Now about the remaining cheminformatics PRs. #436 introduces a new SMILES OPTIMADE data type and #398 introduces a new URI query parameter. I wonder whether property definition format and namespaces are ready to accept such extensions? If not, is this something that should be allowed to be extended in namespaces?

merkys avatar Jan 16 '24 13:01 merkys

Now about the remaining cheminformatics PRs. #436 introduces a new SMILES OPTIMADE data type and #398 introduces a new URI query parameter. I wonder whether property definition format and namespaces are ready to accept such extensions? If not, is this something that should be allowed to be extended in namespaces?

Once the current property defs etc. are merged, lets work on a similar design for user-defined datatypes. I'm thinking a similar declarative format as for units, properties for datatypes, where one with human language declare how every operator should work. However, I don't want to draft this until the property framework is merged.

The need for user-defined filer languages should perhaps inform the design of #398. Can we instead allow some syntax for the usual filter= to provide a list of filters with some kind of specifier what kind of filter it is? Then we can allow user-defined filter languages without new query parameters.

rartino avatar Jan 16 '24 14:01 rartino

Once the current property defs etc. are merged, lets work on a similar design for user-defined datatypes. I'm thinking a similar declarative format as for units, properties for datatypes, where one with human language declare how every operator should work. However, I don't want to draft this until the property framework is merged.

Makes sense.

The need for user-defined filer languages should perhaps inform the design of #398. Can we instead allow some syntax for the usual filter= to provide a list of filters with some kind of specifier what kind of filter it is? Then we can allow user-defined filter languages without new query parameters.

For now queries in filters act on property values only. Query by SMARTS will not be bound to a specific property. A possible solution would be to allow queries on entries themselves by introducing property-less operators, viz. /structures?filter=SMARTS "[CX4]".

merkys avatar Jan 17 '24 08:01 merkys

The need for user-defined filter languages should perhaps inform the design of https://github.com/Materials-Consortia/OPTIMADE/pull/398. Can we instead allow some syntax for the usual filter= to provide a list of filters with some kind of specifier what kind of filter it is? Then we can allow user-defined filter languages without new query parameters.

For now queries in filters act on property values only. Query by SMARTS will not be bound to a specific property. A possible solution would be to allow queries on entries themselves by introducing property-less operators, viz. /structures?filter=SMARTS "[CX4]".

Good point. After some additional thinking, I've realized that we already (of course) have a facility for prefixed ~~command line~~ query parameters, so #398 can just use that with the _cheminfo_ prefix. It can be included in the cheminformatics namespace provider specification right now by simply documenting its intended existence and operation in human-readable form, e.g., in the top README and/or the description field of the 'cheminformatics standard' source file. Moving forward, we need to tie all this machinery into OpenAPI schemas, thus making the declaration of this query parameter in an OpenAPI-compatible way.

rartino avatar Jan 17 '24 08:01 rartino

I suppose that by command line parameters you mean URL query parameters. Indeed, custom query parameters are already allowed. Thus #398 already can go to cheminformatics namespace.

merkys avatar Jan 17 '24 08:01 merkys