Overly strict `chemical_formula_anonymous`

Open shyamd opened this issue 3 years ago • 4 comments

It seems we have an ordering requirement for the anonymized chemical formula that shouldn't be necessary for the actual search. The specification says "the elements are instead first ordered by their chemical proportion number": https://github.com/Materials-Consortia/OPTIMADE/blob/master/optimade.rst#chemical-formula-anonymous

A2BC == ABC2 == AB2C when performing an anonymized search. Is this something we can relax?

shyamd, Oct 08 '20 14:10

The current ordering requirement seems to be in place to make life easier for both servers and clients:

  • for servers: chemical_formula_hill can now be stored as a simple string in the backend, and searches for this property translate to straightforward string matching. If we were to relax the ordering requirement, this would become more difficult.
  • for clients: if a server is free to return any of the 56 possible variants for A3B2CDEFGH, the client will have to parse and canonicalize the field on its own.
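
For illustration, the client-side canonicalization mentioned in the second point could look roughly like the sketch below. It is not taken from the specification or from any OPTIMADE library: the function name `canonicalize_anonymous` is hypothetical, it assumes the largest proportion comes first (as in A3B2CDEFGH), it only handles the single-letter labels A to Z, and it does not reduce the formula by a common divisor.

```python
import re
from itertools import permutations

# One anonymous label (e.g. "A" or "Aa") followed by an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]*)(\d*)")
_LABELS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumption: at most 26 elements


def canonicalize_anonymous(formula: str) -> str:
    """Relabel an anonymous formula so proportions appear in descending order."""
    tokens = _TOKEN.findall(formula)
    # Reject input that the simple tokenizer could not fully consume.
    if "".join(label + count for label, count in tokens) != formula:
        raise ValueError(f"cannot parse anonymous formula: {formula!r}")
    counts = sorted((int(c) if c else 1 for _, c in tokens), reverse=True)
    if len(counts) > len(_LABELS):
        raise ValueError("this sketch only supports up to 26 elements")
    # A count of 1 is left implicit, as in "A2BC".
    return "".join(
        label + (str(n) if n > 1 else "") for label, n in zip(_LABELS, counts)
    )


# The three spellings from the original post collapse to a single string.
print({canonicalize_anonymous(f) for f in ("A2BC", "ABC2", "AB2C")})

# Distinct ways to spread the proportions of A3B2CDEFGH over eight labels.
print(len(set(permutations([3, 2, 1, 1, 1, 1, 1, 1]))))
```

With a fixed ordering, this step happens once on the server at ingestion time instead of in every client.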

merkys, Oct 09 '20 06:10

I can understand this for chemical_formula_hill, since it has a defined structure and order. For anonymous and other less well-defined chemical formula fields, this could be problematic.

shyamd, Oct 13 '20 14:10

As long as chemical_formula_anonymous is specified as a string, queries on it need to (at least IMO) follow string comparison semantics and cannot allow, e.g., "A2BC" == "ABC2".

We have discussed an alternative data type for chemical formulas to allow "chemical formula semantics comparison", but never really agreed on how that would be represented or exactly what it means. I notice that even with that in place, the "anonymous formula comparison semantics" are probably not the same.

The problem with allowing many different comparison semantics for different types of data is that every specific semantic adds quite a bit of load to the server-side implementations.
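
As a rough illustration of that load, the sketch below shows what a relaxed ordering would mean if the field kept plain string comparison semantics: a single anonymous query would have to be expanded into one string match per label ordering. The helper names `string_variants` and `relaxed_filter` are hypothetical, and the sketch again assumes single-letter labels only.

```python
from itertools import permutations

_LABELS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumption: at most 26 elements


def string_variants(proportions):
    """All distinct anonymous-formula strings for the given proportions."""
    variants = set()
    for perm in set(permutations(proportions)):
        variants.add(
            "".join(
                label + (str(n) if n > 1 else "")
                for label, n in zip(_LABELS, perm)
            )
        )
    return sorted(variants)


def relaxed_filter(proportions):
    """OR together one string comparison per variant."""
    return " OR ".join(
        f'chemical_formula_anonymous="{v}"' for v in string_variants(proportions)
    )


print(string_variants([2, 1, 1]))  # ['A2BC', 'AB2C', 'ABC2']
print(relaxed_filter([2, 1, 1]))
```

With the canonical ordering, the same query stays a single comparison, `chemical_formula_anonymous="A2BC"`.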

rartino, Jun 07 '21 13:06

I agree with @rartino, and would prefer not to have to support internal semantics in queries on string-valued properties.

merkys, Jun 07 '21 14:06

I think this can be closed as there was general consensus that canonicalizing on our particular anonymous formula format adds a lot of value in terms of search without much implementation overhead.

ml-evs, Mar 25 '24 13:03