OPTIMADE
OPTIMADE copied to clipboard
Support for SMARTS filter language (`filter_smarts`)
In #368 support for SMARTS filter language was suggested. This PR:
- Describes how external filter languages should be supported in OPTIMADE;
- Introduces the first one of them, SMARTS.
In short, SMARTS queries are to be passed using filter_smarts
URL parameter. If appearing together with filter
in the same request, different filters are combined with logical AND.
SMARTS isn't really my field, so even if this PR is simple and I see no formal issues with it, I think it should be reviewed and approved by people who want this feature in the standard and can see themselves implement it down the line.
(I could add that I am a bit concerned over the lack of standardization of the SMARTS language discussed in #368 where it seems one may find different results depending on what SMARTS library the backend is using, so are these SMARTS queries really interoperable? But this isn't an issue I feel strongly about if there are people that want this feature.)
(I could add that I am a bit concerned over the lack of standardization of the SMARTS language discussed in #368 where it seems one may find different results depending on what SMARTS library the backend is using, so are these SMARTS queries really interoperable? But this isn't an issue I feel strongly about if there are people that want this feature.)
I am concerned as well about the same, but SMARTS seems to be the state of the art for at least a decade now. I suggested SMARTS both as a constructive solution for chemical structure search (need expressed in #368), as well as an argument to my thesis that there is no simple way to do it. We may as well scrap the SMARTS proposal by leaving each implementation to sort out how they would like to do chemical substructure search, but my gut feeling is that most of them would still rely on SMARTS under-the-hood.
We may as well scrap the SMARTS proposal by leaving each implementation to sort out how they would like to do chemical substructure search, but my gut feeling is that most of them would still rely on SMARTS under-the-hood.
Right; if these fields are not interoperable - i.e., database A's filter_smarts
does not really mean the same thing as database B's - is there really a benefit in using a standardized field rather than _a_filter_smarts
and _b_filter_smarts
?
Right; if these fields are not interoperable - i.e., database A's
filter_smarts
does not really mean the same thing as database B's - is there really a benefit in using a standardized field rather than_a_filter_smarts
and_b_filter_smarts
?
I am fine with not having OPTIMADE standardization for chemical (sub)structure search. I believe it was @JPBergsma who wanted such queries standardized.
@JPBergsma @d-beltran
I think you are the only ones active in the OPTIMADE repo that I suspect may have some experience with SMARTS? This is a short PR, perhaps you can look it over and just confirm if what is written makes sense to you?
@merkys
If we cannot find two people who are invested in getting this merged, it seems highly unlikely that this will be implemented any time soon. If so, I think we should close this PR for now to de-clutter the active PR list, and re-open it if someone shows up and asks for this feature (probably via #368).
If we cannot find two people who are invested in getting this merged, it seems highly unlikely that this will be implemented any time soon. If so, I think we should close this PR for now to de-clutter the active PR list, and re-open it if someone shows up and asks for this feature (probably via #368).
Tagging @vaitkus who may as well be interested. We come from the same database, thus it would be best to hear from others as well.
I am not particularly happy about closing prospective PRs though. Closed PRs come very close to being invisible - I guess no one visiting a repository looks at closed PRs. And open PRs give an impression of directions where a project is moving towards. Moreover, they invite for discussion. Thus I would vote for keeping all prospective PRs open. Maybe marking them with some tag (a la PR/requires-discussion
) is enough.
Sorry, I have no experience with SMARTS
I think I originally started the whole SMILES/SMARTS discussion after talking with someone from the Ocelot database. They have a database with crystal structures of organic molecules. For them being able top query the topology of the molecule, e.g., the SMILES string, is essential to be able to select the entries. I think we ended up with the idea of supporting SMARTS because we did not want to treat the SMILES field any different from a standard string. Because multiple valid SMILES strings can be generated for one molecule, a direct string comparison is not possible. So as a compromise, we decided to implement SMARTS, so databases would have a method to implement querying the SMILES string/molecular structure.
I have not worked with SMARTS queries myself. As we use an external standard, I however do not think that that is a problem. In the end, the important part of this PR is how we treat custom filters.
As I may have mentioned before, it is no longer possible to use one query for all databases with this PR, as a database that has not implemented a filter will return an error. This used to be one of the selling points of the OPTIMADE query language. I am therefore wondering whether we could use a virtual field, e.g., a field that can be used for querying like any normal OPTIMADE field but is not present in the returned JSON file. For example:
:query-url:http://example.org/optimade/v1/structures?filter=_filter_smarts=CaaO
would return all entries that contain a carbon atom that is attached to an aromatic atom, which in turn is bound to an aromatic atom bound to an oxygen atom.
This would maintain the full querying flexibility we have now, and if we apply a prefix it should be backwards compatible with existing databases, who would ignore it like any other unknown field with a known prefix.
@JPBergsma
Thanks for the clarification! So, are you (or someone else?) willing to seek out the Ocelot database and ask them to look at this PR? I mean, I understand it isn't exactly trivial to review a PR of a project you are not actively contributing to, but I think we should not merge a feature without some form of direct input from someone with experience of and/or stake in said feature.
As I may have mentioned before, it is no longer possible to use one query for all databases with this PR, as a database that has not implemented a filter will return an error. This used to be one of the selling points of the OPTIMADE query language.
One will never be able to do a SMARTS query on a database built on a backend with no support for it, so to allow this as an optional, but standardized, extension is only positive for interoperability. It is better that databases that support SMARTS do it in one unified way rather than inventing their own non-standard extensions. This is the design idea we've followed for many other optional features.
I am therefore wondering whether we could use a virtual field, e.g., a field that can be used for querying like any normal OPTIMADE field but is not present in the returned JSON file. For example:
:query-url:http://example.org/optimade/v1/structures?filter=_filter_smarts=CaaO
Maybe I'm missing something, but I don't see how this increases interoperability? What is a database that have no means of evaluating _filter_smarts going to do with this query?