OPTIMADE LIKE operators

It has been suggested (Issue #42, PR #69) to have an operator similar to SQL "LIKE" in the query language. There are reservations about implementing the 'LIKE "%string%"' construct, as it is geared too much towards a particular DB implementation, while 'LIKE "string"' would mean inventing something superfluously similar to the SQL LIKE but different enough to be confusing. So what about the following syntax:

some_property MATCHES "*C*"

and interpreting it as, say, a shell GLOB?

The idea is to have a query expression that is simpler and more widely supported in database engines than full-featured RegExps, but still useful for general substring searches.
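As a rough illustration of the glob reading, here is a minimal Python sketch that evaluates a MATCHES-style test as a post-filter with the standard-library fnmatch module. The function name matches_glob and the sample values are made up for this example; none of this is part of any agreed syntax.

```python
import fnmatch


def matches_glob(value: str, pattern: str) -> bool:
    """Return True if value matches the shell-style glob pattern.

    fnmatchcase() compares case-sensitively on every platform, so "*C*"
    does not match a lowercase "c". Besides "*", it also understands
    "?" and "[seq]".
    """
    return fnmatch.fnmatchcase(value, pattern)


# Hypothetical property values, used only for this sketch.
formulas = ["NaCl", "H2O", "CO2", "SiO2"]

# Equivalent of: some_property MATCHES "*C*"
hits = [f for f in formulas if matches_glob(f, "*C*")]
print(hits)
```

A back-end with native glob support could hand the pattern over directly; the post-filter above is just the fallback that works anywhere.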

Jun 12 '19 09:06 sauliusg

It seems from the discussion in issue #42 that we have the following possibilities for querying substrings:

STARTS WITH/ENDS WITH/CONTAINS;
LIKE with SQL-like semantics;
REGEXP of different kinds (PCRE, ERE, with or without Unicode support).

Each has more expressive power than the previous one, but may also be increasingly difficult to implement on back-ends without native support (for instance, MySQL at some stage required a plug-in to support regexp searches).

I suggest defining Filter language syntax and the following compatibility requirements:

STARTS WITH/ENDS WITH/CONTAINS – required (as it is now);
LIKE/UNLIKE with the SQL-like semantics – optional;
MATCH GLOB, MATCH PCRE, MATCH ERE (e.g. _cod_chemical_name MATCH PCRE "µ.*[Oo]cta[^\s]+[23]") – optional.

In this way, all backends will provide the same basic STARTS/ENDS/CONTAINS functionality (which is easy to implement using SQL LIKE, regexps or even post-filtering), but we will also be able to introduce efficient extensions in a compatible way.
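To make the "easy to implement" claim concrete, here is a minimal Python sketch of the three fallback routes for the required operators: a SQL LIKE pattern, a regexp, and plain post-filtering. The helper names (escape_like, to_like, to_regex, post_filter) are hypothetical, and the LIKE patterns assume the back-end accepts an explicit ESCAPE '\' clause.

```python
import re


def escape_like(text: str, escape: str = "\\") -> str:
    """Escape LIKE wildcards so that the operand is matched literally."""
    # The escape character itself must be handled first.
    for ch in (escape, "%", "_"):
        text = text.replace(ch, escape + ch)
    return text


def to_like(op: str, operand: str) -> str:
    """Build a LIKE pattern (to be used with ESCAPE '\\') for the operator."""
    lit = escape_like(operand)
    return {
        "STARTS WITH": lit + "%",
        "ENDS WITH": "%" + lit,
        "CONTAINS": "%" + lit + "%",
    }[op]


def to_regex(op: str, operand: str) -> str:
    """Build a regexp, meant for an unanchored search, for the operator."""
    lit = re.escape(operand)
    return {
        "STARTS WITH": "^" + lit,
        "ENDS WITH": lit + r"\Z",
        "CONTAINS": lit,
    }[op]


def post_filter(op: str, operand: str, value: str) -> bool:
    """Evaluate the operator in application code, without database support."""
    if op == "STARTS WITH":
        return value.startswith(operand)
    if op == "ENDS WITH":
        return value.endswith(operand)
    if op == "CONTAINS":
        return operand in value
    raise ValueError(f"unknown operator: {op}")
```

For example, to_like("CONTAINS", "50%") yields a pattern in which the operand's own "%" is escaped, so it cannot act as a wildcard. Note that case sensitivity of LIKE differs between database engines, so the three routes agree only if that is settled separately.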

What do others think?

Jul 12 '19 13:07 sauliusg

This issue was brought up by @ml-evs in today's Web meeting. @rartino asked the providers to express their opinions about it.

Personally, I like the proposal to support different string matching languages (SQL LIKE, GLOB, PCRE etc). However, I think that here we again encounter the same dilemma: by allowing diversity among providers, we transfer the burden onto the clients. I can easily imagine every provider preferring a different string matching language, and supporting only that one.

SQL LIKE seems least demanding and, as such, least powerful. Maybe we can start from that and move on to the next ones?

By the way, I would rename the LIKE operator to MATCHES SQLLIKE (not sure if _ in an operator name would work), just to keep the grammar simpler. UNLIKE is the same as NOT LIKE, right?
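For providers without native LIKE, a MATCHES SQLLIKE test could probably be emulated by translating the pattern to a regexp. Below is a minimal Python sketch under these assumptions: "%" matches any run of characters, "_" matches exactly one, backslash is the escape character, matching is case-sensitive, and UNLIKE is the plain negation of LIKE (SQL NULL handling is ignored). The function names are hypothetical.

```python
import re


def like_to_regex(pattern: str, escape: str = "\\") -> re.Pattern:
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            # Escaped character: take the next one literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            out.append(".*")  # any run of characters, including none
        elif ch == "_":
            out.append(".")  # exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    # DOTALL lets the wildcards span newlines as well.
    return re.compile("".join(out), re.DOTALL)


def sql_like(value: str, pattern: str) -> bool:
    """LIKE must match the whole value, hence fullmatch()."""
    return like_to_regex(pattern).fullmatch(value) is not None


def sql_unlike(value: str, pattern: str) -> bool:
    """Assumed here to be the negation of LIKE."""
    return not sql_like(value, pattern)
```

With this reading, sql_like("NaCl", "%C%") holds while sql_like("NaCl", "C%") does not, since LIKE is anchored at both ends of the value.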

Jan 21 '22 17:01 merkys