Like operator
Offering an implementation of the "LIKE", "NOT LIKE" and "UNLIKE" string matching operators in the Filter grammar, and briefly describing their semantics.
Alternatively, offering an implementation of a "MATCH /regexp/" operator.
They say that this is just because they first C-unescape the string, e.g. to accept `\n` as a newline. We don't, so we should just say `\\` to escape a backslash.
Whoever converts the string to MySQL must then first C-escape whatever it obtained from the query string.
Maybe it's still good to point this out to implementers though, as I'm sure this will be a common error (but anyway, if you don't want SQL injection you should encode your string, if the user passes a quote in its string...).
Is the hope to get this in for v0.10? Because, after having mulled over this a bit since the meeting last Friday, I think this is something we may want to discuss a bit more before merging.
(To me, the point of the several rushed PRs this week is to make sure the spec is in a sane state for a v0.10 release, i.e., internally consistent and preferably without rough, completely backwards-incompatible changes just around the corner --- not so much to get new features in last-minute. I'd be happy to see a v0.11 release with this feature shortly after v0.10.)
Now, about the content of this PR:
OPTiMaDe filters are designed to be user-friendly for our target audience of 'client' end users, right? Hence, even as someone who knows SQL quite well, I'm not super stoked about building the SQL convention into OPTiMaDe:
- This choice has a very non-backend-agnostic "feel". Won't this antagonize those without SQL backends? "So, to implement OPTiMaDe we have to translate wildcard expressions from the SQL convention to our backend?"
- I expect most client users of OPTiMaDe to not know SQL and the conventions of its LIKE operator.
- I expect those users to find SQL's `%` and `_` unexpected for a wildcard mini-language. Just Google "wildcard" and disregard the hits about SQL; you'll find a wide array of descriptions of built-in functions across various software that very universally use `*` for multi-character wildcards, and often `?` for single character matches. Two notable examples:
  - Shell glob patterns in Unix shells, Windows PowerShell, and even MS-DOS.
  - Microsoft Excel, Access, etc., as well as all software that mimics the same functions (Google Sheets, LibreOffice) (though LibreOffice apparently also supports regex syntax).
  I suspect users are likely to have come across one of these.
- Maybe worst: following from the point @giovannipizzi brought up, I don't think there is a single escape syntax standard for SQL LIKE valid across SQL backends, so this is not a great standard anyway... And we must make sure escaping in our LIKE does not interfere with string escaping in our string tokens. Reading the specification, we cannot say that `identifier LIKE "\%"` is a search for a literal percent sign, because the right hand side is not a valid string token. A valid string token can only contain `\` followed by a `"` or a `\`. So, for us the right hand side would have to be `"\\%"`, and how do we then encode a literal backslash + a wildcard `%`?
So, my suggestion is for us to skip SQL `LIKE` entirely, and give the operator a different name, e.g., `PATTERN`, with the conventions:
- `*` for any number of arbitrary characters (including zero).
- `?` for exactly one arbitrary character.
- `**` for a literal `*`.
- `*?` for a literal `?` (whereas `??` indicates two arbitrary characters, and `?*` is the standard way to indicate an arbitrary character sequence of at least one character).
- The use of `\` remains as for any string token: `\\` for literal `\` and `\"` for literal `"`.
This seems somewhat user-friendly, and would be trivial to search-replace into a corresponding SQL LIKE operator for us with SQL backends. (But I'd be ok with any other sensible quoting scheme as well.)
I see @rartino's point and I tend to agree that, if there is no agreement, it's better not to merge this in 0.10 yet.
And I admit, I also think `*` and `?` are more intuitive (and should be easy to replace with SQL syntax in the backend).
A note on the choice of escaping: I think this is not yet "perfect":
- `**` as a literal `*`: would `***` then mean [a] "a literal `*` followed by any sequence of characters" or [b] "any sequence of characters followed by a literal `*`"? It might seem a little detail, but for instance the string `ab*cd` would match the pattern `ab***` in interpretation [a] but not in interpretation [b].
- A similar concern holds for `???`.
- For `??` there is also an additional concern: how can we then specify that we want exactly two characters?

I think the only solution is to have a third character for escaping. I see that using backslash is counterintuitive because one has to escape the string again (indeed, to encode `\\` one would have to write `\\\\`, e.g. in JSON), and multiple escaping is the hardest thing ever to get right ;-)
@giovannipizzi is indeed correct about the quoting scheme not yet being perfect.
(For anyone confused by the discussion about `??` and `???`: these refer to the first version of my comment; shortly after posting it I noticed the problem with `??` myself and re-edited it to use `*?`, thinking no one would notice, sorry. But even with those edits, `***` remains broken.)
I have a different suggestion for quoting now: going forward, even with other functionality, I think it may be very useful if we can extend our string tokens to represent "literal characters" alongside "normal characters" using a single backslash.
It would require a change in the grammar, so that `\<character>` is allowed for any character in a string token, marking that character as "literal". (Right now the grammar forbids `<character>` from being anything but a `"` or a `\`.)
We then need some carefully formulated text in the specification, where we say that for all regular uses of a string token, a literal character `\<character>` means the same thing as `<character>`. However, in situations where some special meaning has been assigned to characters, a literal character refers to that character without that special meaning.
Then in a `PATTERN`, `?` and `*` are assigned the special meaning specified above. It then follows from the paragraph above that `\?` and `\*` are used to represent the literal `?` and `*`. I think this would be elegant.
Note: this solution allows marking any character in an OPTiMaDe string as "literal", with the following exception: the characters `"` and `\` are always literal; one cannot indicate non-literal versions of these characters. I think that is an OK trade-off. Practically, this could mean limitations on what syntax we can allow for other mini-languages embedded in OPTiMaDe strings, e.g., if we want to adopt a standard language for time formatting that uses `"` and `\` with their own special meanings, etc.
We might want to think forward to make sure this allows embedding a reasonable regex format. I think it does.
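To make the "literal characters" idea concrete, here is a minimal decoding sketch (an illustration only, not part of the proposal; the function name and the (character, is_literal) representation are invented):

```python
def decode_string_token(contents):
    """Decode the contents of a quoted OPTiMaDe string token into
    (character, is_literal) pairs, under the proposed rule that a
    backslash marks the following character as "literal". The grammar
    would guarantee that a backslash is always followed by a character."""
    decoded = []
    i = 0
    while i < len(contents):
        if contents[i] == "\\":
            decoded.append((contents[i + 1], True))   # \x -> literal x
            i += 2
        else:
            decoded.append((contents[i], False))      # ordinary character
            i += 1
    return decoded

# In a PATTERN, '*' and '?' keep their wildcard meaning only when not literal:
# decode_string_token(r"a\*b*") == [('a', False), ('*', True), ('b', False), ('*', False)]
```

A consumer using the token as a plain string would simply drop the flag; a `PATTERN` consumer would treat non-literal `*` and `?` as wildcards.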
@rartino wrote:
So, my suggestion is for us to skip SQL LIKE entirely, and give the operator a different name, e.g., PATTERN with the conventions:
- `*` for any number of arbitrary characters (including zero).
- `?` for exactly one arbitrary character.
- `**` for a literal `*`.
- `*?` for a literal `?` (whereas `??` indicates two arbitrary characters, and `?*` is the standard way to indicate an arbitrary character sequence of at least one character).
- The use of `\` remains as for any string token: `\\` for literal `\` and `\"` for literal `"`.
I thought about this in the beginning (issue https://github.com/Materials-Consortia/OPTiMaDe/issues/87), and then I realised that this is a very bad idea. Essentially we put on our shoulders a burden of developing yet another wildcard language, with the explicit goal of not being similar to existing ones, and with no apparent gains. This is a lot of work; as @giovannipizzi has just pointed out, it is very easy to get wrong; it gives no clear benefit, and it adds an extra burden on every implementer (because now everyone will have to implement, test and support this notation) and on every OPTiMaDe user (since everyone will have to learn a new wildcard language). I think this is very inefficient and unnecessary, since we will hardly give the new wildcard syntax any essential features that one does not have in SQL patterns, globs or regexps.
And then, what is the reason for not being similar to SQL, or any other pattern/regexp standard? One might object to it if it were a MUST feature and one had to re-implement it in an SQL-unfriendly backend; but since it is supposed to be an optional feature, those backends that find it difficult to implement can simply skip it, and offer their versions of 'MATCH REGEXP' instead. I think sticking to a known standard, if that standard is well documented, has clear benefits: a) one can find working and tested implementations; b) the documentation is out there; c) many people will already know the syntax; d) those who learn it anew will be able to use it in other contexts too (and thus will have more incentive to learn it).
I therefore suggest the following strategy:
- Let's settle on several widespread standards with accessible implementations. I suggest picking SQL LIKE patterns, ERE regular expressions, and PCRE regular expressions. That covers a large area of useful cases.
- All such matches will be optional. Thus, an implementer can choose to implement those matching operators that are easy to match to their respective backend, and skip all the rest. Nobody should be put at a disadvantage by this.
- Let's keep possible extensions of the patterns in mind. This is the reason I suggested (in https://github.com/Materials-Consortia/OPTiMaDe/issues/87) having the regexp type specified in the match expression (e.g. `MATCH PCRE`); this will allow adding new flavours of regexps should such extensions be needed in the future (I personally doubt that it will ever be needed, but you never know what new "revolutionary" DBs will be rolled out with incompatible and non-standard interfaces ;). One possible backend which we have not discussed much up to now is SPARQL; I'm not sure if it uses its own match expressions or some standard ones. Still, it fits into the proposed scheme, either with 'MATCH ERE' if SPARQL supports ERE, or with 'MATCH SPARQL' if SPARQL has (or will have) its own matching syntax and we decide to support it in the future.
- OPTiMaDe in itself is neutral as to the implementation of the matches, but we need to specify what the semantics of the matches will be in case they are implemented. Thus I think we can do it without extending the existing Filter grammar.

In such a setting, which is very simple and is actually implied in this PR, I think we could make it even into v0.10 :).
Regards, Saulius
@rartino wrote:
It would require a change in the grammar, so `\<character>` is allowed for any character in a string token, marking that character as "literal". (Right now it is forbidden by the grammar for `<character>` to be anything else but a `"` and `\`.)
It could be done, but at the moment I think it is totally unnecessary. The Filter language itself does not define or handle regular expressions, just strings; strings can encode arbitrary character sequences, including a backslash with any character; we just need to escape the backslash as a double backslash. Thus, an ERE matching an arbitrary number of stars after the `a` character would be `a\**`, and embedded into a Filter string it becomes `"a\\**"`. The Filter language quotes and backslashes are one encapsulation level higher than those of regexps. The situation is similar to Java regular expressions.
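A quick illustration of the two encapsulation levels (my own example, using Python's `re` in place of the backend RE engine; a real decoder would scan the token left to right rather than use a naive replacement):

```python
import re

# The characters between the quotes of the Filter string token "a\\**":
token_contents = "a\\\\**"                  # i.e. the five characters  a \ \ * *

# Level 1: Filter-string decoding maps the escape \\ to a single \ .
ere = token_contents.replace("\\\\", "\\")  # now the four characters  a \ * *

# Level 2: the regex engine sees a\** : an 'a' followed by zero or more
# literal stars (\* is a literal star, the trailing * quantifies it).
assert re.fullmatch(ere, "a")               # zero stars
assert re.fullmatch(ere, "a***")            # three stars
assert not re.fullmatch(ere, "ab")
```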
@rartino
* Reading the specification we cannot say that `identifier LIKE "\%"` is a search for a literal percent sign, because the right hand side is not a valid string token. A valid string token can only contain `\` followed by a `"` or a `\`.
True.
So, for us the right hand side would have to be `"\\%"`,
True, this is a logical consequence of the current grammar. Java does the same with regexps in strings, awk does it the same way, and I suspect Python has no option but to behave the same.
and how do we then encode a literal backslash + a wildcard `%`?
`a LIKE "\\\\%"`
Not the nicest looking, but usable (and in fact used on many occasions).
I still need to double-check how SQL escapes `%`.
* This choice has a very non-backend-agnostic "feel". Won't this antagonize those without SQL backends?
Why should it? As said, if people find it difficult to implement, they just skip it (the feature is supposed to be optional). Unless they feel bitterly envious that SQL has better pattern matching than their beloved super-modern backend, there should be no problems :P. And then again, I suggest also introducing an (again optional) match operator syntax that would be easy to implement in alternative, non-SQL backends.
- I expect most client users of OPTiMaDe to not know SQL and the conventions of its LIKE operator.
Why? At least currently, most of the OPTiMaDe consortium members do know at least something about SQL :)
- I expect those users to find SQL's `%` and `_` unexpected for a wildcard mini-language.
One can always learn :). One just searches for "SQL LIKE", and in 5 minutes one has the answer. Also, we will include a short description and a couple of examples in the OPTiMaDe spec (I'll update my PR with these features).
Just Google "wildcard" and disregard the hits about SQL;
Ehhh... why should we disregard SQL? Actually, DuckDuckGo on the "wildcard" search gives SQL on the first page, slightly below Wiktionary's description, and the top-most link in my current search (https://www.computerhope.com/jargon/w/wildcard.htm) mentions the SQL convention ('%') as well, along with the asterisk convention.
you'll find a wide array of descriptions of built-in functions across various software that very universally use `*` for multi-character wildcards, and often `?` for single character matches. Two notable examples:
- Shell glob patterns in Unix shells, Windows PowerShell, and even MS-DOS.
That is why I suggested (https://github.com/Materials-Consortia/OPTiMaDe/issues/87) also describing a `MATCH GLOB` operator, with shell glob semantics; I can include it in the PR; I just had the impression that we decided to have only LIKE and no more advanced operators when we skyped on Friday.
Should we also include `MATCH GLOB` in the syntax?
- Microsoft Excel, Access, etc., as well as all software that mimics the same functions (Google Sheets, LibreOffice) (though LibreOffice apparently also supports regex syntax).
Again, this would be an argument to include optional `MATCH ERE` and `MATCH PCRE` operators.
What do MongoDB and/or CouchDB actually support? These would probably be the most likely backends that some of us actually use.
I'll get to the individual points, but I see an important philosophical difference here, so let's start with sorting that out:
- let's settle on several widespread standards with accessible implementations. I suggest picking SQL LIKE patterns, ERE regular expressions, and PCRE regular expressions. It covers a large area of useful cases.
- All such matches will be optional. Thus, the implementer can choose to implement those matching operators that are easy to match to their respective backend, and skip all the rest. Nobody should be put at disadvantage by this.
When we agreed on the need for optional features in the filtering language, you were very concerned about the fragmentation in support for filters that this would lead to. My argument was that optional features actually help avoid fragmentation, because instead of every provider inventing their own extension for feature X, we now say in the standard: "you don't have to support feature X, but if you do, this is how you do it." That way, all providers that support feature X do it the same way, and queries are at least universal across all databases that support that feature.
However, to get this effect, we need to be conservative with adding several optional features that are overlapping, i.e. exactly what you propose to do here. If we do, we get back to the same fragmentation we were trying to avoid. Now all databases will support only their own favorite brand of partial text matching operators, and partial text matching queries are not going to be universal even across databases that support partial text matching. Clients will constantly have to rewrite their queries to fit the text matching capabilities of each database. This, to me, is too high a cost for everyone to get their favorite syntax in.
So, I'm very much against supporting e.g. multiple regex flavors, and rather want us to standardize on a flavor that is translatable into the specific ones used by the backends. I realize I put more work on implementors here, but this is necessary to achieve the main goal of a universal API: to provide a common interface for access.
Edit: While the discussion in this comment is formulated in terms of regexes, it very much applies to the suggestion to provide both an SQL LIKE operator and a GLOB operator (and perhaps also other pattern operators). This would mean end users of OPTiMaDe have to learn all the pattern languages to use OPTiMaDe against multiple databases. That does not scale well.
On the specific points:
Essentially we put on our shoulders a burden of developing yet another wildcard language, with the explicit goal of not being similar to existing ones, and with no apparent gains. This is a lot of work; as @giovannipizzi has just pointed out, it is very easy to get wrong;
I'm with you if we are talking about full regex support, I don't think we want to invent a new flavor of regexes. But, if we talk about patterns with character wildcards, that is trivial apart from how to do escaping. And, unless I am mistaken, there is no SQL standard for that escaping. If we pick one, e.g. MySQL, we still have to do the same amount of work to make sure we embed it in a way consistent with our string quoting mechanism. This isn't easier than starting from scratch; and for all intents and purposes, we are creating a new pattern language in the eyes of users, because the actual strings they need to input are different in, e.g., how many backslashes are needed.
In fact, what is likely to happen with the SQL approach is that users wondering how to represent a literal % will find a Stack Overflow question about, e.g., PostgreSQL, try that syntax with whatever number of backslashes they use, and then be confused over the exact same string behaving differently in OPTiMaDe.
Filter language itself does not define or handle regular expressions, just strings; strings can encode arbitrary character sequences, including backslash with any characters; we just need to escape the backslash as a double backslash.
Right, but the point of much of the design we do is for the filter language to be user-friendly. I know from a great deal of experience that no one appreciates the "backslash hell" where several quoting mechanisms are stacked to yield unreadable expressions. If we can support single-backslash quoting, is it not nice to do so?
Ehhh... why should we disregard SQL?
My point here was that it is only SQL that uses `%` and `_`. There are far more pattern mini-languages using `*` and `?`. Hence, for users not familiar with SQL LIKE, this is going to be a surprise.
At least currently, most of the OPTiMaDe consrtium members do know about at least something SQL :)
We are all database providers. Who are you designing OPTiMaDe for? I have in mind the PhD students at our department who do various projects that would benefit from easy aggregation of data across databases. I think few of them know SQL.
I'm with you if we are talking about full regex support,
ACK. And so one flavour of matches should include RE support; but we do not make it mandatory, and we do not want to exclude the use of simpler matching mechanisms, do we? The proposed solution, with 'LIKE', 'MATCH ERE' and 'MATCH PCRE', would permit standard implementation of various levels of patterns.
I don't think we want to invent a new flavor of regexes.
I agree, we don't :)
But, if we talk about patterns with character wildcards, that is trivial apart from how to do escaping.
I don't find it trivial; it's not just a regexp search/replace, since we will need to deal with escaped characters and character ranges within `[]` brackets. I feel that a client will have to do full pattern parsing and then map it onto the backend's patterns/regexps if we want to have a specific OPTiMaDe pattern syntax.
And, unless I am mistaken, there is no SQL standard for that escaping.
I think there is; I would be very much surprised if there were not.
I do not have the latest SQL standard at hand (costs $180 with ISO :/), but open sources on the Web (https://modern-sql.com/standard, http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt) specify that the syntax is `column LIKE 'pattern%'` or `column LIKE '30\%pattern%' ESCAPE '\'`, where ESCAPE allows one to specify any escape character one wishes. MySQL defaults ESCAPE to '\'; other SQL descriptions do not mention any default, so the default backslash escape is probably a MySQL-specific extension. That does not cause difficulties if we specify backslash as the escape character for OPTiMaDe, since all SQL databases should support the standard ESCAPE clause.
If we pick one, e.g. MySQL, we still have to do the same amount of work to make sure we embed it in a way consistent with our string quoting mechanism.
As said, the SQL `ESCAPE` mechanism should allow specifying the escape character for any SQL engine.
This isn't easier than starting from scratch; and for all intents and purposes,
I would say it is much easier than starting from scratch; we just take a (subset of a) standard implementation.
we are creating a new pattern language in the eyes of users,
Not if we take existing standard(s) and reference them.
because the actual strings they need to input are different in, e.g., how many backslashes are needed.
We should not confuse defining the meaning of patterns with defining how patterns are encoded into strings. The current OPTiMaDe string definition allows encoding any character sequence (using escapes), so we have no problem here. If we think that too many backslashes in escaped strings are awkward to read (I agree with this), then we can think about how to improve OPTiMaDe string escaping; this is an issue independent of match-pattern semantics, IMHO.
In fact, what is likely to happen with the SQL approach is that users wondering how to represent a literal % will find a Stack Overflow question about, e.g., PostgreSQL, try that syntax with whatever number of backslashes they use, and then be confused over the exact same string behaving differently in OPTiMaDe.
I don't find this argument convincing. If people take PostgreSQL-specific (non-standard) rules and try to apply them to something else, be it OPTiMaDe, MySQL, MongoDB or whatnot, of course they will get nonsense, and here you cannot help them in any way – except maybe by writing decent OPTiMaDe documentation (that specifies how OPTiMaDe patterns work) and advising them to use '-stackoverflow' in their Google searches :P.
My point here was that it is only SQL that uses `%` and `_`. There are far more pattern mini-languages using `*` and `?`. Hence, for users not familiar with SQL LIKE, this is going to be a surprise.
SQL still seems to be the most widely used database query language. Its patterns, as I see now after some days of searching the Web, are very well described and turn up in all searches. Disregarding this as "only SQL" does not seem wise to me. Nobody will be surprised by SQL pattern languages if they do a short search on the Web.
Yes, there are essentially two match syntaxes – SQL LIKE and PCRE. All the rest are variations and subsets of these two. SQL seems to cover a fairly large chunk of DB implementations; probably SQL `LIKE` is the most frequently supported backend operator if we count by number of installations, no?
The other is ERE/PCRE. Also fairly popular. That's why I suggest having the possibility to use either or both of them in OPTiMaDe queries.
and advising them to use '-stackoverflow' in their Google searches :P.
This won't help. The Web is full of other how-tos. If we refer to some other standard in our specification, then we have to follow it as closely as possible. It's the "SQL" in "SQLish" that may get the user searching the Web and trying the first hits of SQL syntax how-tos.
When we agreed on the need for optional features in the filtering language, you were very concerned about the fragmentation in support for filters that this would lead to. My argument was that optional features actually helps avoid fragmentation, because instead of every provider inventing their own extension for feature X, we now say in the standard that "you don't have to support feature X, but if you do, this is how you do it." That way, all providers that support feature X do it the same way, and queries are at least universal across all databases that support that feature.
I agree, and this argument convinced me that optional features are not bad. They will be even better if supplemented by machine-readable descriptions of "feature sets" (see https://github.com/Materials-Consortia/OPTiMaDe/issues/91, second suggestion).
However, to get this effect, we need to be conservative with adding several optional features that are overlapping,
I agree if the features are completely overlapping, but here we have different features with different trade-offs between search power and implementation complexity.
i.e. exactly what you propose to do here.
This is not true. I propose three well-defined optional features, so that, as you say, "if you do [them], this is how you do it". Implementers can then choose which ones to implement.
If we do, we get back to the same fragmentation we were trying to avoid.
Not really.
Now all databases will support only their own favorite brand of partial text matching operators,
NO! If compliant to OPTiMaDe, they will support the ones described in the spec.
and partial text matching queries are not going to be universal even across databases that support partial text matching. Clients will constantly have to rewrite their queries to fit the text matching capabilities of each database.
With optional features you will have to adapt your queries to different implementations; there is no way around it as long as you introduce options. This could only be made somewhat easier if each OPTiMaDe endpoint advertises, in machine-readable form, which optional features it supports (the above-mentioned "feature sets").
And then, you do not need to rewrite constantly; there will be no constant changes once the specs are out, will there?
This, to me, is too high a cost for everyone to get their favorite syntax in.
It's not about favourite syntax, it's about easier implementation. We cannot tell all implementers: "look, we have invented our own matching semantics, now go and implement it in your backend".
So, I'm very much against supporting e.g. multiple regex flavors, and rather want us to standardize on a flavor that is translatable into the specific ones used by the backends.
At the moment I think that "translatable into the specific ones used by the backends" is too difficult to be reasonably implementable.
I realize I put more work on implementors here, but this is necessary to achieve the main goal of a universal api: to provide a common interface for access.
I think the burden you suggest for implementers is unacceptably high. In that case, we are probably better off with just what is in the specs right now, STARTS/ENDS/CONTAINS. Regexp filtering can then be done on the client side if needed...
and advising them to use '-stackoverflow' in their Google searches :P.
This won't help. The Web is full of other how-tos. If we refer to some other standard in our specification, then we have to follow it as closely as possible. It's the "SQL" in "SQLish" that may get the user searching the Web and trying the first hits of SQL syntax how-tos.
This would be a good reason to make `LIKE` behaviour as close to SQL's as possible; this is my main point.
@rartino:
If we do, we get back to the same fragmentation we were trying to avoid.
One more thought on avoiding fragmentation.
Let's consider the following scenario. I want to send a query to find entries that match the RE 'Name.*Surname' to multiple OPTiMaDe databases. I want to use pattern matching, but I do not know which databases implement which features.
I could compose the following Filter expression: `(author LIKE "%Name%Surname%" OR author MATCH GLOB "*[Nn]ame*[Ss]urname*" OR author MATCH REGEXP "[Nn]ame.*[Ss]urname")`. Let's assume that in an OR expression, unimplemented operators return FALSE, and implemented ones the expected values. Then, if a database implements at least one of the queries, I will get the expected answer; and I could send this query to all databases.
The only problem is when an OPTiMaDe endpoint does not implement any of these operators; then I would get no data at all instead of getting the right answer.
Another approach would be, as @rartino suggests, to have an `author MATCH "*[Nn]ame*[Ss]urname*"` operator, with wildcards specified by OPTiMaDe, which each API implementation will have to translate into the appropriate backend query.
The difference between the two approaches is: in the first case, we put the translation burden on the user; in the second case, we put the translation burden on the API implementers. Arguably, it is easier to compose a query in different languages, since you know what you are looking for, than to parse a formal definition and translate it in all cases to another language. So, assuming the qualifications of users and implementers are the same, composing multiple queries is easier than translating. It is only if we target very inexperienced users that composing queries becomes an issue.
I could compose the following Filter expression: `(author LIKE "%Name%Surname%" OR author MATCH GLOB "*[Nn]ame*[Ss]urname*" OR author MATCH REGEXP "[Nn]ame.*[Ss]urname")`. Let's assume that in an OR expression, unimplemented operators return FALSE, and implemented ones the expected values. Then, if a database implements at least one of the queries, I will get the expected answer; and I could send this query to all databases.
So, in effect, you propose that the syntax in OPTiMaDe for partial string matching is to give manually translated versions of your pattern to cover ALL relevant flavors of pattern languages? You don't think this is overly cumbersome to use, even for experienced users? To me, your example shows precisely why supporting all these overlapping options as individually optional is a bad idea.
The translator approach is not as bad as you suggest. If we are talking about patterns (and not regexes), it is going to be a more or less straightforward string transform. And as soon as someone implements it, the code can be converted to other languages and distributed as support libraries. Whereas if we put all of the burden on the user as you propose, you can bet every second query is going to have one of the redundant patterns translated in error, and users will act on incomplete search results without knowing it.
However, to get this effect, we need to be conservative with adding several optional features that are overlapping,
I agree if the features are completely overlapping, [...]
If you agree with that statement, can we perhaps discuss a "ladder" of complexity levels? This is my view:
1. Basic operations: STARTS WITH, ENDS WITH, CONTAINS: required
2. Basic patterns with an arbitrary-character-string wildcard (e.g., `*` or `%`) and an arbitrary-single-character wildcard (e.g., `?` or `_`): optional
3. Full regex matching support: optional

What I am trying to say is that it would be optimal to provide only ONE facility for text matching at each level, e.g., SQL LIKE or GLOB-like or something else for level 2. But I agree that level 3 is trickier. Maybe we can postpone the discussion of regex flavors for a while, because I'd like to think more about the question of "translating" in this context. And this PR is about level 2.
With reference to this ladder of complexity, I propose that if an API implementer supports a higher level, they MUST also support all lower levels. (This should be ok, as the lower levels are translatable into higher level expressions.) This means that we can recommend a user to use the lowest level at which a query is expressible, and it will then be handled by the highest possible number of databases.
Now, specifically about whether the SQL LIKE standard is a good idea for OPTiMaDe level 2:
if we talk about patterns with character wildcards, that is trivial apart from how to do escaping.
I don't find it trivial; it's not just regexp search/replace, since we will need to deal with escaped characters and character ranges within [] brackets. I feel that a client will have to do a full pattern parsing, and then map it into the backend's patterns/regexps if we want to have specific OPTiMaDe pattern syntax.
Why are you talking about character ranges? I'm talking about LIKE-equivalent (level 2) patterns here as you have described them in your PR, so one arbitrary-length wildcard and one single-character wildcard + escaping rules for those. And the mapping I am talking about is then from our "own" syntax (`?` and `*`) into the LIKE syntax you want (`%` and `_`), not into the backend's regex format. You are certainly allowed to implement these patterns via SQL LIKE, and those who don't will have an equally hard time from this point on as with an SQL LIKE-based syntax.
I do not have the latest SQL standard at hand (costs $180 with ISO :/),
Isn't this a major problem in itself, that not even we have access to the standard you want to base the OPTiMaDe pattern operator on? Especially since we conclude that we need to firmly point users to it, given that there are a LOT of helpful sources on the Internet covering database-specific flavors that don't exactly implement the standard. (E.g., MS SQL Server supports `[` and `]` for ranges, whereas it seems few others do.)
I'm already running into this issue: for my response above, I tried to figure out whether standard SQL LIKE actually supports character ranges with `[` and `]`, but I found no place where I could confirm or deny it with certainty. At least in the 1992 draft you linked, they are not supported.
the syntax is `column LIKE 'pattern%'` or `column LIKE '30\%pattern%' ESCAPE '\'`, where ESCAPE allows one to specify any escape character one wishes. MySQL defaults ESCAPE to '\'; other SQL descriptions do not mention any default, so the default backslash escape is probably a MySQL-specific extension. That does not cause difficulties if we specify backslash as the escape character for OPTiMaDe, since all SQL databases should support the standard ESCAPE clause. As said, the SQL `ESCAPE` mechanism should allow specifying the escape character for any SQL engine.
After researching this, I'll accept that `LIKE` with `ESCAPE` behaves fairly standard across SQL backends, aside that some add their own pattern extensions... Without `ESCAPE`, not so much. In my reading of the 1992 draft, it seems the default should be no escape symbol at all. This is apparently what Oracle uses (according to a help page for v10.1 I found online). But then many other implementations (MySQL, PostgreSQL) take backslash as the default, which leads to confusing "help" webpages.
What is the concrete proposal for OPTiMaDe? That we also include the ESCAPE keyword? I.e., to search for a literal backslash + wildcard, one has these options for the filter: `identifier LIKE "\\\\%" ESCAPE "\\"` or, e.g., `identifier LIKE "#%" ESCAPE "#"`? What is our default escape character?
My end conclusion is that I suppose it wouldn't be a catastrophe to standardize on LIKE + ESCAPE, but I'm far from convinced it is the best choice, since:
1. There is no openly accessible standard to point users to.
2. Many actual SQL implementations do not follow the standard, which makes it confusing for end users who seek help (which is worsened by point 1).
3. For those who want to support LIKE without having an SQL backend, the construct with an arbitrary escape character adds extra work with not very much benefit.
4. End users with little experience in SQL will find `%` and `_` unusual as wildcards, and the syntax with LIKE + ESCAPE bulky.
And what is the gain compared to the proposed alternative?:
`identifier MATCHES <PATTERN>`
where in `<PATTERN>`:
- `*` means any number of arbitrary characters (including zero).
- `?` means exactly one arbitrary character.
- `\*` is a literal `*`.
- `\?` is a literal `?`.
- `\\` is a literal `\`.

which you can translate to an SQL LIKE query for your backend this way:
1. Make these substitutions in `<PATTERN>`:
   - `%` -> `\%`
   - `_` -> `\_`
   - `?` -> `_`
   - `*` -> `%`
2. Query your SQL backend with `identifier LIKE <PATTERN> ESCAPE "\"`
My end conclusion is that I suppose it wouldn't be a catastrophe to standardize on LIKE + ESCAPE, but I'm far from convinced it is the best choice, since:
I agree. I also think we both agree that ESCAPE would be too bulky and unnecessary for OPTiMaDe.
1. There is no openly accessible standard to point users to.
Well, not quite true: there is one openly accessible document (SQL-92) that describes the `LIKE` predicate, and from the documents that we discussed privately we can see that SQL-99 did not change that definition significantly.
2. Many actual SQL implementations do not follow the standard, which makes it confusing for end users who seek help (which is worsened by point 1)
True, but those implementations only extend the standard, so `column LIKE pattern ESCAPE '\'` should be portable across SQL implementations. Thus, it should be easy to implement `LIKE` with a backslash escape character if OPTiMaDe standardises that.
3. For those who want to support LIKE without having an SQL backend, the construct with arbitrary escape character adds extra work with not very much benefit.
I think we all agree that OPTiMaDe should specify one fixed escape character (or one fixed escape mechanism, such as character doubling).
Those who cannot easily implement `LIKE` in their backends can just skip it (it is an optional feature). If they in addition implement full RE matching (I understand MongoDB supports that), then they are well equipped even without LIKE :)
4. End users with little experience in SQL will find `%` and `_` unusual as wildcards, and the syntax with LIKE + ESCAPE bulky.
After all the research that we have done on the Web, I strongly disagree that "`%` and `_` [are] unusual as wildcards". Rather, I would say, `*` and `?` are limited to the shell glob community... :)
And what is the gain compared to the proposed alternative?: `identifier MATCHES <PATTERN>` where in `<PATTERN>`:
- `*` means any number of arbitrary characters (including zero).
- `?` means exactly one arbitrary character.
- `\*` is a literal `*`
- `\?` is a literal `?`
- `\\` is a literal `\`
So essentially we arrive at the same SQL `LIKE` convention, with the backslash ('\') as an escape character :). I'm not against that, since, in addition to being a reworded SQL LIKE, it is superficially similar to shell globs; but I want to point out that '*' and '?' have no clear advantage over '%' and '_'; as said, I strongly disagree that '%' is "unusual"; moreover, '?' is confusing since REs use '.' instead, and `?` means something slightly different there (I myself was confused by this in my young days ;).
which you can translate to an SQL LIKE query for your backend this way:
1. Make these substitutions in `<PATTERN>`: `%` -> `\%`, `_` -> `\_`, `?` -> `_`, `*` -> `%`
This is not as straightforward as you write, since `\*` is not translated to `\%`; it translates to `*`. And `\\*` would not translate to `\*`: what is meant is presumably 'backslash + any character sequence', which would be `\\%`, whereas `\\a` is 'backslash-a', which translates to `\\a` in LIKE...
In other words, the translation is doable but requires extra work to get it right. My original instinct is that I want to avoid that work if possible.
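For concreteness, an escape-aware translation along these lines might look as follows (a sketch only; the function name is invented, and it assumes the `MATCHES` conventions proposed above, with `*`/`?` wildcards and backslash escapes):

```python
def matches_to_like(pattern):
    r"""Translate a hypothetical OPTiMaDe MATCHES pattern into an SQL LIKE
    pattern intended for use with ESCAPE '\'. The pattern is scanned left
    to right so that escape sequences are resolved before wildcards are
    mapped; a blind search/replace would mistranslate e.g. \* (literal *)."""
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\":                    # escape sequence: \*, \? or \\
            nxt = pattern[i + 1]
            # A literal backslash must stay escaped in LIKE; literal * and ?
            # are not special in LIKE, so they pass through unescaped.
            out.append("\\\\" if nxt == "\\" else nxt)
            i += 2
        else:
            if c == "*":
                out.append("%")          # multi-character wildcard
            elif c == "?":
                out.append("_")          # single-character wildcard
            elif c in "%_":
                out.append("\\" + c)     # literal % and _ need escaping in LIKE
            else:
                out.append(c)
            i += 1
    return "".join(out)

# matches_to_like(r"100%\*") == r"100\%*"   (the %, and the escaped *, end up literal)
```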
If you agree with that statement, can we perhaps discuss a "ladder" of complexity levels? This is my view:
1. Basic operations: STARTS WITH, ENDS WITH, CONTAINS: required
2. Basic patterns with an arbitrary-character-string wildcard (e.g., `*` or `%`) and an arbitrary-single-character wildcard (e.g., `?` or `_`): optional
3. Full regex matching support: optional
Yes, I agree on this complexity ladder, this is what I meant. We can settle on this :)
After researching this, I'll accept that `LIKE` with `ESCAPE` behaves fairly standard across SQL backends,
I also find it so.
aside that some add their own pattern extensions...
I have the same impression.
Without `ESCAPE`, not so much.
Probably yes...
Probably yes...
In my reading of the 1992 draft, it seems the default should be no escape symbol at all.
True, I think.
This is apparently what Oracle uses (according to a help page for v10.1 I found online).
I thought Oracle had a (non-standard?) feature `SET ESCAPE ...` that changes the behaviour, but we do not need to rely on that.
But then many other implementations (MySQL, PostgreSQL) take backslash as the default,
True
which leads to confusing "help" webpages.
OPTiMaDe should be the only Help :)
What is the concrete proposal for OPTiMaDe? That we also include the ESCAPE keyword?
`property LIKE pattern` with backslash ('\') as the default escape character, translatable to SQL as `property_column LIKE pattern ESCAPE '\'`, with `pattern` taken more-or-less verbatim (after resolving the OPTiMaDe string double-quote and backslash escapes). "More-or-less" depends on whether we take `%` and `_` as wildcards (then it will be verbatim), or `*` and `?` as wildcards, in which case it will need translation, with good chances that people will get it wrong :)
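As a sketch of that translation step (my own illustration; the table and column names are invented and SQLite stands in for the backend), note that binding the pattern as a query parameter also addresses the SQL-injection worry raised at the top of this thread:

```python
import sqlite3

def run_like_query(conn, column, decoded_pattern):
    r"""decoded_pattern is the pattern text after resolving the Filter-string
    \" and \\ escapes; with % and _ as the OPTiMaDe wildcards it passes to
    SQL LIKE verbatim. 'column' must be validated against the known property
    names; only the pattern travels as a bound parameter (no SQL injection)."""
    sql = f"SELECT * FROM structures WHERE {column} LIKE ? ESCAPE '\\'"
    return conn.execute(sql, (decoded_pattern,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE structures (chemical_formula TEXT)")
conn.execute("INSERT INTO structures VALUES ('Ga1As1')")
print(run_like_query(conn, "chemical_formula", "Ga%As%"))   # [('Ga1As1',)]
```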
Why are you talking about character ranges? I'm talking about LIKE-equivalent (level 2) patterns here as you have described it in your PR
I must have meant the most general case, level 3, with regexps.
This is the whole point. Do we design syntax only for the level 2 patterns (`LIKE`), or do we want it to be general enough to be usable for level 3 as well? If it is only for `LIKE`, then regexps will have different metacharacters and different escape rules anyway (that is why I mentioned `[]` brackets). This would imply different syntax constructs for different complexity levels, as I currently suggest. And since SQL `LIKE` seems to be the only existing standardised implementation giving the level 2 complexity, I see no reason why we should deviate from it.
If we want just a single OPTiMaDe operator, `MATCH`, that covers all complexity levels (essentially, 2 and 3; do I understand correctly that this is your current suggestion?), then we need to think about how we convey them in the syntax. Should the brackets `[]` be escaped by a backslash even at level 2? Should bare `[]` be reported as an error at level 2 even though it is not used for matching? Should `a?` be the regexp "maybe-a", or "a followed by an arbitrary character"? Will it be the same at level 2 and level 3? We need to decide all these issues before we move on; it seems to me quite tricky to get right.
My suggestion is essentially to segregate levels 2 and 3 syntactically, by having `predicate LIKE pattern` (for level 2) and `predicate MATCH ERE pattern` (for level 3). True, this will require the user to know all these different pattern encodings; but a serious computer user will have to know them anyway, IMHO, and indicating syntactically which complexity level we mean will simplify the job for specification writers (meaning us ;), for implementers, and arguably also for query users (there is no confusion about what `?` means in `p LIKE 'xa?'` and in `p MATCH ERE 'xa?'`).
We can, of course, say that level 3 is not supported by OPTiMaDe at all, but that would be a pity, since some backends make RE search easy to implement, and people would start doing it in their own incompatible ways, which is what we wanted to avoid by describing optional (but standardised) features.
As a side note: SQL (SQL-99) has the `SIMILAR TO` predicate that supports bracketed character ranges in a standard way. Not all backends seem to implement it, though (the MySQL we have does not). I am not suggesting it for OPTiMaDe :)
After talking with Andrius over lunch, we came up with an alternative (compromise?) solution:
Let's leave just two "complexity levels":
- `STARTS/ENDS/CONTAINS` operators (required);
- an ERE `MATCH` operator (optional).

That is, we skip `LIKE` with all its "unusual" wildcards altogether.
The REs used could be a common subset of the expressions supported by ERE and PCRE.
The RE strings can be encoded as OPTiMaDe strings (in that case, the RE backslash will be encoded as two backslashes in a string), or we can provide a `/`-delimited specialised RE syntax – in that case, we can use a single physical backslash to escape RE meta-characters, improving readability at the expense of extra grammar rules in the Filter language. The recent commits (https://github.com/Materials-Consortia/OPTiMaDe/pull/160/commits/dc3ac45d7019c50c10a6976038e2781f24070463 -- https://github.com/Materials-Consortia/OPTiMaDe/pull/160/commits/6a197fba801aaa6c3f9282f489615807c8ef0b67) show how this can be done in principle.
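For illustration, the same ERE under the two encodings just described (an invented example; `\s` stands for whitespace in the RE):

```
author MATCH "[Nn]ame\\s+[Ss]urname"    (RE inside an OPTiMaDe string token: the RE backslash is doubled)
author MATCH /[Nn]ame\s+[Ss]urname/     (slash-delimited RE token: a single physical backslash suffices)
```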
Rationale:
- not supporting `LIKE` only creates problems for backend engines that implement SQL LIKE but do not implement RE matches; arguably, there are not many such engines;
- `MATCH`, if implemented, covers all the functionality of `LIKE`, and more;
- when a backend supports (PC|E)REs, the implementation will be simple – just pass the RE to the backend RE engine;
- since `MATCH` is optional, anyone who encounters difficulties implementing RE matching can simply skip it;
- the mandatory `STARTS/ENDS/CONTAINS` operators offer some of the `LIKE` functionality;
- since we have only one optional feature, there will be minimal fragmentation and no room for confusion;
- settling on EREs allows us to reuse the existing ERE standard (POSIX), with an increased likelihood that it will be implemented in backends in a compatible way.
A Regexp survey:
MySQL uses the extended version to support regular expression pattern-matching operations in SQL statements (https://dev.mysql.com/doc/refman/5.7/en/regexp.html)
MongoDB uses Perl compatible regular expressions (i.e. “PCRE” ) version 8.41 with UTF-8 support. (https://docs.mongodb.com/manual/reference/operator/query/regex/)
PostgreSQL supports the same three regular expressions flavors: Tcl Advanced Regular Expressions, POSIX Extended Regular Expressions and POSIX Basic Regular Expressions. (https://www.postgresql.org/docs/9.3/functions-matching.html), (https://www.regular-expressions.info/postgresql.html)
[SQLite] The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If an application-defined SQL function named "regexp" is added at run-time, then the "X REGEXP Y" operator will be implemented as a call to "regexp(Y,X)". (https://sqlite.org/lang_expr.html)
The solution [to get a PCRE implementation into SQLite3] was: `sudo apt-get install sqlite3-pcre`, which implements Perl regular expressions in a loadable module in /usr/lib/sqlite3/pcre.so: /.../ `.load /usr/lib/sqlite3/pcre.so` (https://stackoverflow.com/questions/5071601/how-do-i-use-regex-in-a-sqlite-query)
A word of caution:
While most regular-expression searches can be executed very quickly, regular expressions can be contrived that take arbitrary amounts of time and memory to process. Be wary of accepting regular-expression search patterns from hostile sources. If you must do so, it is advisable to impose a statement timeout.
LIKE searches, being much simpler than the other two options, are safer to use with possibly-hostile pattern sources. (https://www.postgresql.org/docs/9.3/functions-matching.html)
=> some people may want to implement `LIKE` but not `MATCH` for security reasons (to avoid DoS).
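As an illustration of the mitigation the PostgreSQL documentation hints at (a sketch of one possible approach only; the helper names are invented, and a process per match is of course not the only way to impose a budget):

```python
import re
from multiprocessing import Process, Queue

def _match(pattern, text, q):
    q.put(re.search(pattern, text) is not None)

def match_with_timeout(pattern, text, seconds=1.0):
    """Evaluate a user-supplied regex in a child process and kill it if it
    exceeds the time budget; catastrophic patterns like (a+)+$ can otherwise
    take exponential time in a backtracking engine."""
    q = Queue()
    p = Process(target=_match, args=(pattern, text, q))
    p.start()
    p.join(seconds)
    if p.is_alive():
        p.terminate()                  # hostile or pathological pattern
        p.join()
        raise TimeoutError("regex evaluation exceeded time budget")
    return q.get()

if __name__ == "__main__":
    print(match_with_timeout(r"[Nn]ame.*[Ss]urname", "Name A. Surname"))  # True
    # match_with_timeout(r"(a+)+$", "a" * 40 + "b")   # raises TimeoutError
```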
I like your latest suggestions for how to support regexes. I see no issue with implementing regex support first and leaving a hypothetical level 2 until later.
But can you clarify a homework-not-done question: is PCRE a strict superset of the functionality of EREs, so we can simply say that OPTiMaDe uses the POSIX ERE format, and databases that use PCRE can then simply plug those ERE expressions into their PCRE engines without issue? Or do we need to say that we only support the common subset of features supported by both EREs and PCREs? Because for the latter, we will have to do some work to define the precise syntax.
Furthermore, I'm in support (but it is not an overly strong opinion) of reducing the number of backslashes via your suggestion of a regex token in the format `/... regex pattern .../`, OR, alternatively, picking up my idea further up in this thread to extend our string tokens to carry meta-information about per-character literal-ity. (Which is conceptually a bit more tricky, but would resolve this issue more generally for all kinds of embedded mini-languages.)
Now onto the basic pattern level (level 2):
I'm not quite convinced of the idea of completely dropping the basic pattern level. Two reasons from the top of my mind:
- OPTiMaDe support for basic patterns allows backends that do not support regexes (or at least not exactly EREs/PCREs), but which can do other decent partial text matching, to provide at least that level of support. (As far as I can find online, an example of such a backend is MS Access.)
- Novice users may find the basic pattern syntax easier than the full regex syntax.
I agree with most of your latest comments about level 2 support, so we may be converging on a design here. In particular we seem to agree that:
- level 2 and level 3 partial text matching should use segregated syntax.
- OPTiMaDe basic pattern support should work equivalently to standard SQL LIKE with ESCAPE `\` (but possibly with other wildcard characters).
So, perhaps all that is left to decide is between these alternatives?:
- SQL LIKE-based basic patterns with wildcard characters `%` and `_`, as in that standard.
- SQL-LIKE-like basic patterns with wildcard characters `*` and `?`.
- Skip basic pattern support; have only regex.
I see good and bad things with each of these options, so I propose to bring this up for discussion at the next meeting.
@rartino: just a quick answer to the ERE <-> PCRE compatibility question:
But can you clarify a homework-not-done question: is PCRE a strict superset of the functionality of EREs, so we can simply say that OPTiMaDe uses the POSIX ERE format, and databases that use PCRE can then simply plug those ERE expressions into their PCRE engines without issue? Or do we need to say that we only support the common subset of features supported by both EREs and PCREs? Because for the latter, we will have to do some work to define the precise syntax.
I DuckDucked this subject yesterday and could not find a definitive answer, but it seems that ERE and PCRE are not in a subset-superset relation. A demonstration can be found in character set (`[a-z]`) behaviour:
- In EREs, "An escape inside a character class declaration shall match itself: in other words the escape character is not "special" inside a character class declaration; so `[\^]` will match either a literal '\' or a '^'" (https://www.boost.org/doc/libs/1_64_0/libs/regex/doc/html/boost_regex/syntax/basic_extended.html);
- In the Perl flavour of PCRE, 'Characters that may carry a special meaning inside a character class are: "\", "^", "-", "[" and "]" /.../ The sequences "\a", "\c", "\e", "\f", "\n", "\N{NAME}", "\N{U+hex char}", "\r", "\t", and "\x" are also special and have the same meanings as they do outside a bracketed character class' (`man perlrecharclass`), which in my understanding is incompatible with the ERE definition.
This is attested by the behaviour of GNU `grep`:
```
saulius@koala ~/ $ echo -e 'one\ntwo\none and a half\n1\\2\nTAB\tSEPARATED\nthree' | grep --color -E '[ \t]'
two
one and a half
1\2
three
saulius@koala ~/ $ echo -e 'one\ntwo\none and a half\n1\\2\nTAB\tSEPARATED\nthree' | grep --color -P '[ \t]'
one and a half
TAB     SEPARATED
```
That is, with the `-E` (ERE) option, GNU `grep` treats `\` in square brackets as an ordinary character (so both `\` and `t` are matched), while in the `-P` (PCRE) mode the `\t` in the bracketed character set is treated as a code for the TAB character, and is matched accordingly.
So yes, we will have to define a common subset...
(More on your remaining comments later; I think we are converging :)
I like your latest suggestions for how to support regexes. I see no issue with implementing regex support first and leaving a hypothetical level 2 until later.
I agree.
Furthermore, I'm in support (but it is not an overly strong opinion) of reducing the number of backslashes via your suggestion of a regex token in the format `/... regex pattern .../`, OR,
Fine, I will add more tests to check the current grammar more thoroughly.
We also need to decide which of two possible approaches we settle on:
- do we want the Filter grammar to accept any RE character and any RE escape in the `/.../` construct,
- or do we want it to specify the syntax of the supported REs in full?

The first approach is more flexible – to add an RE extension in the future, you do not need to change the Filter grammar, just amend the documentation. The downside is that the Filter language will then permit any sequence of characters in the REs, even if they do not make sense as REs, and will rely on the (informal) description for the back-end implementers to detect these.
The second approach is more strict – we will have a definite grammar specifying what a Filter RE is; implementations accepting this (or compatible) grammars will reject incorrect REs at the parsing stage, and all will do it in the same way; and we can attach a more formal meaning to RE constructs. However, future extensions will need extensions to the Filter grammar, and may potentially conflict with the capabilities of current parsing engines.
alternatively picking up my idea further up in this thread to extend our string tokens to carry meta information about per-character literal-ity. (Which is conceptually a bit more tricky, but would resolve this issue more generally for all kinds of embedded mini-languages.)
I am not quite sure what you mean by literal-ity here, and I am afraid that this approach already contains conflicting purposes (conflicting == not implementable at the same time in the same syntax).
The current Filter strings can encode arbitrary printable character sequences (which is probably enough for a search language). To make them fully transparent, we could add `\n`, `\t`, `\012` and `\x1BCD` style escapes, to specify arbitrary character sequences (both printable and non-printable).
Regular expressions are encoded by those character strings. The strings serve only as a wrapper layer for REs. This means, for example, that `\s` will not have any special meaning in strings (since it does not stand for any specific Unicode character); but in PCREs the two characters `\` and `s` form the special wildcard `\s`, matching arbitrary whitespace. We could say that `\` and a letter `s` in strings mean just those two characters when it is not a special escape, but what happens when at some stage one needs `\b` to denote the backspace control character and it is already taken as a `\` and `b` character sequence?
We can of course say that in Filter strings any combination of a backslash and a character is allowed, that both characters continue a string even if the second character is a double quote, and that we do not prescribe any interpretation for these characters; i.e. in `"some \strange string"` the `\s` pair might mean "match arbitrary whitespace" in REs and "a literal s character" in strings... I am not sure this is a good approach, since we leave it to implementations to interpret the meaning of the escapes.
/.../
I see good and bad things with each of these options, so I propose to bring this up for discussion at the next meeting.
OK.