WebSearch: "OR NOT" bug
A user reports on INSPIRE: [...] I'm having some problems related to the Google-like search functionality which I guess are related to parenthesis handling.
As an example take the two following searches:
author:"a valcarce" -author:"catelan, m"
author:"e. hernandez" (salamanca | valladolid)
As far as I know, both are well formed. The first one returns 161 results and the second one 94.
Now we may try the search which combines both with an OR:
(author:"a valcarce" -author:"catelan, m") | (author:"e. hernandez" (salamanca | valladolid))
We obtain 94 results, which makes no sense (at least the first 161 results should satisfy this query). I would be grateful to be told why this is happening and how to solve it, if possible.
Despite having carefully checked my syntax, I was unable to find any error. If there is one, I apologize. [...]
Removing the parentheses around the right-hand operand yields 248 results:
(author:"a valcarce" -author:"catelan, m") | author:"e. hernandez" (salamanca | valladolid)
[...]
@tsgit reports: [...] it's because the search token substitution in the conjunctive normal form mishandles the negation to "and not" instead of "or not" [...] what that means is that
['(', 'p2', '|', 'p0', ')', '+', '(', 'p3', '|', 'p4', '|', 'p0', ')', '+', '(', 'p2', '|', '-', 'p1', ')', '+', '(', 'p3', '|', 'p4', '|', '-', 'p1', ')']
results in
['+', 'author:"e. hernandez" | author:"a valcarce"', '+', 'salamanca | valladolid | author:"a valcarce"', '+', 'author:"e. hernandez" - author:"catelan, m"', '+', 'salamanca | valladolid - author:"catelan, m"']
so '|' '-' becomes '-' ?
that's a bug [...] this also means that the much simpler search
valladolid | -author:"catelan, m"
does not return the same result as
-author:"catelan, m" | valladolid
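The asymmetry can be reproduced with a toy model of strictly left-to-right evaluation. This is only a sketch, not Invenio's search engine code: the record ids, the `HITS` table and the `eval_l2r` helper are hypothetical, and the collapse of '|' '-' into a plain '-' is modelled as described in the report above.

```python
# Hypothetical toy model; record ids and helper names are made up for illustration.
ALL = {1, 2, 3, 4, 5}  # every record in the collection
HITS = {
    "valladolid": {1, 2},
    'author:"catelan, m"': {2, 3},
}


def eval_l2r(units):
    """Evaluate (operator, term) pairs strictly left to right.

    '+' intersects, '|' unites and '-' subtracts, starting from all records.
    """
    result = set(ALL)
    for op, term in units:
        hits = HITS[term]
        if op == "+":
            result &= hits
        elif op == "|":
            result |= hits
        elif op == "-":
            result -= hits
        else:
            raise ValueError("unknown operator: %r" % op)
    return result


# valladolid | -author:"catelan, m", with '|' '-' collapsed to '-' (the bug)
buggy = eval_l2r([("+", "valladolid"), ("-", 'author:"catelan, m"')])

# -author:"catelan, m" | valladolid, where the negation comes first
swapped = eval_l2r([("-", 'author:"catelan, m"'), ("|", "valladolid")])

print(sorted(buggy))    # [1]
print(sorted(swapped))  # [1, 2, 4, 5]
```

In this toy data set the second ordering happens to give the result one would expect from "valladolid OR NOT catelan", while the first silently turns into "valladolid AND NOT catelan".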
[...]
This stems from the fact that nested expressions weren't first-class citizens in the original query parser, so to speak. While expressions like "foo AND NOT bar" can be reduced to linear left-to-right processing by transforming them to "foo -bar", expressions like "foo OR NOT bar" require "invisible" parentheses, as in "foo OR (everything NOT bar)".
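In set terms, the difference between the two cases can be sketched as follows. This is a minimal illustration with made-up record ids; `and_not` and `or_not` are hypothetical helper names, not functions from the Invenio code base.

```python
# Hypothetical record ids; not taken from INSPIRE.
EVERYTHING = frozenset(range(1, 7))  # all records: 1..6
FOO = frozenset({1, 2})              # records matching "foo"
BAR = frozenset({2, 3, 4})           # records matching "bar"


def and_not(acc, hits):
    """foo AND NOT bar: linear, equivalent to 'foo -bar'."""
    return acc - hits


def or_not(acc, hits, universe=EVERYTHING):
    """foo OR NOT bar: needs the complement, i.e. foo OR (everything NOT bar)."""
    return acc | (universe - hits)


print(sorted(and_not(FOO, BAR)))  # [1]
print(sorted(or_not(FOO, BAR)))   # [1, 2, 5, 6]
```

The "OR NOT" case cannot be computed from the running result and the hit set of "bar" alone; it also needs the full record universe, which is why it calls for special treatment in a linear evaluator.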
This should not be a problem in the Invenio 2.1 code base (untested), thanks to the invenio-query-parser package, where nested expressions are first-class citizens. In the legacy code base we can fix the problem by introducing special treatment for "OR NOT". I can take care of this bug for maint-1.2 as well, because it affects the last stable release too.