WebSearch: "OR NOT" bug

Open: kaplun opened this issue 10 years ago • 1 comment

A user reports on INSPIRE: [...] I'm having some problems related to the Google-like search functionality, which I guess are related to the handling of parentheses.

As an example take the two following searches:

author:"a valcarce" -author:"catelan, m"

author:"e. hernandez" (salamanca | valladolid)

As far as I know, both are well formed. The first one returns 161 results and the second one 94.

Now we may try the search which combines both with an OR:

(author:"a valcarce" -author:"catelan, m") | (author:"e. hernandez" (salamanca | valladolid))

We obtain 94 results, which makes no sense (at least the first 161 results should satisfy this query). I would be grateful to be told why this is happening and how to solve it, if possible.

Despite having carefully checked my syntax, I was unable to find any error. If there is one, I apologize. [...]

Removing the outer right parentheses yields 248 results:

(author:"a valcarce" -author:"catelan, m") | author:"e. hernandez" (salamanca | valladolid)

[...]

@tsgit reports: [...] it's because the search token substitution in the conjunctive normal form mishandles the negation, turning it into "and not" instead of "or not" [...] what that means is that

['(', 'p2', '|', 'p0', ')', '+', '(', 'p3', '|', 'p4', '|', 'p0', ')', '+', '(', 'p2', '|', '-', 'p1', ')', '+', '(', 'p3', '|', 'p4', '|', '-', 'p1', ')']

results in

['+', 'author:"e. hernandez" | author:"a valcarce"', '+', 'salamanca | valladolid | author:"a valcarce"', '+', 'author:"e. hernandez" - author:"catelan, m"', '+', 'salamanca | valladolid - author:"catelan, m"']

so '|' '-' becomes '-' ?

that's a bug [...] this also means that the much simpler search

valladolid | -author:"catelan, m"

does not return the same result as

-author:"catelan, m" | valladolid [...]
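To make the effect of the dropped '|' concrete, here is a toy model using Python sets. It is only an illustration: the record IDs and the names `all_records`, `valladolid` and `catelan` are invented, and this is not how WebSearch stores or computes hit sets.

```python
# Toy collection: record IDs are invented purely for illustration.
all_records = set(range(1, 11))
valladolid = {1, 2, 3}   # records matching: valladolid
catelan = {3, 4, 5}      # records matching: author:"catelan, m"

# Intended meaning of: valladolid | -author:"catelan, m"
# i.e. "valladolid OR (everything NOT catelan)".
intended = valladolid | (all_records - catelan)

# What the clause degrades to when '|' '-' collapses to '-':
# valladolid - author:"catelan, m", i.e. "valladolid AND NOT catelan".
collapsed = valladolid - catelan

print(sorted(intended))   # [1, 2, 3, 6, 7, 8, 9, 10]
print(sorted(collapsed))  # [1, 2]
```

The collapsed form can only ever shrink the left-hand set, which is consistent with the combined query above returning fewer hits than one of its own operands.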

kaplun, May 19 '15 07:05

This stems from the fact that nested expressions weren't first-class citizens in the original query parser. While expressions like "foo AND NOT bar" can be reduced to linear left-to-right (L2R) processing by transforming them to "foo -bar", expressions like "foo OR NOT bar" require "invisible" parentheses, as in "foo OR (everything NOT bar)".
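A minimal sketch of the idea, assuming a flat list of (operator, term) units evaluated left to right over hit sets. The function `evaluate`, the `'|-'` operator spelling and the sample data are hypothetical and do not come from the legacy WebSearch code; they only show where the "invisible" parentheses come in.

```python
def evaluate(units, hits, universe):
    """Evaluate (operator, term) units strictly left to right.

    Operators (hypothetical spelling):
      '+'  AND        '-'  AND NOT
      '|'  OR         '|-' OR NOT
    """
    result = None
    for op, term in units:
        matched = hits[term]
        if op in ("-", "|-"):
            # The "invisible parentheses": (everything NOT term).
            matched = universe - matched
        if result is None:
            # First unit: nothing to combine with yet.
            result = set(matched)
        elif op in ("+", "-"):
            result &= matched
        else:
            result |= matched
    return result if result is not None else set()


# Invented sample data.
universe = {1, 2, 3, 4, 5, 6}
hits = {"foo": {1, 2}, "bar": {2, 3}}

# "foo AND NOT bar" reduces to the linear form "foo -bar".
print(sorted(evaluate([("+", "foo"), ("-", "bar")], hits, universe)))
# [1]

# "foo OR NOT bar" needs the complement of bar before the union.
print(sorted(evaluate([("+", "foo"), ("|-", "bar")], hits, universe)))
# [1, 2, 4, 5, 6]
```

For "AND NOT" the complement is harmless because intersecting with it equals plain set difference. For "OR NOT" the complement against the whole collection is what a purely linear "foo -bar" rewrite loses.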

This should not be a problem in the Invenio 2.1 code base (untested), thanks to the invenio-query-parser package, where nested expressions are first-class citizens. In the legacy code base we can fix the problem by introducing special treatment for "OR NOT". I can take care of this bug for maint-1.2 as well, because it affects the last stable release too.

tiborsimko, May 19 '15 08:05