
Expression::lex parses negation inconsistently

Open spud opened this issue 3 years ago • 0 comments

I've just tried out my first implementation of TNTSearch, so bear with me!

I'd been struggling with strange results from boolean searches using the "foo -bar" syntax, seeing results that were clearly inaccurate. Glancing at the source code, I noticed that the tilde (~) was also used for excluding words, so I tried the same query using "foo ~bar", expecting the same result set, but got totally different (and more accurate) results.

While debugging, I noticed that the output produced by Expression::lex was different in the two cases.

    $ex = new Expression();
    $tokens_1 = $ex->lex("foo -bar");
    $tokens_2 = $ex->lex("foo ~bar");

The problem is that $tokens_1 != $tokens_2.

That simple inconsistency is the basic bug for this report. But I am aware of #246, and I cannot speak to whether or not this fix might address any aspect of that issue. I do know that $tokens_1 was producing wildly inaccurate results, and $tokens_2 produced much better matches, so there is definitely a difference in the results they produce.

A quick look at the code for lex seems to indicate that the inconsistency in parsing can be rectified by reordering the initial search and replace arrays:

    $bad = [' or ', ' ', '-'];
    $good = ['|', '&', '~'];

This ends up producing the same token array in both situations. I'm just not familiar enough with the implications of that change (it's consistent, but is it right?) to go straight to a pull request. (But happy to if this is confirmed.)
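
For anyone who wants to check the reordering in isolation, here is a minimal sketch using only PHP's built-in str_replace with the two arrays proposed above. It assumes that lex passes these arrays straight to str_replace before tokenizing. It does not reproduce the rest of lex, so it only shows that both inputs normalize to the same string, not what the final token array looks like.

```php
<?php
// Proposed search/replace arrays from this report.
// Assumption: lex() feeds them directly to str_replace().
$bad  = [' or ', ' ', '-'];
$good = ['|', '&', '~'];

// With array arguments, str_replace() applies each pair in order,
// each one operating on the result of the previous replacement.
$a = str_replace($bad, $good, 'foo -bar');
$b = str_replace($bad, $good, 'foo ~bar');

// Both inputs should now normalize to the same string.
var_dump($a === $b);
```

Because the replacements run sequentially, ' or ' has to stay ahead of ' ' in the list; otherwise the spaces around "or" would already have been turned into '&' before the ' or ' pattern is tried.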

spud, Jan 31 '22 05:01