tntsearch
Incomplete Results for Strings that Include Numbers
Moving this issue from https://github.com/trilbymedia/grav-plugin-tntsearch/issues/122 to here after getting verification that it's not a Grav issue.
Here are the details:
A client site uses part numbers in page titles and content (e.g., "SPK1000", "7393 Horn Driver") and TNTSearch isn't returning all matches when the minimum (three) characters are entered.
Test case 1 is a search for "spk", which should return "spk1000" and "spk7457", but only the first appears:
A search for "spk7" returns "spk7457", which should also appear in the previous search:
Test 2 is a search for "739", which should return three results (two instances of "7393 Horn Driver" and one with "739" in the body of the text), but instead only returns the latter:
A search for "7393" turns up the first two expected above (two instances of "7393 Horn Driver"):
Using the Test 2 "739" search documented above, with index rebuild + cache clear between tests, I tried the following configuration changes with no luck:
- fuzzy enabled and then disabled, no change
- phrases enabled and then disabled, no change
- search type of "auto" and then "basic", no change
- stemmer enabled and then disabled, no change
By default, TNTSearch operates on full words. With a partial search it returns only the first matching word, i.e. "spk1000" and "spk7457" are treated as completely different words, so only the first result is returned.
This is true for all words, whether or not they contain numbers. For "normal" words, however, this is alleviated by the stemmer, which obviously doesn't work for numbers. According to this comment, the only way to partially search words containing numbers is fuzzy search.
I have looked at the fuzzy search code and unfortunately it is a complete mess :(.
- Fuzzy search is never called if at least one full match is found. We even have a pull request to fix this.
- The default Levenshtein distance of 2 is not enough in your case. Even if we change the default distance, there will still be cases where fuzzy search doesn't include some results. That's because the Levenshtein algorithm wasn't created for partial search. It was created to find similar words, which in most cases are also full "normal" words.
- Some parts of the code are never reached (?) at all, for example https://github.com/teamtnt/tntsearch/blob/4225e1290b7d77bdfe0fcda3cefa895e4bf2282a/src/TNTSearch.php#L251
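To put numbers on the distance point, here is a minimal sketch in Python (not TNTSearch's PHP code; the `levenshtein` helper is my own illustration) that computes the edit distance between the queries from this issue and the part numbers they should match:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Queries and terms taken from the test cases in this issue.
for query, term in [("spk", "spk1000"), ("spk", "spk7457"),
                    ("spk7", "spk7457"), ("739", "7393")]:
    print(query, term, levenshtein(query, term))
```

A prefix query is always as far from the full term as the number of missing characters, so "spk" is 4 edits away from both "spk1000" and "spk7457", beyond a limit of 2. "739" is only 1 edit away from "7393", but that does not help here because of the first point: an exact match for "739" exists, so fuzzy search is never run.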
I guess the correct way to fix this is for somebody to implement a partial search algorithm. This would also enable partial search for languages that do not have a stemmer.
You can also try the patch in the mentioned pull request to see if it changes things for you. Hopefully that helps a bit.
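As a rough idea of what such a partial search could look like, here is a sketch in Python of prefix matching over a sorted term list. It is only an illustration under my own assumptions: the `prefix_search` function and the term list are hypothetical and do not reflect how TNTSearch stores its index.

```python
from bisect import bisect_left


def prefix_search(sorted_terms, prefix):
    """Return every indexed term that starts with `prefix`.

    `sorted_terms` must be sorted; all terms sharing a prefix are then
    adjacent, so we jump to the first candidate and scan forward.
    """
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # past the block of terms sharing the prefix
        matches.append(term)
    return matches


# Hypothetical term list built from the examples in this issue.
terms = sorted(["739", "7393", "horn", "driver", "spk1000", "spk7457"])

print(prefix_search(terms, "spk"))   # all terms starting with "spk"
print(prefix_search(terms, "739"))   # the exact term plus longer part numbers
```

With this approach "spk" yields both "spk1000" and "spk7457", and "739" yields both "739" and "7393", which is the behaviour the original report expects. Each matched term would then be looked up in the index as usual to collect the documents.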