Search (lunr) is buggy

Open hansfn opened this issue 3 years ago • 1 comments

Describe the bug

Words in the search index aren't matched when searching.

To Reproduce

I created a dummy site with just one page, which resulted in the following search index:

{"0": {
    "doc": "Index page",
    "title": "Index page",
    "content": "# Index page Just testing ",
    "url": "https://hansfn.github.io/",
    "relUrl": "/"
  }
}

The default search config is used.

Searching for "Just" doesn't work. Searching for "testi" works, but "testin" doesn't. (In interactive use, the matched page disappears as soon as the "n" is typed.)

Until I modify my dummy site, it can be tested on https://hansfn.github.io/ - source on https://github.com/hansfn/hansfn.github.io

Expected behavior

  1. Any whole word in the index should match
  2. Extending a substring match shouldn't remove the match.

Solution

Upgrade lunr.js? I notice that version 2.3.6 (from 2019) is still used. The current version is 2.3.9.

I haven't tested locally whether it makes any difference.

hansfn, Aug 16 '21 10:08

Agreed that this problem is reproducible. A PR is welcome - either to update the lunr.js version (which, if I had to guess, probably doesn't resolve this problem) or to resolve it some other way.

Since this isn't particularly scoped out, I'm going to mark it as "needs investigation" - but a PR is one good way to investigate :)

mattxwang, Jun 28 '22 06:06

@hansfn in fact this isn't a bug, it's expected behaviour! The word "just" is one of many stop words that lunr.js filters out, so it is never matched. For example, if you replace "Just" with "any", searching for "any" doesn't find it either.

BTW, searching for the word "just" on the theme website doesn't find its occurrences on the home page – the results are all for places where "just" occurs as part of a longer word.

The removal of stop words is mentioned on the lunr.js Core Concepts page, and it's clear in the source code, which lists about 120 stop words for English: see https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L1182
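
To make the effect concrete, here is a minimal sketch of stop-word filtering in plain Python. It is not lunr.js code: the tokenizer, the `build_index` and `search` helpers, and the two-word stop list are all hypothetical and only illustrate the idea, assuming stop words are dropped before terms reach the index. The document is the one from the search index in the original report.

```python
import re

# Illustrative subset only: "just" and "any" are the stop words named in
# this thread; the real list in lunr.js is much longer.
STOP_WORDS = {"just", "any"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token that is not a stop word to the URLs containing it."""
    index = {}
    for doc in docs:
        for token in tokenize(doc["content"]):
            if token in STOP_WORDS:
                continue  # stop words never reach the index
            index.setdefault(token, set()).add(doc["relUrl"])
    return index


def search(index, query):
    """Return the URLs whose indexed terms include every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


docs = [{"content": "# Index page Just testing ", "relUrl": "/"}]
index = build_index(docs)

print(sorted(index))          # ['index', 'page', 'testing']
print(search(index, "Just"))  # set()
print(search(index, "page"))  # {'/'}
```

Because "just" is discarded while the index is built, no query can ever find it, even though the word is plainly present in the page content.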

pdmosses, Sep 04 '22 21:09

@pdmosses great investigation, much appreciated.

Since users are running into this problem - I'm thinking this is a chance to improve our documentation (at the very least, pointing users to the lunr docs for issues they run into). What are your thoughts?

Once we've settled on a next step, we can close this issue.

mattxwang, Sep 07 '22 06:09

Thank you, @pdmosses, for the stop word explanation. I should have realized that, since I have used stop words myself before.

However, that doesn't explain the second problem: matches are removed when a matching search term is extended. Test with "writi" on https://just-the-docs.github.io/just-the-docs/ and then add an "n" so you get "writin": no matches. ("writing" of course matches again.)

hansfn, Sep 07 '22 07:09

@hansfn I think Lunr applies stemming to the query word as well as the document contents. I'm not familiar with the details of the Porter stemmer used by Lunr, but perhaps it produces different stems from "writi" and "writin":

The stemmer used by Lunr does not guarantee that the stem of a word it finds is an actual word, but all inflections and derivatives of that word should produce the same stem.

Assuming that our theme always shows all the search results produced by Lunr, the sudden unexpected disappearance of some matches looks more like an issue with Lunr than a theme bug that we could address.
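
As a rough illustration of the stemming point, the sketch below uses a toy suffix stripper. It is a hypothetical stand-in, not the Porter stemmer that Lunr uses, and the exact-match lookup is an assumption made for brevity. It shows that complete inflections of a word share a stem, while a half-typed word such as "testin" is not an inflection and is left unchanged.

```python
def toy_stem(word):
    """Strip one common English suffix. A toy stand-in, NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters so that short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Terms as they would be stored after stemming the page content.
indexed_terms = {toy_stem(w) for w in ["index", "page", "testing"]}


def matches(query):
    """Exact match of the stemmed query against the stemmed index terms."""
    return toy_stem(query.lower()) in indexed_terms


for query in ["testing", "tested", "tests", "testin"]:
    print(query, toy_stem(query), matches(query))
# testing test True
# tested test True
# tests test True
# testin testin False
```

This only shows why an incomplete word can miss a stemmed index term. It does not model wildcard or fuzzy matching, so it says nothing about why the shorter "testi" does match.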

pdmosses, Sep 07 '22 11:09

@mattxwang re:

Since users are running into this problem - I'm thinking this is a chance to improve our documentation (at the very least, pointing users to the lunr docs for issues they run into).

The first line of our Search docs already contains a link to the Lunr docs. We might add a callout with a warning that stop words are ignored, and that the stemming is algorithmic (hence imperfect).

But I think we should defer all but the most urgent improvements to our docs until after the release of v0.4.0.

pdmosses, Sep 07 '22 12:09

Assuming that our theme always shows all the search results produced by Lunr, the sudden unexpected disappearance of some matches looks more like an issue with Lunr than a theme bug that we could address.

Yes, you are correct. Closing the issue is fine by me. It's documented by our discussion ;-)

hansfn, Sep 08 '22 06:09