lunr.js

Explain these results

Open dkijkuit opened this issue 6 years ago • 8 comments

Can someone please explain these search results? I have multiple fields in my index but I only want to search the 'title' field so I created this query:

var searchText = "business";
// Restrict the term to the "title" field only.
var results = this.mainIndex.query(function (q) {
  q.term(searchText, { fields: ["title"] });
});

The first two results for the title are in the following order:

  • Manage business processes
  • Business to Business [B2B]

Why does 'Manage business processes' show up first? Shouldn't the second result be on top, since it contains the search term twice?

dkijkuit avatar Oct 20 '17 10:10 dkijkuit

Could you post a reproduction somewhere like jsfiddle?

It's difficult to say why you are getting those results without seeing how the index was defined and without the actual data. From what you've shared, though, I would expect "Business to Business [B2B]" to rank above "Manage business processes".

olivernn avatar Oct 22 '17 08:10 olivernn

Yes, I've created an example: https://jsfiddle.net/jg7fugfw/7/

dkijkuit avatar Oct 22 '17 13:10 dkijkuit

Thanks for the fiddle, makes it much easier to figure out what is going on.

Looking at the results, both documents have the same score, which seemed unexpected. Taking a closer look at the data, though, the second title contains a misspelling: "Busisess" instead of "Business". The results are therefore expected, since both documents have only a single "business" in the title.

Updating the example to include a correctly spelled "Business" twice in the title for document 2 leads to the results I would expect.

olivernn avatar Oct 23 '17 18:10 olivernn

I'm going to close this issue now as I don't think there is anything that needs to be done in Lunr. Feel free to leave a comment if you have any other related questions.

olivernn avatar Oct 23 '17 19:10 olivernn

Sorry, my mistake. I've now changed the example to match the exact case I am using in my code, and it shows the incorrect results again. I've changed the body field, but that field should be ignored because I specifically search in the title field. Can you please have a look again?

https://jsfiddle.net/jg7fugfw/22/

Other than that, thanks for developing this great plugin!!

dkijkuit avatar Oct 24 '17 09:10 dkijkuit

Yeah, that is weird. I wouldn't have expected those results given those documents. I'll take a closer look...

olivernn avatar Oct 24 '17 14:10 olivernn

I've poked around at the data in the index you provided to try to understand the results you are getting. I think the problem is due to term saturation. I'll try to explain what is happening and why Lunr is assigning the scores it is.

The term "business" is very common in your example set of documents: it occurs 9 times across all fields. Ordinarily, a term appearing many times in a document would indicate that the term is very relevant to that document. However, this can lead to problems where longer documents dominate shorter ones just by containing more occurrences of a particular term, even if the term isn't very relevant to them.

Lunr tries to accommodate this using term frequency saturation, which just means that after a certain point, more occurrences of a term have less and less effect on the relevancy score for that term. This saturation level can be tuned using the k1 parameter when building the index. In your example you can experiment with this value to see the effect on the score. Unfortunately, there does not seem to be any value that will switch the order of the results you are seeing.
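To make the saturation effect concrete, here is a minimal sketch of the term-frequency component of a standard BM25-style score. It is an illustration under assumptions, not Lunr's source: the function name saturatedTf, the length-normalisation parameter b, and the default values (k1 = 1.2, b = 0.75, commonly used BM25 defaults) are not taken from this thread.

```javascript
// Standard BM25-style term-frequency component (a sketch, not Lunr's source).
// tf:  number of occurrences of the term in the field
// k1:  saturation parameter; higher values saturate more slowly
// b:   strength of length normalisation (assumed parameter, 0..1)
// fieldLength / avgFieldLength: field length relative to the average
function saturatedTf(tf, k1 = 1.2, b = 0.75, fieldLength = 1, avgFieldLength = 1) {
  const lengthNorm = 1 - b + b * (fieldLength / avgFieldLength);
  return (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

// Each extra occurrence adds less than the one before it.
for (const tf of [1, 2, 3, 5, 10]) {
  console.log(tf, saturatedTf(tf).toFixed(3));
}
```

In this sketch the value climbs towards k1 + 1 but never reaches it, so the second occurrence of a term is worth much less than the first, and a field that is longer than average is penalised for the same number of occurrences.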

There may be improvements possible in Lunr that would help in this case, specifically making the IDF calculation field-aware, though it's not clear that it would 'fix' this exact case.
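As a rough illustration of what "field-aware" IDF could mean, the sketch below compares an IDF computed over all fields with one computed over a single field. Everything here is an assumption for illustration: the toy documents are made up (they are not the data from the fiddle), the helper names idf, globalIdf and fieldIdf are hypothetical, and the formula is the common BM25-style IDF rather than a statement of how Lunr counts documents internally.

```javascript
// BM25-style inverse document frequency: rarer terms score higher.
function idf(docCount, docsWithTerm) {
  return Math.log(1 + (docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
}

// Made-up toy corpus, for illustration only (not the data from the fiddle).
const docs = [
  { title: "Manage business processes", body: "business rules and business goals" },
  { title: "Release notes", body: "changes for business users" },
  { title: "Getting started", body: "set up your business account" },
  { title: "Contact", body: "reach the support team" },
];

// Naive tokenizer: lowercase, split on non-word characters.
const contains = (text, term) => text.toLowerCase().split(/\W+/).includes(term);

// Field-agnostic: a document counts if the term appears in any field.
function globalIdf(term) {
  const n = docs.filter(d => contains(d.title, term) || contains(d.body, term)).length;
  return idf(docs.length, n);
}

// Field-aware: only occurrences in the requested field are counted.
function fieldIdf(term, field) {
  const n = docs.filter(d => contains(d[field], term)).length;
  return idf(docs.length, n);
}

console.log(globalIdf("business"), fieldIdf("business", "title"));
```

In this toy corpus "business" is common overall but rare in titles, so the field-aware IDF is higher than the field-agnostic one. Whether a change like this would reorder the results in the original example is, as stated above, unclear.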

In conclusion, trying to represent a term's relevance to a given document in a single number is inherently a lossy process, and a catch-all algorithm for doing so is always going to fall short in some edge cases. Relevance is a hard problem, even more so with the limited resources available to Lunr running in your browser. I know that doesn't really 'solve' your problem, but hopefully it makes the results slightly easier to understand. If you're interested in the topic, there is an interesting article about the scoring method used in Lunr (but from a Lucene perspective), with a specific section on term saturation.

I'll continue to dig into cases like this, but I doubt there is any quick fix available unfortunately.

olivernn avatar Oct 29 '17 10:10 olivernn

Thanks for the explanation, I understand the complexity of dealing with this specific case. Wouldn't it be possible to completely ignore the 'body' field in this case, since the query specifically restricts the search to the 'title' field? That way you could search documents on specific characteristics rather than across the whole index.

dkijkuit avatar Oct 31 '17 08:10 dkijkuit