Regarding the frequency used for scoring sloppy phrase queries.
Description
I've been using Lucene (through OpenSearch) for querying and scoring human-written documents (!= logs). I often use sloppy phrase queries to handle language variations that express similar meaning.
For example, one of my queries contained `"frauduleus faillissement"~3` (Dutch for "fraudulent bankruptcy"). Thanks to the sloppy match, the following two sentences are correctly matched by that query:
- Dit faillissement is frauduleus (This bankruptcy is fraudulent)
- Dit is een frauduleus faillissement (This is a fraudulent bankruptcy)
While the two sentences are very similar in meaning, the `SloppyPhraseScorer` scores them very differently. According to the following lines, the sentences get frequencies of 0.25 and 1 respectively, leading to a large difference in the resulting relevance score:

https://github.com/apache/lucene/blob/3ce9ba9fd51a9b4e7228d81e19acbdb8b18f4e12/lucene/core/src/java/org/apache/lucene/search/SloppyPhraseMatcher.java#L166-L169
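To make the arithmetic concrete, here is a minimal standalone sketch: the formula mirrors the linked lines, but the demo class itself is illustrative and not Lucene code, and the match lengths are taken from the example above.

```java
public class SloppyWeightDemo {
  // Mirrors the weight from the linked SloppyPhraseMatcher lines:
  // each match contributes 1f / (1f + matchLength) to the phrase frequency.
  static float sloppyWeight(int matchLength) {
    return 1f / (1f + matchLength);
  }

  public static void main(String[] args) {
    // "Dit is een frauduleus faillissement": terms adjacent and in order,
    // so matchLength is 0 and the match contributes the full weight.
    System.out.println(sloppyWeight(0)); // 1.0
    // "Dit faillissement is frauduleus": per the example above, this
    // reordered match has a matchLength of 3.
    System.out.println(sloppyWeight(3)); // 0.25
  }
}
```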
While I understand the intent of `sloppyWeight()` to penalize sloppy matches relative to the exact match, I feel that the penalty is way too strong. As a user, I deliberately choose to look for sloppy phrase matches, and I wouldn't expect such a strong penalty to apply. For this particular example, I would get more accurate scoring by using multiple exact phrase queries.
Browsing the history of `SloppyPhraseScorer`, I see that this scoring approach has been in place for years, and I did not find any issue discussing the implementation. Even though my use case might be niche, I believe revising that scoring method could greatly benefit applications that search human-written text.
I agree that the penalty feels too high. We have challenges with queries like this one because there is no good theoretical basis for the right way to score them (at least to my knowledge), which forces us to make guesses. The one implemented at the moment is certainly not great, but at least it's parameter-free. I would imagine that something like `k / (k + matchLength)` would work better for you, but it would then require us to expose a configuration option, which isn't great either.
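For illustration only, that suggestion could look something like the sketch below; the method signature and the idea of a configurable `k` are assumptions, not an existing Lucene API.

```java
// Hypothetical generalization of the current weight. k = 1 reproduces
// today's 1f / (1f + matchLength); larger k values soften the penalty.
static float sloppyWeight(float k, int matchLength) {
  return k / (k + matchLength);
}
```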
That's my personal opinion, but given that it's up to the user to allow sloppy matches, I question why the score penalty exists at all. If a user wants to rank exact matches higher, it's easy to do so by boosting a more restrictive query.
Now, if the penalty is there to stay, I would argue that an exponential decay function would be more suitable for the penalty weight. The penalty would then decrease by a fixed factor per unit of slop, unlike the current implementation, which penalizes small slops very harshly but larger ones much less.
This could be implemented, for example, with a weight like `Math.pow(k, matchLength)`, with `k` fixed to, say, `0.95` for a parameter-free implementation.
See a comparison of penalty weights for the current implementation and the exponential decay, for `matchLength` values 0 through 9:

| `matchLength` | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| `1f / (1f + matchLength)` (current) | 1.0 | 0.5 | 0.333 | 0.25 | 0.2 | 0.167 | 0.143 | 0.125 | 0.111 | 0.1 |
| `Math.pow(0.95, matchLength)` (proposed) | 1.0 | 0.95 | 0.902 | 0.857 | 0.815 | 0.774 | 0.735 | 0.698 | 0.663 | 0.63 |
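For completeness, both rows can be reproduced with a throwaway snippet (again, not Lucene code):

```java
public class PenaltyComparison {
  public static void main(String[] args) {
    for (int matchLength = 0; matchLength < 10; matchLength++) {
      float current = 1f / (1f + matchLength);       // current weight in SloppyPhraseMatcher
      double proposed = Math.pow(0.95, matchLength); // proposed exponential decay
      System.out.printf("%2d  %.3f  %.3f%n", matchLength, current, proposed);
    }
  }
}
```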