tantivy
Phrase query: scale score on used slop
A few points to address:
- the actual scoring function:
  - I've gone for ~~sloppy maths~~ a heuristic: https://github.com/quickwit-oss/tantivy/pull/1414/files#diff-6778ebdb19b3c516dabd923da7eff56076630ce5e4cd7804f28dc43265751661R139-R149. The score is scaled by $1 - \frac{\text{used slop}}{(\text{slop}+1) \times \text{phrase count}}$, which guarantees a scaling factor in $(0, 1]$.
  - there are probably a lot of other options. I haven't looked into the BM25 implementation, for instance. The intersection left after `phrase_match` is executed is the vector of matching positions, which could also be of use.
- the implementation:
  - the first commit is a quick hack/POC
  - the second one is an attempt at a better implementation. It creates a `PhraseScorer` trait as well as two implementations, `ExactPhraseScorer` and `SlopPhraseScorer`
closes #1392
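For reference, a minimal standalone sketch of the scaling heuristic described above. `slop_scaling_factor` and its parameter names are hypothetical and are not taken from the PR's code. It assumes `used_slop` is the total slop consumed across all matches in the document, so it can never exceed `slop * phrase_count`.

```rust
/// Hypothetical helper, not tantivy's API: computes the factor
/// 1 - used_slop / ((slop + 1) * phrase_count).
///
/// Assumption: `used_slop` is summed over every phrase match in the
/// document, so `used_slop <= slop * phrase_count` and the result
/// stays within (0, 1].
fn slop_scaling_factor(used_slop: u32, slop: u32, phrase_count: u32) -> f32 {
    // Without any match there is nothing to scale.
    if phrase_count == 0 {
        return 0.0;
    }
    // Dividing by (slop + 1) rather than slop keeps the factor
    // strictly positive even when every match uses the full slop.
    let denominator = (slop as f32 + 1.0) * phrase_count as f32;
    1.0 - used_slop as f32 / denominator
}
```

An exact match (`used_slop = 0`) keeps the full score, while a single match that uses all of a slop of 2 is scaled by 1/3.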
Codecov Report
Merging #1414 (592f3c1) into main (c127367) will increase coverage by 0.00%. The diff coverage is 95.74%.
@@ Coverage Diff @@
## main #1414 +/- ##
========================================
Coverage 94.30% 94.31%
========================================
Files 236 239 +3
Lines 43655 44470 +815
========================================
+ Hits 41169 41941 +772
- Misses 2486 2529 +43
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/query/phrase_query/exact_phrase_scorer.rs | 93.44% <93.44%> (ø) | |
| src/query/phrase_query/slop_phrase_scorer.rs | 95.41% <95.41%> (ø) | |
| src/query/phrase_query/phrase_weight.rs | 87.25% <99.14%> (+12.00%) | :arrow_up: |
| src/query/phrase_query/mod.rs | 93.04% <100.00%> (+0.12%) | :arrow_up: |
| src/query/phrase_query/phrase_scorer.rs | 100.00% <100.00%> (+11.45%) | :arrow_up: |
| src/schema/flags.rs | 47.36% <0.00%> (-8.89%) | :arrow_down: |
| src/core/searcher.rs | 82.30% <0.00%> (-6.39%) | :arrow_down: |
| src/store/reader.rs | 82.26% <0.00%> (-4.94%) | :arrow_down: |
| src/fastfield/reader.rs | 87.80% <0.00%> (-1.28%) | :arrow_down: |
| ... and 55 more | | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
You may have already read it, but just in case, here is one of the Lucene implementations: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SloppyPhraseMatcher.java
I took a quick look. It seems we're doing things differently, cf. their doc: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SloppyPhraseMatcher.java#L36-L52. They seem to score on the inverse of the largest phrase match size. I haven't looked at the implementation closely enough to fully confirm how their matcher works, but given the comments it does not look like they take multiple matches into account?
In our case, because we count multiple matches, we could score on $\frac{\text{phrase count}}{\text{match length}}$. I think we should be able to modify the implementation to do so. As a quick first approximation: $\text{match length} = \text{phrase count} \times \text{num terms} + \text{used slop}$.
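A rough sketch of what that alternative could look like, under the approximation above. `match_length` and `density_score` are hypothetical names used only for illustration; nothing here exists in tantivy or in this PR, and `used_slop` is again assumed to be the total across all matches.

```rust
/// Hypothetical helper: approximates the total length covered by all
/// phrase matches as phrase_count * num_terms + used_slop.
fn match_length(phrase_count: u32, num_terms: u32, used_slop: u32) -> u32 {
    phrase_count * num_terms + used_slop
}

/// Hypothetical helper: scores on phrase_count / match_length.
/// Assumption: `used_slop` is the total slop consumed across all matches.
fn density_score(phrase_count: u32, num_terms: u32, used_slop: u32) -> f32 {
    let length = match_length(phrase_count, num_terms, used_slop);
    // Guard against an empty phrase or a document with no match.
    if length == 0 {
        return 0.0;
    }
    phrase_count as f32 / length as f32
}
```

With this formula an exact match scores `1 / num_terms` regardless of how many times the phrase occurs, and any used slop lowers the score from there, so it would likely need normalizing before being combined with the existing score.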
@fulmicoton little ping on this one 🙏