tantivy

Phrase query: scale score on used slop

Open saroh opened this issue 2 years ago • 4 comments

A few points to address:

  • the actual scoring function:
    • I've gone for ~~sloppy maths~~ a heuristic: https://github.com/quickwit-oss/tantivy/pull/1414/files#diff-6778ebdb19b3c516dabd923da7eff56076630ce5e4cd7804f28dc43265751661R139-R149. The score is scaled by $1 - \frac{\text{used slop}}{(\text{slop}+1) \cdot \text{phrase count}}$, which guarantees a scaling factor in $]0;1]$.
    • there are probably a lot of other options. I haven't looked into the BM25 implementation, for instance. The intersection left after phrase_match runs is the vector of matching positions, which could also be of use.
  • the implementation:
    • the first commit is a quick hack/POC
    • the second one is an attempt at a cleaner implementation: it creates a PhraseScorer trait as well as two implementations, ExactPhraseScorer and SlopPhraseScorer
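To make the heuristic concrete, here is a minimal Rust sketch of the scaling factor and of how a trait with an exact and a sloppy implementation could expose it. This is not the code from the PR: the `scale` method, the `slop_scale` helper, and the struct fields are hypothetical, and it assumes that `used_slop` is the total slop consumed across all matches in a document, so that `used_slop <= slop * phrase_count`.

```rust
/// Scaling factor 1 - used_slop / ((slop + 1) * phrase_count).
/// Assumption: used_slop <= slop * phrase_count, which keeps the result in ]0, 1].
fn slop_scale(used_slop: u32, slop: u32, phrase_count: u32) -> f32 {
    // Clamp phrase_count to 1 so a (never scored) zero-match input cannot divide by zero.
    let denom = (slop as f32 + 1.0) * phrase_count.max(1) as f32;
    1.0 - used_slop as f32 / denom
}

/// Hypothetical interface, only meant to illustrate the trait/implementation split.
trait PhraseScorer {
    /// Factor applied to the base score of a matching document.
    fn scale(&self, phrase_count: u32, used_slop: u32) -> f32;
}

/// Exact phrase matching never uses slop, so the score is left untouched.
struct ExactPhraseScorer;

/// Sloppy phrase matching scales the score down as more slop is used.
struct SlopPhraseScorer {
    slop: u32,
}

impl PhraseScorer for ExactPhraseScorer {
    fn scale(&self, _phrase_count: u32, _used_slop: u32) -> f32 {
        1.0
    }
}

impl PhraseScorer for SlopPhraseScorer {
    fn scale(&self, phrase_count: u32, used_slop: u32) -> f32 {
        slop_scale(used_slop, self.slop, phrase_count)
    }
}
```

With `slop = 2`, a single match that uses no slop keeps a factor of 1, while a single match that uses both positions of slop gets $1 - \frac{2}{3} = \frac{1}{3}$. The `+1` in the denominator is what keeps the factor strictly positive when all the allowed slop is used.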

closes #1392

saroh · Jul 12 '22 19:07

Codecov Report

Merging #1414 (592f3c1) into main (c127367) will increase coverage by 0.00%. The diff coverage is 95.74%.

@@           Coverage Diff            @@
##             main    #1414    +/-   ##
========================================
  Coverage   94.30%   94.31%            
========================================
  Files         236      239     +3     
  Lines       43655    44470   +815     
========================================
+ Hits        41169    41941   +772     
- Misses       2486     2529    +43     
| Impacted Files | Coverage Δ |
|---|---|
| src/query/phrase_query/exact_phrase_scorer.rs | 93.44% <93.44%> (ø) |
| src/query/phrase_query/slop_phrase_scorer.rs | 95.41% <95.41%> (ø) |
| src/query/phrase_query/phrase_weight.rs | 87.25% <99.14%> (+12.00%) :arrow_up: |
| src/query/phrase_query/mod.rs | 93.04% <100.00%> (+0.12%) :arrow_up: |
| src/query/phrase_query/phrase_scorer.rs | 100.00% <100.00%> (+11.45%) :arrow_up: |
| src/schema/flags.rs | 47.36% <0.00%> (-8.89%) :arrow_down: |
| src/core/searcher.rs | 82.30% <0.00%> (-6.39%) :arrow_down: |
| src/store/reader.rs | 82.26% <0.00%> (-4.94%) :arrow_down: |
| src/fastfield/reader.rs | 87.80% <0.00%> (-1.28%) :arrow_down: |

... and 55 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update c127367...592f3c1.

codecov-commenter · Jul 12 '22 21:07

You may have already read it but, just in case, here is one of the Lucene implementations: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SloppyPhraseMatcher.java

fmassot · Jul 26 '22 21:07

> You may have already read it but, just in case, here is one of the Lucene implementations: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SloppyPhraseMatcher.java

I had a quick look. It seems we're doing things differently, cf. their doc: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SloppyPhraseMatcher.java#L36-L52. They seem to score on the inverse of the largest phrase match size. I haven't looked at the implementation closely enough to be 100% sure how their matcher works, but judging from the comments it does not look like they take multiple matches into account?

In our case, because we count multiple matches, we could score on $\frac{\text{phrase count}}{\text{match length}}$. I think we should be able to modify the implementation to do so. As a quick first pass: $\text{match length} = \text{phrase count} \cdot \text{num terms} + \text{used slop}$.
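As a rough illustration of this alternative, the following Rust sketch computes the match length and the resulting score. The function names and signatures are hypothetical and are not taken from the PR; `used_slop` is again assumed to be the total slop consumed across all matches in the document.

```rust
/// match_length = phrase_count * num_terms + used_slop
fn match_length(phrase_count: u32, num_terms: u32, used_slop: u32) -> u32 {
    phrase_count * num_terms + used_slop
}

/// Alternative score: phrase_count / match_length.
/// Returns 0.0 when there is nothing to score, to avoid dividing by zero.
fn length_score(phrase_count: u32, num_terms: u32, used_slop: u32) -> f32 {
    let len = match_length(phrase_count, num_terms, used_slop);
    if len == 0 {
        return 0.0;
    }
    phrase_count as f32 / len as f32
}
```

One property worth noting: with this definition an exact match (no slop used) scores $\frac{1}{\text{num terms}}$ rather than 1, so longer phrases start from a lower value. Multiplying by `num_terms` would bring exact matches back to 1 if that is the desired behaviour.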

saroh · Jul 27 '22 21:07

@fulmicoton little ping on this one 🙏

saroh · Aug 26 '22 19:08