tantivy
Better Snippet Scoring
Is your feature request related to a problem? Please describe.

During the discussion in issue #916, @fulmicoton pointed out that the current snippet scoring only takes into account how many times the search terms appear in a fragment. This could be improved.
Describe the solution you'd like

More factors should be taken into account when scoring snippets. Some ways to adjust the scores would be:
- Weight the first appearance of a term in a document more heavily. The first use of a term is more likely to be a definition, or the start of a section that the user is interested in.
- If a snippet contains multiple distinct search terms, it should score higher. For example, say that we have the search terms "flour", "water", and "sugar". The snippet "add the sugar and water to the flour" should have a higher score than "flour, flour, flour, flour everywhere!". Currently the second might have a higher score.
- Currently, terms are weighted as `score = 1.0 / (1.0 + doc_freq as Score);`. This is OK, but I'd like to try weighting by `score = -ln(1.0 / (1.0 + doc_freq as Score));`. This weight would be closer to the IDF, and feels more "information theory"-y (E = -k * ln(p)).
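
The three adjustments above could be sketched roughly as follows. This is only an illustration, not tantivy's actual snippet code: `Params`, `score_fragment`, `flour_example`, and the two knobs `first_occurrence_boost` and `distinct_bonus` are hypothetical names, and the way the factors are combined (a multiplicative bonus per extra distinct term) is an assumption. Note that the proposed expression, exactly as written, simplifies to `ln(1 + doc_freq)`, so it grows with `doc_freq` while the current weight shrinks; a conventional IDF would also need the total document count, which is not part of either formula here.

```rust
use std::collections::{HashMap, HashSet};

type Score = f32;

/// Current weighting quoted above: rarer terms (low doc_freq) weigh more.
fn current_weight(doc_freq: u64) -> Score {
    1.0 / (1.0 + doc_freq as Score)
}

/// Proposed weighting, exactly as written above.
/// Algebraically, -ln(1 / (1 + doc_freq)) == ln(1 + doc_freq).
fn proposed_weight(doc_freq: u64) -> Score {
    -(1.0 / (1.0 + doc_freq as Score)).ln()
}

/// Hypothetical tuning knobs; the values are placeholders to experiment with.
struct Params {
    /// Multiplier applied the first time a term is seen in the document.
    first_occurrence_boost: Score,
    /// Extra fraction of the base score per additional distinct term.
    distinct_bonus: Score,
}

/// Hypothetical fragment scorer. Fragments are assumed to be scored in
/// document order, sharing `seen_in_doc` so first occurrences can be detected.
fn score_fragment<'a>(
    tokens: &[&'a str],
    weights: &HashMap<&str, Score>,
    seen_in_doc: &mut HashSet<&'a str>,
    params: &Params,
) -> Score {
    let mut base = 0.0;
    let mut distinct: HashSet<&str> = HashSet::new();
    for &token in tokens {
        if let Some(&w) = weights.get(token) {
            // `insert` returns true only if the term was not seen before.
            let first = seen_in_doc.insert(token);
            base += if first { w * params.first_occurrence_boost } else { w };
            distinct.insert(token);
        }
    }
    // One distinct term gets no bonus; each further one adds `distinct_bonus`.
    let extra = distinct.len().saturating_sub(1) as Score;
    base * (1.0 + params.distinct_bonus * extra)
}

/// Scores the two example snippets from the list above with equal term
/// weights and no first-occurrence boost, to isolate the distinct-term bonus.
fn flour_example() -> (Score, Score) {
    let weights: HashMap<&str, Score> =
        [("flour", 1.0), ("water", 1.0), ("sugar", 1.0)].into_iter().collect();
    let params = Params { first_occurrence_boost: 1.0, distinct_bonus: 0.5 };

    let mixed = ["add", "the", "sugar", "and", "water", "to", "the", "flour"];
    let repeated = ["flour", "flour", "flour", "flour", "everywhere"];

    let mixed_score = score_fragment(&mixed, &weights, &mut HashSet::new(), &params);
    let repeated_score = score_fragment(&repeated, &weights, &mut HashSet::new(), &params);
    (mixed_score, repeated_score)
}
```

With these placeholder values, the snippet containing all three terms outscores the one that repeats "flour" four times, whereas a plain count of matches would rank them the other way around. How strongly each factor should count is exactly the open question, so the parameter values are meant to be tuned, not taken as recommendations.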
[Optional] describe alternatives you've considered

I'm open to other ideas here, and I don't have a clear idea of how much each of the factors above should affect the score. For now, I'm going to experiment with the parameters in my own project until I find something that feels good.