Document Score

Open petr-tik opened this issue 6 years ago • 3 comments

I have been asked about the scoring algorithm that tantivy uses and realised that neither I nor the documentation has a canonical description of it, apart from:

The larger the number, the more relevant the document to the search

https://docs.rs/tantivy/0.10.3/tantivy/type.Score.html

I think it would be great to add more information and run through an example query on an index, to show why a query returns its results in a particular order and how a user might debug specific queries.

Who do we expect to read this?

People building a full-text search engine are interested in efficiently storing and ranking documents against queries. The score of each document is arguably THE most important data type that we return to users in every query. I expect most users of tantivy will want to read about the Score type at one point or another.

2 types of users:

  1. someone knowledgeable about building search engines who wants to confirm the validity of tantivy's scoring algorithm - expects to see tf-idf, BM25 and other known algorithms
  2. someone for whom tantivy might be the first experience of building a search application, with little background in document scoring - wants answers to specific questions and some further reading material.

Questions these users want to answer:

  • [ ] Why are search results in this order? What is this score field? Why is it a float?
  • [ ] How does each subquery in the full query (eg. q: "title:president AND (body:Obama OR body:barack) AND year:<2008") contribute to the final score of a document
  • [ ] I want to boost a specific document, or I expected it higher up in the set of results for a given query - how do I do that?
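
To make the second question concrete, here is a minimal, standard-library-only sketch of how per-term scores could add up to a document score. It is not tantivy's code: it assumes a textbook BM25 formula with the commonly used defaults k1 = 1.2 and b = 0.75, and it assumes that an OR over terms is scored as the sum of the matching term scores. The function names (`idf`, `bm25_term_score`, `score_disjunction`) and the index statistics are hypothetical and chosen only for illustration.

```rust
/// Commonly used BM25 defaults; assumed here, not taken from the issue.
const K1: f32 = 1.2;
const B: f32 = 0.75;

/// Inverse document frequency: terms that occur in fewer documents weigh more.
fn idf(doc_count: u64, docs_with_term: u64) -> f32 {
    let n = docs_with_term as f32;
    let total = doc_count as f32;
    (1.0 + (total - n + 0.5) / (n + 0.5)).ln()
}

/// BM25 contribution of a single term to a single document.
fn bm25_term_score(term_freq: u32, doc_len: u32, avg_doc_len: f32, term_idf: f32) -> f32 {
    let tf = term_freq as f32;
    // Length normalisation: longer-than-average documents are penalised.
    let norm = K1 * (1.0 - B + B * doc_len as f32 / avg_doc_len);
    term_idf * tf * (K1 + 1.0) / (tf + norm)
}

/// Hypothetical helper: score an OR of terms as the sum of the term scores.
fn score_disjunction(term_scores: &[f32]) -> f32 {
    term_scores.iter().sum()
}

fn main() {
    // Toy statistics: 1000 documents with an average length of 100 tokens.
    let doc_count = 1000;
    let avg_len = 100.0;

    // One document of 120 tokens matching both terms of body:Obama OR body:barack.
    let obama = bm25_term_score(3, 120, avg_len, idf(doc_count, 10));
    let barack = bm25_term_score(1, 120, avg_len, idf(doc_count, 200));
    let total = score_disjunction(&[obama, barack]);

    println!("obama={obama:.3} barack={barack:.3} total={total:.3}");
}
```

Under these assumptions the score is a float because it is a sum of real-valued weights. A term that is rare in the index and frequent in the document contributes more than a common term that appears once.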

Suggested style of documentation

Prose: a detailed high-level explanation of document scoring - how each query is scored and how the scores of different sub-queries are combined.

Code: a doc-test (it doesn't need to assert/test anything) that walks through an example of debugging an unexpectedly low-ranking document, using Query::explain and showing how the example query can be rewritten.

Provide further reading material

Give links to the tf-idf and BM25 Wikipedia pages and to the Query::explain method

If you do this ticket, you will learn:

  • The full life-cycle of a tantivy query from query to score per document
  • tantivy helper methods for debugging such queries
  • writing concise, yet informative documentation for power-users and amateurs at the same time

petr-tik, Nov 14 '19 00:11

hey @jeffsmith82, Thought you might find this ticket interesting.

I appreciate you may have been busy recently, so let us know if you don't have the bandwidth to do this.

petr-tik, Nov 14 '19 00:11

Uh, was this ever implemented @petr-tik?

safwansamsudeen, Jan 29 '24 09:01

I don't think this is properly documented

PSeitz, Jan 29 '24 11:01