tantivy
Document Score
I have been asked about the scoring algorithm that tantivy uses and realised that neither I nor the documentation has a canonical description of it apart from:
The larger the number, the more relevant the document to the search
https://docs.rs/tantivy/0.10.3/tantivy/type.Score.html
I think it will be great to add more information and run through an example query on an index to show why queries return results in that order and how a user might debug specific queries.
Who do we expect to read this?
People building a full-text search engine are interested in efficiently storing and ranking documents against queries. The score of each document is arguably THE most important data type that we return to users in every query. I expect most users of tantivy will want to read about the Score type at one point or another.
Two types of users:
- someone knowledgeable about building search engines who wants to confirm the validity of tantivy's scoring algorithm; they expect to see tf-idf, BM25 and other known scoring methods
- someone for whom tantivy might be the first experience of building a search application, with little background in document scoring; they want answers to specific questions and some further reading material.
Questions these users want to answer:
- [ ] Why are search results in this order? What is this score field? Why is it a float?
- [ ] How does each subquery in the full query (e.g. q: "title:president AND (body:Obama OR body:barack) AND year:<2008") contribute to the final score of a document?
- [ ] I want to boost a specific document, or I expected it to appear higher up in the results for a given query. How do I do that?
Suggested style of documentation
Prose: A detailed high-level explanation for document scoring - how is each query scored, how are scores of different sub-queries combined.
Code: a doc-test (it doesn't need to assert or test anything) that walks through an example of debugging an unexpectedly low-ranking document, using Query::explain and showing how the example query can be rewritten.
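To give a feel for what the prose section could explain, here is a minimal stdlib-only sketch of how a float score might arise from per-term statistics. It assumes Okapi BM25 with the common defaults k1 = 1.2 and b = 0.75, and it assumes that an OR query sums the scores of its matching terms. Neither assumption is taken from tantivy's source, and `TermStats`, `bm25_term_score` and `score_or_query` are hypothetical names, not tantivy APIs.

```rust
/// Hypothetical per-term statistics for one document (not a tantivy type).
struct TermStats {
    /// How many times the term occurs in this document.
    term_freq: f32,
    /// How many documents in the index contain the term.
    doc_freq: f32,
}

// Common BM25 defaults; assumed here, not read from tantivy.
const K1: f32 = 1.2;
const B: f32 = 0.75;

/// Inverse document frequency: rare terms get a larger weight.
fn idf(doc_freq: f32, total_docs: f32) -> f32 {
    (1.0 + (total_docs - doc_freq + 0.5) / (doc_freq + 0.5)).ln()
}

/// BM25 score contributed by a single term in a single document.
fn bm25_term_score(stats: &TermStats, doc_len: f32, avg_doc_len: f32, total_docs: f32) -> f32 {
    // Length normalisation: longer-than-average documents are penalised.
    let norm = K1 * (1.0 - B + B * doc_len / avg_doc_len);
    idf(stats.doc_freq, total_docs) * stats.term_freq * (K1 + 1.0) / (stats.term_freq + norm)
}

/// Assumed combination rule for an OR query: sum over the terms that matched.
fn score_or_query(matching: &[TermStats], doc_len: f32, avg_doc_len: f32, total_docs: f32) -> f32 {
    matching
        .iter()
        .map(|t| bm25_term_score(t, doc_len, avg_doc_len, total_docs))
        .sum()
}

fn main() {
    // A document matching both a rare term and a common term.
    let rare = TermStats { term_freq: 1.0, doc_freq: 1.0 };
    let common = TermStats { term_freq: 1.0, doc_freq: 90.0 };
    let (doc_len, avg_doc_len, total_docs) = (100.0, 100.0, 100.0);

    println!("rare term:   {}", bm25_term_score(&rare, doc_len, avg_doc_len, total_docs));
    println!("common term: {}", bm25_term_score(&common, doc_len, avg_doc_len, total_docs));
    println!("OR query:    {}", score_or_query(&[rare, common], doc_len, avg_doc_len, total_docs));
}
```

Under these assumptions the score is a float because it is built from logarithms and ratios, and a document ranks higher when it matches rarer terms, matches them more often, or is shorter than average.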
Provide further reading material
Give links to the tf-idf and BM25 Wikipedia pages and to the Query::explain method.
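For reference, the further reading could be anchored by the Okapi BM25 formula as it is usually stated. Whether tantivy uses exactly this variant and these parameter values is an assumption that the documentation should confirm. Here $f(t, D)$ is the frequency of term $t$ in document $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents, $n(t)$ is the number of documents containing $t$, and $k_1$ and $b$ are tuning parameters (commonly $k_1 \approx 1.2$, $b = 0.75$).

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)
```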
If you do this ticket, you will learn:
- The full life-cycle of a tantivy query from query to score per document
- tantivy helper methods for debugging such queries
- writing concise, yet informative documentation for power-users and amateurs at the same time
hey @jeffsmith82, thought you might find this ticket interesting.
I appreciate you may have been busy recently, so let us know if you have little bandwidth to do this.
Uh, was this ever implemented @petr-tik?
I don't think this is properly documented