
Discussion: faceting and pagination in multistage ranking

Open lintool opened this issue 4 years ago • 0 comments

Here are some of my thoughts about faceting and pagination in multistage ranking.

tl;dr - it's not clear to me what the "correct" implementation is... see details below for full discussion.

The standard mental model of faceting is as a slice of the entire collection, i.e., how many documents contain that facet. This is the intuitive user expectation, and works "as expected" with pagination, i.e., when the user clicks "next page", the search engine fetches the next page of results that contain the facet, until we run out of results. This is exactly what Blacklight does.

However, when we move to a reranking architecture, it's a bit unclear what the system should do. The simplest implementation would be to provide faceted browsing on the initial candidate list. That is, the initial retrieval returns 1k hits, the system reranks them, and faceted browsing operates on those 1k hits.
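To make that simplest option concrete, here is a minimal sketch. The hit layout (a dict with `doc_id`, `score`, and a facet field such as `year`) and the function name `rerank_and_facet` are assumptions for illustration, not the actual Covidex code.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical hit representation: a plain dict of field name -> value.
Hit = Dict[str, object]


def rerank_and_facet(
    candidates: List[Hit],
    score_fn: Callable[[Hit], float],
    facet_field: str,
) -> Tuple[List[Hit], Counter]:
    """Rerank a fixed candidate list and count facet values within it."""
    # Second stage: reorder the first-stage candidates by the reranker score.
    reranked = sorted(candidates, key=score_fn, reverse=True)
    # Facet counts cover only these candidates, not the whole collection.
    counts = Counter(hit[facet_field] for hit in reranked if facet_field in hit)
    return reranked, counts
```

Note that every count is capped at the size of the candidate list, so the counts describe the 1k candidates rather than the collection.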

This is fine, but problematic from both the perspective of faceted browsing and pagination:

  • From the perspective of pagination, users are accustomed to seeing results in fixed-size pages: the first 10 hits, the next 10 hits, etc. Since we're already retrieving and reranking the top 1k hits, it makes no sense for us to paginate within them. But what happens if the user wants more hits? Do we retrieve the next 1k raw hits and then rerank those? This has the downside that each results page would contain a different number of hits. Also, under the paragraph condition, we'd have to dedup with respect to previous hits, which means keeping track of state, which means a more complex implementation, etc.

  • From the perspective of faceting, the implementation outlined above diverges from user expectations. Say we facet only on reranked results - and thus the interface shows only the matching hits. The user scrolls to the bottom and wants more hits. Obviously, we can go back and fetch more hits (with all the complexities above), but then the facet count isn't accurate...
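The "fetch the next batch and rerank it" option from the bullets above might look roughly like the following sketch. Everything here is hypothetical: the `DeepPager` class, the `retrieve(offset, size)` and `rerank(hits)` callables, and the `doc_id` field are stand-ins, and deduplication simply keeps the first hit seen per document.

```python
from typing import Callable, Dict, List, Set

# Hypothetical hit representation: a plain dict of field name -> value.
Hit = Dict[str, object]


class DeepPager:
    """Fetches successive raw batches, dedups across batches, reranks each."""

    def __init__(
        self,
        retrieve: Callable[[int, int], List[Hit]],
        rerank: Callable[[List[Hit]], List[Hit]],
        batch_size: int = 1000,
    ) -> None:
        self.retrieve = retrieve
        self.rerank = rerank
        self.batch_size = batch_size
        self.offset = 0             # state: position in the raw hit list
        self.seen: Set[str] = set()  # state: doc ids already shown

    def next_page(self) -> List[Hit]:
        raw = self.retrieve(self.offset, self.batch_size)
        self.offset += len(raw)
        fresh = []
        for hit in raw:
            doc_id = str(hit["doc_id"])
            # Drop documents already returned (e.g., via another paragraph).
            if doc_id not in self.seen:
                self.seen.add(doc_id)
                fresh.append(hit)
        # Only this batch is reranked; earlier pages are not revisited.
        return self.rerank(fresh)
```

Both problems show up directly in the sketch: the pager has to carry `offset` and `seen` between requests, and the length of each returned page depends on how many duplicates were dropped. Facet counts computed on an earlier page would also go stale as soon as `next_page` returns new matching hits.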

These are important considerations, since systematic reviews, one of the use cases for our system, need high recall and thus may require going deep into hit lists. For example, this metareview examines over 1300 articles. Faceted browsing, I imagine, would be helpful for systematic reviews also. We don't have an RCT facet right now, but if we had one, I think it'd be used quite a bit.

So, it's not clear to me what the "correct" implementation is...

Thoughts?

lintool, Apr 24 '20 13:04