Promote sandbox facets to the main facets module
Last year we created functionality to compute facets during hit collection. As the functionality is experimental, we added it to the sandbox module.
We want to move the code from the sandbox to the main facets module, as a separate package.
This is a "parent" issue to discuss / track missing pieces before the promotion can happen, as well as the promotion itself. To make it easier to discuss subtasks, I've created a google doc. The doc consists of 3 parts: promotion from the sandbox, making it feature complete, and follow up tasks. This github issue is for the first part only.
Facets already puts the burden of choosing between taxonomy-based and doc-value-based faceting on users. If we introduce a new approach for faceting, I worry that it would make things even worse: if a user wants to compute facets in their application, what should they use?
I personally like the new faceting approach better: in particular, it doesn't use O(maxDoc) heap to store a bit set, and it allows collectors to give feedback to the query about which docs they care about (LeafCollector#competitiveIterator()); see the sketch below. But I'm also not as familiar with the faceting module as @gsmiller or @mikemccand, so I'm curious: is there consensus that this new approach to faceting should eventually replace the existing one, or do we expect both to keep evolving and serve different purposes?
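For context, here's a minimal sketch of what that feedback mechanism looks like. `LeafCollector#competitiveIterator()` is the real extension point; the wrapper class and the `interestingDocs` iterator are hypothetical names for illustration:

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;

// Hypothetical wrapper: delegates collection, but tells the query which docs
// are still interesting so it can skip the rest.
class FilteringLeafCollector implements LeafCollector {
  private final LeafCollector in;
  private final DocIdSetIterator interestingDocs;

  FilteringLeafCollector(LeafCollector in, DocIdSetIterator interestingDocs) {
    this.in = in;
    this.interestingDocs = interestingDocs;
  }

  @Override
  public void setScorer(Scorable scorer) throws IOException {
    in.setScorer(scorer);
  }

  @Override
  public void collect(int doc) throws IOException {
    in.collect(doc);
  }

  @Override
  public DocIdSetIterator competitiveIterator() {
    // The query is allowed to skip any doc this iterator does not return.
    return interestingDocs;
  }
}
```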
In my opinion, the new approach can eventually do everything the current approach does, but there are quite a few gaps to cover; see Milestone 2 in the plan document. Whether or not we want to deprecate the old functionality after that is a good question. The only benefit of pre-collecting into docID sets that I know of is that, in theory, a user can do something like find the top 1 book author (with taxonomy facets) and then count docs for price ranges across that author's matching books by reusing the docID set + fastMatchQuery; a sketch of this pattern follows below. I don't know if anyone actually does that. Also, we can implement similar functionality for the new approach by making it compatible with pre-collected docID sets; I've just added that task to Milestone 2.
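For reference, the pattern above looks roughly like this with the existing facets module. This is only a sketch: the `"author"`/`"price"` fields and the surrounding method are made up, but `FacetsCollector`, `FastTaxonomyFacetCounts`, `DrillDownQuery`, and the `LongRangeFacetCounts` constructor that takes a `fastMatchQuery` are existing APIs:

```java
import java.io.IOException;
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.range.LongRange;
import org.apache.lucene.facet.range.LongRangeFacetCounts;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.LongValuesSource;
import org.apache.lucene.search.Query;

class ReusePreCollectedDocsExample {
  static void topAuthorThenPriceRanges(
      IndexSearcher searcher, TaxonomyReader taxoReader, FacetsConfig config, Query query)
      throws IOException {
    // Pre-collect the matching docs once into per-segment docID sets.
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, query, 10, fc);

    // Step 1: find the top 1 book author with taxonomy facets.
    Facets authors = new FastTaxonomyFacetCounts(taxoReader, config, fc);
    FacetResult topAuthors = authors.getTopChildren(1, "author");
    String topAuthor = topAuthors.labelValues[0].label;

    // Step 2: count price ranges for that author's matching books by reusing
    // the pre-collected docID sets plus a fastMatchQuery (the author
    // drill-down), instead of re-executing the search.
    DrillDownQuery authorOnly = new DrillDownQuery(config);
    authorOnly.add("author", topAuthor);
    Facets prices =
        new LongRangeFacetCounts(
            "price",
            LongValuesSource.fromLongField("price"),
            fc,
            authorOnly,
            new LongRange("under 10", 0, true, 10, false),
            new LongRange("10 or more", 10, true, Long.MAX_VALUE, true));
    FacetResult priceCounts = prices.getTopChildren(10, "price");
  }
}
```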
The other potential concern is performance. While the new approach seems more efficient in general, since it doesn't require intermediate docID sets, there are some cases where the old approach is faster, e.g. taxonomy counting for a MatchAllDocs query on a facet index field that is responsible for creating the majority of taxonomy facet labels; see luceneutil #325 for details. That said, I think we can find a way to optimize CountFacetRecorder for dense counting; a sketch of the idea follows below. Another example: the new approach's implementation of facet counts over long values is also very inefficient, although Milestone 0 has an idea to try that could make it faster.
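To illustrate the dense-counting idea (this is not CountFacetRecorder's actual implementation, just a hypothetical sketch): when most ordinals get hit, as with a MatchAllDocs query, incrementing a plain `int[]` indexed by ordinal is much cheaper than updating a sparse map, so a recorder could start sparse and upgrade once it has seen enough distinct ordinals:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical adaptive counter: starts with a sparse map (cheap for selective
// queries) and upgrades to a dense int[maxOrd] array once the number of
// distinct ordinals suggests dense counting will pay off.
class AdaptiveOrdinalCounter {
  private final int maxOrd;
  private Map<Integer, Integer> sparse = new HashMap<>();
  private int[] dense; // null until we upgrade

  AdaptiveOrdinalCounter(int maxOrd) {
    this.maxOrd = maxOrd;
  }

  void increment(int ord) {
    if (dense != null) {
      dense[ord]++; // dense path: one array write, no boxing or hashing
      return;
    }
    sparse.merge(ord, 1, Integer::sum);
    if (sparse.size() > maxOrd / 16) { // arbitrary threshold for this sketch
      upgradeToDense();
    }
  }

  private void upgradeToDense() {
    dense = new int[maxOrd];
    for (Map.Entry<Integer, Integer> e : sparse.entrySet()) {
      dense[e.getKey()] = e.getValue();
    }
    sparse = null; // free the map; all future increments hit the array
  }

  int count(int ord) {
    return dense != null ? dense[ord] : sparse.getOrDefault(ord, 0);
  }
}
```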
To summarize: I think the new approach can eventually replace the old one, but it will take time.