Nouveau - Facet counts should be ordered
Summary
Say I have a faceted field that contains tens of thousands of values. When querying with counts, the resulting values are limited to the first 10, but in no apparent order. Without the ability to select the order, one cannot be sure whether those results are the most relevant ones for the query. It is unreasonable to traverse all facet results (by increasing top_n ad infinitum) in order to find those with the greatest count or to obtain an alphabetically sorted list of facets.
Desired Behaviour
Modify the counts parameter so that, instead of being an array of strings, it becomes an array of objects of the form:
"counts": [
  {
    "field": "subject",
    "sort": "count"
  }
]
Additional context
I've been using and modifying DSpace for some years now, and my main issue with it is its reliance on Solr for full-text searches and faceting. In my experience, Solr is faulty, so I decided to create my own de-bloated version of a digital repository using CouchDB.
Hi, thanks for the report. Will get on this, I agree with you.
Hi, I've reviewed the facet code in nouveau (https://github.com/apache/couchdb/blob/main/nouveau/src/main/java/org/apache/couchdb/nouveau/lucene9/Lucene9Index.java#L346) and it is already returning the "top" children. I've looked at other methods on the Facets class and I think getTopChildren is the right one here. I don't see a way to choose an ordering other than by count.
Hi Robert,
I think the concept of "top" might not mean "top count". Here I query a database with 40k records with the following request:
And the resulting "counts" object is this:
Could it be that the results are being ordered by relevance?
My concern is that, since there is no apparent order, I cannot reliably know if, in my example, "Danza moderna" is the top result for palabras_clave.
I can confirm that "top" here means "top count". The implementation of getTopChildren uses https://lucene.apache.org/core/10_2_2/facet/org/apache/lucene/facet/TopOrdAndIntQueue.html and passes in the counts of the facets collected, keeping up to topN of the children with the highest counts.
Note that this is still the counts of things that matched your query.
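To make "top count" concrete, here is a minimal Python sketch of keeping the topN labels with the highest counts using a bounded min-heap. It only illustrates the idea described above and is not Lucene's code; the function name top_children, the sample labels, and the counts are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def top_children(counts: Dict[str, int], top_n: int) -> List[Tuple[str, int]]:
    """Return up to top_n (label, count) pairs with the highest counts.

    A min-heap of size top_n holds the best candidates seen so far; the
    smallest entry sits at the root and is evicted when a larger one arrives.
    """
    if top_n <= 0:
        return []
    heap: List[Tuple[int, str]] = []
    for label, count in counts.items():
        if len(heap) < top_n:
            heapq.heappush(heap, (count, label))
        elif (count, label) > heap[0]:
            heapq.heapreplace(heap, (count, label))
    # Highest count first; equal counts fall back to comparing labels,
    # which is an arbitrary choice made for this sketch.
    return [(label, count) for count, label in sorted(heap, reverse=True)]


# Hypothetical counts for the documents that matched a query.
matched = {"a": 3, "b": 9, "c": 1, "d": 7}
print(top_children(matched, 2))
```

The input to such a selection is only the set of documents that matched the query, so a label that is frequent in the whole database can still be absent from the result.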
If that is correct, I note that the results are ordered in reverse alphabetical order.
They are objects, so the returned order is not meaningful. I think the apparent reverse alphabetical order is an artifact of the code that merges the facet results from the underlying shards.
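Since the key order of the returned object carries no meaning, a client that needs a specific order can sort the entries itself. The Python sketch below assumes the "counts" object maps each field name to an object of label-to-count pairs; that shape, the helper name sort_facet, and the sample values are assumptions for illustration, not taken from an actual response.

```python
from typing import Dict, List, Tuple


def sort_facet(facet: Dict[str, int], by: str = "count") -> List[Tuple[str, int]]:
    """Sort one field's label-to-count mapping into an explicit order."""
    if by == "count":
        # Highest count first; equal counts fall back to alphabetical order.
        return sorted(facet.items(), key=lambda kv: (-kv[1], kv[0]))
    if by == "label":
        return sorted(facet.items(), key=lambda kv: kv[0])
    raise ValueError(f"unknown sort key: {by!r}")


# Hypothetical response fragment (assumed shape, made-up numbers).
response = {"counts": {"subject": {"music": 4, "dance": 11, "theatre": 4}}}
for label, count in sort_facet(response["counts"]["subject"]):
    print(label, count)
```

This only reorders the values that were returned. It does not produce an alphabetical listing of every facet value, because values outside the top_n limit never reach the client.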
Potentially https://issues.apache.org/jira/browse/LUCENE-10614 is related? We'll be updating to Lucene 10 soon, and perhaps that fixes this issue.