Adrien Grand
I'll merge once https://github.com/apache/lucene/issues/14630 is resolved.
@mikemccand I'm curious if you're allowed to share how many candidate hits are fetched from Lucene before being fed to rescorers on amazon.com?
> How do you differentiate different encodings in Lucene? Is it stored as extra metadata?

It extends the int8 flag, which previously recorded the number of bits per value. Positive...
> Also, would N take into account if Panama Vector is enabled and if things are quantized or not?

FWIW I'd optimize for simplicity over picking the perfect heuristic. As...
> Maybe we could do something similar here and only build a HNSW graph if doing a top-1000 search would visit less than 1/8th of the documents that have a...
> I would suspect it is somewhere around 10k.

Interestingly, it looks like your intuition roughly aligns with my suggestion if using topK=100. `expectedVisitedNodes(100, 10_000) = 921 ~= 1250 =...
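To make the arithmetic behind that comparison concrete, here is a minimal sketch. It assumes the estimate is `ln(size) * k` truncated to an integer, which reproduces the 921 quoted above; the helper `should_build_graph` and its `fraction` parameter are hypothetical names used only to illustrate the "less than 1/8th of the documents" rule suggested earlier in the thread, not an existing Lucene API.

```python
import math


def expected_visited_nodes(k: int, size: int) -> int:
    """Assumed estimate of nodes visited by a top-k graph search: ln(size) * k."""
    return int(math.log(size) * k)


def should_build_graph(num_vectors: int, k: int = 100, fraction: int = 8) -> bool:
    """Hypothetical rule: build a graph only if a top-k search is expected
    to visit fewer than num_vectors / fraction nodes."""
    return expected_visited_nodes(k, num_vectors) < num_vectors / fraction


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, expected_visited_nodes(100, n), n / 8, should_build_graph(n))
```

Under these assumptions, 10,000 vectors with topK=100 gives 921 expected visits against a threshold of 10,000 / 8 = 1250, so a graph would be built, while 1,000 vectors would fall back to a brute-force scan.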
> allow it to be configurable?? maybe that is too many knobs

I worry about too many knobs too, I'd hardcode it.

> I would expect this to improve indexing...
We have [`StoredFieldsBenchmark`](https://github.com/mikemccand/luceneutil/blob/b3d5216dfd82bb28dcff699f9d904b8b03d8d116/src/extra/perf/StoredFieldsBenchmark.java) to test the impact of NRT (frequent small flushes, frequent small merges) on stored fields; we could write something similar for vectors.
Facets already put the burden of choosing between taxonomy and doc-value-based faceting on users. If we introduce a new approach for faceting, I worry that it would make things even...
Can you clarify which allocation is the problematic one, and where it's done on the indexing path?