If you have any stack traces from the elasticsearch node, that would also be very helpful.
Hi @mikemccand, thanks for the reply. As a side note, I've found many of your articles very helpful!

> Hmm, why is `Self time` so high in your profiler output?...
Some more notes-to-self for when I get back to this: Here are the VisualVM hotspots from running the SIFT benchmark (1M stored vectors, 10k total queries) on a local ES...
Thanks again for digging into this a bit.

> The countHits method looks fine, though you should not iterate to docs.cost()

Good to know, I'll fix that.

> You might...
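
Note-to-self on the countHits fix: the point, as I understand it, is to consume each postings list until `DocIdSetIterator.NO_MORE_DOCS` rather than looping up to `docs.cost()`, since `cost()` is only an estimate. A minimal sketch of that loop (field name, counter type, etc. are placeholders, not the actual Elastiknn code):

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

class CountHitsSketch {
  // For every doc in the segment, count how many of the query's hash terms it matches.
  // Each postings list is consumed until NO_MORE_DOCS; cost() is never used as a bound.
  static short[] countHits(LeafReader reader, String field, BytesRef[] queryTerms) throws IOException {
    short[] counts = new short[reader.maxDoc()];
    Terms terms = reader.terms(field);
    if (terms == null) return counts;
    TermsEnum termsEnum = terms.iterator();
    PostingsEnum postings = null;
    for (BytesRef term : queryTerms) {
      if (termsEnum.seekExact(term)) {
        postings = termsEnum.postings(postings, PostingsEnum.NONE);
        for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
          counts[doc]++;
        }
      }
    }
    return counts;
  }
}
```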
Maybe there's a clever stopping criterion to avoid visiting all of the terms? I started reading about MaxScore and WAND scoring. Maybe that's a dead end here?
Put together a working early-stopping solution today. Roughly the idea is:

- Compute the total number of term hits up front by iterating over the terms enum once.
- Iterate...
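
The first step, computing the total number of term hits up front, is just one pass over the terms enum summing `docFreq` for the query's terms. Rough sketch (names are illustrative):

```java
import java.io.IOException;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class TotalHitsSketch {
  // One pass over the terms enum: the total number of term hits for this query
  // is the sum of docFreq over the query's hash terms.
  static int totalTermHits(TermsEnum termsEnum, BytesRef[] queryTerms) throws IOException {
    int total = 0;
    for (BytesRef term : queryTerms) {
      if (termsEnum.seekExact(term)) {
        total += termsEnum.docFreq();
      }
    }
    return total;
  }
}
```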
> That sounds promising! Do you take the docFreq (or maybe totalTermFreq) of terms into account? E.g., collecting all term + docFreq from the TermsEnum, then sort them in increasing...
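
If I'm reading the suggestion right, it would look roughly like this: collect a (term, docFreq) pair for each query term in one pass over the `TermsEnum`, then sort ascending by docFreq so the rarest, most selective terms are processed first. Sketch only, with names of my own choosing:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class SortTermsByDocFreq {
  // A query term paired with its docFreq from the terms enum.
  static final class TermAndFreq {
    final BytesRef term;
    final int docFreq;
    TermAndFreq(BytesRef term, int docFreq) { this.term = term; this.docFreq = docFreq; }
  }

  // Collect docFreq for each query term, then sort ascending so the rarest
  // (most selective) terms come first.
  static List<TermAndFreq> sortedByDocFreq(TermsEnum termsEnum, BytesRef[] queryTerms) throws IOException {
    List<TermAndFreq> collected = new ArrayList<>();
    for (BytesRef term : queryTerms) {
      if (termsEnum.seekExact(term)) {
        collected.add(new TermAndFreq(BytesRef.deepCopyOf(term), termsEnum.docFreq()));
      }
    }
    collected.sort(Comparator.comparingInt((TermAndFreq tf) -> tf.docFreq));
    return collected;
  }
}
```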
I'm afraid the early-stopping method as I described it isn't going to work. Specifically, it's pretty easy to find a case where a single vector matches for multiple consecutive hash...
> OK, sigh :)
>
> I still think you should explore indexing "sets of commonly co-occurring hashes" if possible. If your query-time vectors are such that the same sets...
I'm trying to make precise what advantage indexing co-occurring hashes would have. If we assume that Lucene's retrieval speed is maxed out, the next obvious speedup is to somehow decrease...
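
To keep a concrete picture of the co-occurring-hashes idea in mind (the grouping and term encoding below are my own assumptions, not a worked-out scheme): if a set of hashes is known to commonly co-occur, it could be indexed as a single combined term, so a query producing that whole set would traverse one postings list for the combined term rather than one per individual hash.

```java
import java.util.Arrays;

class CoOccurringHashTerm {
  // Illustration only: encode a set of co-occurring hash values as one indexed term.
  // A query that produces the same set then traverses a single postings list for the
  // combined term instead of one postings list per individual hash. How the
  // co-occurring sets would be chosen is the open question.
  static String combinedTerm(int[] coOccurringHashes) {
    int[] sorted = coOccurringHashes.clone();
    Arrays.sort(sorted); // order-independent encoding of the set
    return Arrays.toString(sorted);
  }

  public static void main(String[] args) {
    // e.g. hashes 7, 42, and 99 tend to show up together across query vectors
    System.out.println(combinedTerm(new int[] {99, 7, 42})); // -> [7, 42, 99]
  }
}
```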