
Optimize top-k counting for approximate queries

Open · alexklibisz opened this issue Sep 20 '20 · 35 comments

Currently the biggest bottleneck at query time is the countHits method in MatchHashesAndScoreQuery. This counts the number of times each doc in the segment matches one of the query vector's hashes. https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73

AFAIK, Lucene is generally optimized for a small number of terms (e.g. the words in a search query). Elastiknn, on the other hand, can require retrieving doc IDs for tens or hundreds of terms (the hashes of a query vector).

It seems the main thing worth exploring is using a different PostingsFormat, or potentially implementing a custom one. Maybe there's a way to optimize the storage/access pattern for checking a larger number of terms? Maybe there's a way to simply wrap an existing postings format and offer some helpers that cache the expensive method calls?

Some specific numbers:

When running the MatchHashesAndScoreQueryPerformanceSuite, with 100k indexed vectors and 5k search vectors, the VisualVM sampler reports spending ~92% of its time in the countHits method:

[VisualVM screenshot: countHits hotspots]

When running the ContinuousBenchmark with the SIFT dataset (1M indexed vectors, 1k search vectors), the VisualVM sampler reports spending ~36% of the search thread time in countHits:

[VisualVM screenshot: search thread hotspots]

alexklibisz avatar Sep 20 '20 16:09 alexklibisz

Hmm, why is Self time so high in your profiler output? What is countHits actually doing?

Is there any way to cluster multiple hashes into a single term during indexing? I.e. do certain subsets of hashes frequently co-occur at search time?

mikemccand avatar Sep 21 '20 12:09 mikemccand

Hi @mikemccand, thanks for the reply. As a side note, I've found many of your articles very helpful!

Hmm, why is Self time so high in your profiler output? What is countHits actually doing?

To clarify a bit, the first screenshot is of a benchmark that uses Lucene by itself, i.e. no Elasticsearch involved. The second is from inside the search thread in the Elasticsearch JVM. The self time in the first case is proportionally higher, but in both cases it's > 50% of the method runtime.

The countHits method is here: https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73

The rough outline of countHits is (a simplified sketch follows the list):

  • Get an iterator for the terms at the field. There is a single field where the terms (hashes of a vector) are indexed.
  • Instantiate an ArrayHitCounter. This is basically allocating a short[] with one entry for every doc in the segment. In this benchmark there are 1M docs in the segment. Each time a doc matches one of the hashes, increment the value at its docId in that array. This is in lieu of using a hashmap (docId -> count), which I found to be at least twice as slow as using an array, even when using primitive collections. More on that here: #156
  • Iterate over the query vector hashes.
  • For each hash, seek it in the terms iterator, retrieve its postings (docs that match that term).
  • For each posting, call .increment(docId), which increments the array-based counter like described above.
  • After those two loops, return the counter.
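
In code, the loop is roughly this (a simplified sketch, not the actual implementation; the plain short[] here stands in for the ArrayHitCounter):

```java
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;

class CountHitsSketch {
    // For each query hash, seek the term and bump a per-doc counter for every matching doc.
    static short[] countHits(LeafReader reader, String field, BytesRef[] queryHashes) throws IOException {
        short[] counts = new short[reader.maxDoc()]; // the "ArrayHitCounter": one slot per doc in the segment
        Terms terms = reader.terms(field);
        if (terms == null) return counts;
        TermsEnum termsEnum = terms.iterator();
        PostingsEnum postings = null; // reused across terms
        for (BytesRef hash : queryHashes) {
            if (termsEnum.seekExact(hash)) {
                postings = termsEnum.postings(postings, PostingsEnum.NONE);
                for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
                    counts[doc]++; // increment(docId)
                }
            }
        }
        return counts;
    }
}
```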

Perhaps it's taking a while to allocate the array-backed counter? I don't have a good sense for whether that should actually take such a significant amount of time. Even if it does, the only alternatives I can think of are even slower. I can try it out later with a dummy counter implementation that just throws away the counts to get a lower bound. Would appreciate any more input you have about this and #156 !

Is there any way to cluster multiple hashes into a single term during indexing? I.e. do certain subsets of hashes frequently co-occur at search time?

That would be a smart optimization, but I'm not sure there's a way to do that without having the entire dataset up-front, which doesn't fit the model of the plugin (index/update/delete vectors just like you would index/update/delete traditional docs).

alexklibisz avatar Sep 21 '20 15:09 alexklibisz

Some more notes-to-self for when I get back to this:

Here are the VisualVM hotspots from running the SIFT benchmark (1M stored vectors, 10k total queries) on a local ES JVM using the ArrayHitCounter:

[VisualVM screenshot: hotspots with ArrayHitCounter]

Here are the VisualVM hotspots from the same configuration, but using a dummy counter that just returns the same count and maintains no internal state (i.e. no memory allocated/updated for counting):

[VisualVM screenshot: hotspots with dummy counter]

I made sure to start the sampling right when it started running queries and stop right after it finished queries. Otherwise the percentages get skewed as the search thread continues running.

alexklibisz avatar Sep 22 '20 14:09 alexklibisz

Hi @mikemccand, thanks for the reply. As a side note, I've found many of your articles very helpful!

Thanks, I am glad to hear that :)

The countHits method looks fine, though you should not iterate to docs.cost() but rather iterate until you see NO_MORE_DOCS returned. In general the cost method is not guaranteed to be a precise count of how many hits this PostingsEnum will return, though for the default Codec that appears to be the case for now.

You are reusing your TermsEnum and PostingsEnum which is great. You might want to pre-sort the Hash in unicode order ... it might give a tiny improvement since the TermsEnum can share more internal state on each seekExact.

Your countHits method is very similar to what Lucene's MultiTermQuery.rewrite does when there are many terms for the non-scoring case, except that method accumulates into a Lucene bitset rather than int[] since the Lucene Query is just concerned with knowing which documents matched, not how many terms matched each hit.

I can try it out later with a dummy counter implementation that just throws away the counts to get a lower bound.

Measuring just the iteration and no count incrementing is a good idea. Oh I see, you just concurrently did that, as I am here commenting, heh. Oddly, it looks like Self time increased with the dummy counter (28.7% to 31.4%)? And the overall time dropped a bit ... this is what I would expect: Lucene is working hard to iterate through all those matched docids.

That would be a smart optimization, but I'm not sure there's a way to do that without having the entire dataset up-front, which doesn't fit the model of the plugin (index/update/delete vectors just like you would index/update/delete traditional docs).

Yeah, it is not simple to implement, and would require that you have some way to tell which hashes frequently co-occur during indexing. It would also increase your index size, but then make queries faster. You might also index larger and larger co-occurrence sets, so that at search time you could "factor out" the largest set(s) that can apply to that query.

How large is each vector hash?

mikemccand avatar Sep 22 '20 14:09 mikemccand

Thanks again for digging into this a bit.

The countHits method looks fine, though you should not iterate to docs.cost()

Good to know, I'll fix that.

You might want to pre-sort the Hash in unicode order ... it might give a tiny improvement since the TermsEnum can share more internal state on each seekExact.

I might already be doing this in the query constructor, here. The comparator just uses Java's Arrays.compareUnsigned, which, IIRC, is what Lucene does for BytesRefs. Is that what you are suggesting or something different?
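
For reference, the pre-sort amounts to roughly this (just a sketch; Arrays.compareUnsigned on byte[] needs Java 9+):

```java
import java.util.Arrays;

class SortHashesSketch {
    // Sort the query hashes in unsigned-byte order (Lucene's term order) before seeking,
    // so consecutive seekExact calls always move forward through the terms dictionary.
    static void sortForSeeking(byte[][] hashes) {
        Arrays.sort(hashes, Arrays::compareUnsigned);
    }
}
```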

Your countHits method is very similar to what Lucene's MultiTermQuery.rewrite does when there are many terms for the non-scoring case, except that method accumulates into a Lucene bitset rather than int[] since the Lucene Query is just concerned with knowing which documents matched, not how many terms matched each hit.

I modeled the implementation after the TermsInSetQuery, which, similarly, only tracks whether a doc contained a matching term, not the actual count. It seems like the main difference between this custom query and prior art in Lucene, and probably the main bottleneck, is that I actually need to know how many matches there are in each doc. If you have any suggestions/tricks for very fast counting on the JVM, I'm all ears. As mentioned before, hashmaps, even from primitive collection libraries, seem to be about 2x slower.

How large is each vector hash?

With this benchmark config, there are 100 hashes per vector, 20 bytes each. I'd say that's a pretty typical setup.

alexklibisz avatar Sep 22 '20 15:09 alexklibisz

Maybe there's a clever stopping criterion that avoids visiting all of the terms? I started reading about MaxScore and WAND scoring. Maybe that's a dead end here?

alexklibisz avatar Sep 23 '20 02:09 alexklibisz

Put together a working early-stopping solution today. Roughly the idea is (a code sketch follows this list):

  • Compute the total number of term hits up front by iterating over the terms enum once.
  • Iterate over the terms and docs for each term again.
  • Use a counter to track the hits for each doc and a heap to track the top k hit counts.
  • For every doc, decrement the total number of term hits by the number of hits in that doc.
  • Stop iterating when the kth largest count is larger than the remaining number of term hits. Even if all of the remaining hits were for a single doc, it still couldn't exceed the kth largest.
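
In code, the idea as described looks roughly like this (illustrative sketch using a plain PriorityQueue<Integer>, not the actual implementation; names are made up):

```java
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.PriorityQueue;

class EarlyStoppingSketch {
    // Count hits per doc, stopping once the k-th largest count seen so far
    // can no longer be beaten by the term hits that remain.
    static short[] countHitsWithEarlyStop(LeafReader reader, String field,
                                          BytesRef[] sortedHashes, int k) throws IOException {
        short[] counts = new short[reader.maxDoc()];
        Terms terms = reader.terms(field);
        if (terms == null) return counts;

        // Pass 1: total number of term hits across all query hashes.
        TermsEnum te = terms.iterator();
        long remaining = 0;
        for (BytesRef hash : sortedHashes) {
            if (te.seekExact(hash)) remaining += te.docFreq();
        }

        // Pass 2: count hits, tracking the top-k counts in a min-heap.
        PriorityQueue<Integer> topK = new PriorityQueue<>(); // min-heap of counts
        te = terms.iterator();
        PostingsEnum postings = null;
        for (BytesRef hash : sortedHashes) {
            if (!te.seekExact(hash)) continue;
            postings = te.postings(postings, PostingsEnum.NONE);
            for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
                int c = ++counts[doc];
                if (topK.size() < k) topK.offer(c);
                else if (c > topK.peek()) { topK.poll(); topK.offer(c); }
                remaining--;
                // Even funneling every remaining hit into a single doc couldn't
                // beat the current k-th largest count, so stop early.
                if (topK.size() == k && topK.peek() > remaining) return counts;
            }
        }
        return counts;
    }
}
```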

Seems to work on the benchmark discussed previously. The times are roughly equivalent, but there's some obvious room for optimization. The method spends more time managing the heap than it does accessing Lucene postings. I'm using a standard PriorityQueue<Integer>, and I bet if I wrote one using primitives it would be faster. About 10% of the total runtime goes to computing the kth largest hit from the array-backed counter; I just have to change some internal classes and methods to be able to use the kth largest from the heap instead of re-computing it.

alexklibisz avatar Sep 28 '20 01:09 alexklibisz

You might want to pre-sort the Hash in unicode order ... it might give a tiny improvement since the TermsEnum can share more internal state on each seekExact.

I might already be doing this in the query constructor, here. The comparator just uses Java's Arrays.compareUnsigned, which, IIRC, is what Lucene does for BytesRefs. Is that what you are suggesting or something different?

compareUnsigned is perfect and matches Lucene's term sort order (Unicode sort order on utf-8 encoded byte[]).

Put together a working early-stopping solution today. Roughly the idea is:

That sounds promising! Do you take the docFreq (or maybe totalTermFreq) of terms into account? E.g., collecting all term + docFreq from the TermsEnum, then sort them in increasing docFreq order? That way your early termination would save the most work (by skipping the terms that had many docs)? Though, maybe your terms (vector hashes) are somewhat uniform and their docFreq do not vary much.
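
Roughly something like this (untested sketch, names illustrative, assuming the query hashes are already available as BytesRefs):

```java
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class DocFreqOrderSketch {
    // Seek each query hash once to read its docFreq, then return the hashes ordered
    // rarest-first, so early termination skips the longest postings lists.
    static List<BytesRef> rarestFirst(Terms terms, BytesRef[] queryHashes) throws IOException {
        TermsEnum te = terms.iterator();
        List<Map.Entry<BytesRef, Integer>> found = new ArrayList<>();
        for (BytesRef hash : queryHashes) {
            if (te.seekExact(hash)) {
                found.add(new AbstractMap.SimpleImmutableEntry<>(BytesRef.deepCopyOf(hash), te.docFreq()));
            }
        }
        found.sort(Map.Entry.comparingByValue()); // ascending docFreq
        List<BytesRef> ordered = new ArrayList<>(found.size());
        for (Map.Entry<BytesRef, Integer> e : found) ordered.add(e.getKey());
        return ordered;
    }
}
```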

Be sure to test on real vectors. Testing on random data leads to random conclusions!

mikemccand avatar Sep 29 '20 14:09 mikemccand

That sounds promising! Do you take the docFreq (or maybe totalTermFreq) of terms into account? E.g., collecting all term + docFreq from the TermsEnum, then sort them in increasing docFreq order? That way your early termination would save the most work (by skipping the terms that had many docs)? Though, maybe your terms (vector hashes) are somewhat uniform and their docFreq do not vary much.

I thought about this but my understanding was that Lucene is optimized to seek over the terms in sorted order? Maybe the cost of unsorted seeks over terms is trivial compared to other things here?

alexklibisz avatar Sep 29 '20 14:09 alexklibisz

I'm afraid the early-stopping method as I described it isn't going to work. Specifically, it's pretty easy to find a case where a single vector matches for multiple consecutive hash terms and fills up the heap with its incrementing match counts. You end up determining that the cutoff for match counts is an old match count for that doc and you early-stop, but then there's only one doc that exceeds that cutoff. Replacing a doc's previous count in the heap might solve this?

Edit: I got it to work by clearing the heap after each term. If the smallest count in the heap for a specific term exceeds the remaining possible counts, then you early stop. But this ends up still being slower than just counting all the docs. Setting this aside for a few days.

alexklibisz avatar Oct 01 '20 14:10 alexklibisz

That sounds promising! Do you take the docFreq (or maybe totalTermFreq) of terms into account? E.g., collecting all term + docFreq from the TermsEnum, then sort them in increasing docFreq order? That way your early termination would save the most work (by skipping the terms that had many docs)? Though, maybe your terms (vector hashes) are somewhat uniform and their docFreq do not vary much.

I thought about this but my understanding was that Lucene is optimized to seek over the terms in sorted order? Maybe the cost of unsorted seeks over terms is trivial compared to other things here?

You are right, Lucene would prefer that you seek the terms in Unicode order.

But it allows you to NOT do so, and the performance penalty might be lower than the gains you could see by early terminating sooner?

Just pure speculation ... I think you would need to test and see to know for sure.

mikemccand avatar Oct 02 '20 16:10 mikemccand

I'm afraid the early-stopping method as I described it isn't going to work. Specifically, it's pretty easy to find a case where a single vector matches for multiple consecutive hash terms and fills up the heap with its incrementing match counts. You end up determining that the cutoff for match counts is an old match count for that doc and you early-stop, but then there's only one doc that exceeds that cutoff. Replacing a doc's previous count in the heap might solve this?

Hmm, I see, your top-K heap allows the same doc to be added in multiple slots?

Edit: I got it to work by clearing the heap after each term. If the smallest count in the heap for a specific term exceeds the remaining possible counts, then you early stop. But this ends up still being slower than just counting all the docs. Setting this aside for a few days.

OK, sigh :)

I still think you should explore indexing "sets of commonly co-occurring hashes" if possible. If your query-time vectors are such that the same sets of hashes commonly co-occur (i.e. you can predict these common subsets offline, before indexing, with highish accuracy to future queries), the impact can be sizable. It'd make your index a bit larger, but allow simpler/faster queries at search time.

mikemccand avatar Oct 02 '20 16:10 mikemccand

OK, sigh :)

I still think you should explore indexing "sets of commonly co-occurring hashes" if possible. If your query-time vectors are such that the same sets of hashes commonly co-occur (i.e. you can predict these common subsets offline, before indexing, with highish accuracy to future queries), the impact can be sizable. It'd make your index a bit larger, but allow simpler/faster queries at search time.

I agree.. I'm trying to think through some ways that could work without requiring a huge sample of representative vectors up-front.

alexklibisz avatar Oct 02 '20 17:10 alexklibisz

I'm trying to make precise what advantage indexing co-occurring hashes would have. If we assume that Lucene's retrieval speed is maxed out, the next obvious speedup is to somehow decrease the number of irrelevant docs that I'm iterating over, i.e. increase the "quality" of candidates retrieved for each term.

Aggregating co-occurring hashes would definitely lead to fewer terms. Would it lead to higher-quality matches for each term? As the number of hashes in an aggregate increases, the probability of retrieving an irrelevant candidate decreases. But that's also the exact same reasoning and procedure behind the k parameter in most of the LSH mappings (http://elastiknn.klibisz.com/api/#l2-lsh-mapping). The vector is already getting hashed L * k times, and the adjacent sets of k hashes are getting concatenated to form L hashes.

So maybe I can get the same effect by increasing L and k, and my hunch is that the increased cost of computing more hashes is probably less than the decreased cost of Lucene iterating over irrelevant docs. Up til now I've been assuming (at least implicitly) that we should avoid a large number of terms, i.e. that L and k should be as low as possible. A bit rambly, but I needed to get my thoughts down in writing.
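
To make the L/k mechanics concrete (schematic only, not Elastiknn's actual hashing code):

```java
class ConcatSketch {
    // Schematic of the L/k construction: given L * k elementary hashes, concatenate each
    // adjacent group of k into one indexed term. Larger k makes each term more selective
    // (fewer, higher-quality candidates per term); larger L adds more terms to keep recall up.
    static int[][] groupIntoTerms(int[] elementaryHashes, int L, int k) {
        int[][] terms = new int[L][k];
        for (int l = 0; l < L; l++) {
            for (int j = 0; j < k; j++) {
                terms[l][j] = elementaryHashes[l * k + j];
            }
        }
        return terms; // each row would be serialized into a single Lucene term
    }
}
```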

alexklibisz avatar Oct 02 '20 19:10 alexklibisz

Oh, I was proposing a purely functionality-neutral optimization, since indexing co-occurring hashes would result in fewer disjunctive terms at search time, and should make your searches run faster? But you're right, the scoring would be impacted, unless you could e.g. make a co-occurring hash that covers 7 hashes count as +7 at search time. I had not thought through the implications of changing the score ...

mikemccand avatar Oct 06 '20 19:10 mikemccand

Yeah - really we should be able to know precisely how many times each vector matched. It's already an approximate algorithm, so introducing more approximation for performance's sake might not be worth it.

I'm thinking we've possibly hit the limit here. This effort is mostly motivated by trying to get the performance on par with some C/C++ in-memory approximate nearest neighbors solutions. Right now, for this benchmark dataset, the queries/second is 1 to 2 OOM lower than those alternatives. Of course there are other tradeoffs that make this a favorable approach in some ways.

But after trying out these options and getting some sanity checks on the current implementation (thanks again to @mikemccand!), I think I'm fine with the conclusion that we've hit the limit for now.

alexklibisz avatar Oct 07 '20 00:10 alexklibisz

Since one of the major bottlenecks is reading the matched doc ids from the segment, the last thing I'm tempted to look into is whether there is some construct for caching access to the postings for a particular term in a segment. I don't see any existing construct like that in the Lucene library. I could hack it together myself pretty easily using Guava. But ideally it would also bust the cache when there are appends to a segment, and I'm not sure how I'd wire it up so that the cache is busted on appends.
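
Something like this is what I have in mind (purely hypothetical sketch using Guava; keying on the segment's core cache key would at least scope entries to a segment, since segments are write-once and new docs land in new segments):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.AbstractMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ExecutionException;

class PostingsCacheSketch {
    // Hypothetical cache of doc ids per (segment core key, term). Entries simply become
    // unreachable when a segment is merged away or dropped.
    private final Cache<Map.Entry<IndexReader.CacheKey, BytesRef>, int[]> cache =
            CacheBuilder.newBuilder().maximumSize(100_000).build();

    int[] docsForTerm(LeafReader reader, TermsEnum te, BytesRef term) throws IOException {
        IndexReader.CacheKey coreKey =
                Objects.requireNonNull(reader.getCoreCacheHelper(), "reader has no core cache key").getKey();
        Map.Entry<IndexReader.CacheKey, BytesRef> key =
                new AbstractMap.SimpleImmutableEntry<>(coreKey, BytesRef.deepCopyOf(term));
        try {
            return cache.get(key, () -> readDocs(te, term));
        } catch (ExecutionException e) {
            throw new IOException(e.getCause());
        }
    }

    private static int[] readDocs(TermsEnum te, BytesRef term) throws IOException {
        if (!te.seekExact(term)) return new int[0];
        int[] docs = new int[te.docFreq()];
        PostingsEnum postings = te.postings(null, PostingsEnum.NONE);
        int i = 0;
        for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
            docs[i++] = doc;
        }
        return docs;
    }
}
```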

alexklibisz avatar Oct 07 '20 03:10 alexklibisz

I am wondering whether you considered using top-k aggregation algorithms like TA or NRA (see http://alumni.cs.ucr.edu/~skulhari/Top-k-Query.pdf) that avoid traversing the complete posting lists (and actually counting the hits), and how they compare with the early-termination strategy that you experimented with. Also, did you consider using the WANDScorer in Lucene? I guess that these algorithms also don't perform well in this case because the query has many terms?

aecio avatar Oct 07 '20 14:10 aecio

Hi @aecio . Thanks for the link. I wasn't aware of those algorithms so I'll read the slides, hopefully tonight.

I looked through the WANDScorer code. I didn't really see a way to apply it here. IIRC it looked like it was designed to optimize the top-hits scoring for N individual TermQueries, rather than one query with N terms. A couple months ago I was using a BooleanQuery with a TermQuery in a should clause for each of the terms. That was 2-3x slower than the current implementation, and re-scoring the vectors with exact similarities was a much more awkward arrangement.

alexklibisz avatar Oct 07 '20 15:10 alexklibisz

@aecio The algorithms in those slides seem worth trying. It seems that for both Fagin's Algorithm and the Threshold Algorithm, I'd need to keep an array of PostingsEnums to "step" through the docs for each of the terms. I'll probably try the TA first. It's not clear to me how I'd implement the No-Random-Access algorithm. Specifically, I'm confused on how they define the lower and upper bounds.

alexklibisz avatar Oct 08 '20 01:10 alexklibisz

Took a first pass at the Threshold Algorithm this morning: https://github.com/alexklibisz/elastiknn/blob/top-k-algos/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L51-L134 A few of the tests pass.. definitely not all. There are probably some mistakes in there.

alexklibisz avatar Oct 08 '20 14:10 alexklibisz

I'm not seeing any practical improvement from using the Threshold Algorithm. The kth greatest score is consistently ~2 OOM lower than the threshold. This makes some intuitive sense when you consider there are usually 300 to 600 query terms, k = 1000 to 10,000, and x_i is always 1. Extremely few of the docs will match all of the terms. Here's the latest code: https://github.com/alexklibisz/elastiknn/blob/top-k-algos/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L56

It is nice to know that I can keep an iterator for each term and step through the sorted docs, effectively doing a document-first traversal rather than a term-first one. That might be useful in revisiting my original idea for early-stopping.
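
The traversal I have in mind looks roughly like this (illustrative sketch, not the branch code):

```java
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;

class DaatSketch {
    interface DocConsumer { void accept(int docId, int matchCount); }

    // Document-at-a-time traversal: keep one PostingsEnum per query hash and repeatedly
    // process the smallest current docID, counting how many enums are positioned on it.
    static void traverse(Terms terms, BytesRef[] queryHashes, DocConsumer consumer) throws IOException {
        PostingsEnum[] enums = new PostingsEnum[queryHashes.length];
        int n = 0;
        TermsEnum te = terms.iterator();
        for (BytesRef hash : queryHashes) {
            if (te.seekExact(hash)) {
                PostingsEnum pe = te.postings(null, PostingsEnum.NONE);
                if (pe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) enums[n++] = pe;
            }
        }
        while (n > 0) {
            // Find the smallest docID among the live enums.
            int minDoc = Integer.MAX_VALUE;
            for (int i = 0; i < n; i++) minDoc = Math.min(minDoc, enums[i].docID());
            // Count and advance every enum sitting on that doc.
            int count = 0;
            for (int i = 0; i < n; i++) {
                if (enums[i].docID() == minDoc) {
                    count++;
                    if (enums[i].nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
                        enums[i] = enums[--n]; // drop exhausted enum
                        i--;
                    }
                }
            }
            consumer.accept(minDoc, count);
        }
    }
}
```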

alexklibisz avatar Oct 09 '20 14:10 alexklibisz

I've been going through the various optimizations described in this fantastic paper: http://www.dcs.gla.ac.uk/~craigm/publications/fnt-efficient-query-processing.pdf

So far I've tried the TAAT Quit and Continue strategies described on page 39, TAAT MaxScore on page 41, and DAAT MaxScore on page 47. TAAT Quit and Continue yields a significant speedup on the Fashion-MNIST dataset: at 0.7 recall it was previously at ~190 queries/second and is now at ~260 queries/second. So far none of them have yielded much of a speedup for the pathological parameters on the SIFT dataset.

Edit: this was sadly a bug in how I was evaluating recall. The change made it so that duplicate docs were returned. Those duplicates scored very well, but a query for k=100 docs was usually only returning about 30 distinct docs. :facepalm:

alexklibisz avatar Oct 11 '20 20:10 alexklibisz

Thanks for reporting the results of your experiments @alexklibisz!

The paper you mentioned is indeed excellent. If you're willing to dig deeper into this, the following paper is more recent and describes new techniques not included in that paper: https://dl.acm.org/doi/10.1145/3397271.3401076 I'm not sure they would be directly applicable or improve things for your use case, but they have some plots that show their algorithm becomes particularly more efficient as the number of terms in the query increases. They compare with several variants of these optimizations, including BlockMax MaxScore, BlockMax WAND, and others.

WRT the previous comments, can you clarify what you mean by "The kth greatest score is consistently ~2 OOM lower than the threshold"? I'm not sure what OOM means in this context.

aecio avatar Oct 11 '20 22:10 aecio

WRT the previous comments, can you clarify what you mean by "The kth greatest score is consistently ~2 OOM lower than the threshold". I'm not sure what OOM means in this context.

OOM = orders of magnitude. Not out-of-memory :)

alexklibisz avatar Oct 11 '20 22:10 alexklibisz

@aecio I appreciate your tips. I'm checking out the paper you linked.

I'll also explain the "pathological parameters on the SIFT dataset" that I mentioned before. Basically, there are two datasets in the ann-benchmarks suite which use euclidean distance: Fashion-MNIST and SIFT. The distributions of the vector values are quite different, here's a histogram: image SIFT has far more zeros than Fashion-MNIST. The Locality Sensitive Hashing model used by Elastiknn takes a width parameter w. Each of the vectors is projected onto a random scalar vector, and the line onto which they are projected is broken into segments of length w. Since there are so many zeros in the SIFT dataset, the projections (dot-products) are much smaller, so if the value of w is too large, you basically asign almost all of the vectors into a small handful of buckets. SIFT does best with w = 1 or 2. On the other hand, Fashion MNIST has larger dot-products, so it does best with w=6 or 7. When you run SIFT with w >= 3 almost every term ends up matching 50-70% of the vectors in the corpus, forcing you to iterate over extremely long postings lists, which makes the searches extremely slow. Unfortunately, the ann-benchmarks framework doesn't let you say "only use w=1 or 2 on this dataset and only use w = 6 or 7 on this other dataset." So it ends up wasting a bunch of time and/or timing-out on SIFT with w>=3. That's a big part of what motivated this entire endeavor.

alexklibisz avatar Oct 11 '20 22:10 alexklibisz

Thanks for the details. It all makes sense now. :)

If the problem is iterating over the long postings lists, then a promising solution might be to use the BlockMax optimizations that allow you to skip whole blocks of the postings lists that have no competitive scores (as opposed to evaluating every single doc with nextDoc()). Although it seems the code can't be reused in your query, the WANDScorer implements this idea.

aecio avatar Oct 12 '20 17:10 aecio

As I read more about the block-max strategies, I don't actually see them being a particularly useful optimization here. The reason I think this: every vector ("document") has the exact same number of hashes ("terms"), so the term upper bound for any given term should be the same for every doc in the postings list for that term. Maybe I'm missing something, but the block-max strategies seem to be predicated on the idea that there are blocks of docs where the document upper bound within that block is so low that you shouldn't even bother looking, so you skip the whole block.

alexklibisz avatar Oct 12 '20 17:10 alexklibisz

I thought that because each hash (term) has a numeric value (frequency) associated with it, the score upper bounds would not be the same, because they would depend on the frequency value. But I'm probably missing something as well; I only have a shallow understanding of both your query and these query optimizations.

aecio avatar Oct 12 '20 18:10 aecio

Ah, terms do technically have an associated frequency. It's only > 1 for one of the LSH models, Permutation LSH with repeating = true. However, repeating = false seems to outperform repeating = true, so I've been basically pretending every term appears exactly once for the sake of this problem. I've also considered, if I get a sufficient speedup here, just discarding that setting.

alexklibisz avatar Oct 12 '20 18:10 alexklibisz