Alex Klibisz
Alex Klibisz
Yeah - really we should be able to know precisely how many times each vector matched. It's already an approximate algorithm, so introducing more approximation for performance-sake might not be...
Since one of the major bottlenecks is reading the matched doc ids from the segment, the last thing I'm tempted to look into is whether there is some construct for...
Hi @aecio . Thanks for the link. I wasn't aware of those algorithms so I'll read the slides, hopefully tonight. I looked through the WANDScorer code. I didn't really see...
@aecio The algorithms in those slides seem worth trying. It seems that for both Fagin's Algorithm and the Threshold Algorithm, I'd need to keep an array of PostingsEnums to "step"...
Took a first pass at the Threshold Algorithm this morning: https://github.com/alexklibisz/elastiknn/blob/top-k-algos/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L51-L134 A few of the tests pass.. definitely not all. There are probably some mistakes in there.
I'm not seeing any practical improvement from using the Threshold Algorithm. The kth greatest score is consistently ~2 OOM lower than the threshold. This makes some intuitive sense when you...
I've been going through the various optimizations described in this fantastic paper: http://www.dcs.gla.ac.uk/~craigm/publications/fnt-efficient-query-processing.pdf So far I've tried the TAAT Quit and Continue described on page 39, TAAT MaxScore on page...
> WRT the previous comments, can you clarify what you mean by "The kth greatest score is consistently ~2 OOM lower than the threshold". I'm not sure what OOM means...
@aecio I appreciate your tips. I'm checking out the paper you linked. I'll also explain the "pathological parameters on the SIFT dataset" that I mentioned before. Basically, there are two...
As I read more about the block-max strategies, I don't actually see it being a particularly useful optimization here. The reason I think this: Every vector ("document") has the exact...