Add VP-Tree HashIndex
Or somehow generalize this into the CodeIndex representation structure. It's pretty much the case that all LSH-based nearest-neighbor algorithms will/can use the same nearest-code-neighbor strategy once hashes are generated.
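As a rough sketch of the shared strategy (the names here are illustrative, not a final API): any LSH variant produces binary hash codes, and the index abstraction only ever needs to answer nearest-code queries over those codes.

```python
# Hypothetical sketch: whatever functor generates the hash codes, the
# nearest-code-neighbor search only ever sees binary vectors, so one
# index abstraction could back every LSH variant.
class HashIndex(object):

    def build_index(self, hash_vectors):
        """Index a set of hash codes (iterable of boolean numpy vectors)."""
        raise NotImplementedError

    def nn(self, hash_vector, n=1):
        """Return the n nearest indexed codes to ``hash_vector`` and their
        Hamming distances, nearest first."""
        raise NotImplementedError
```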
Postponing this feature due to VP-Trees not performing as expected.
Given the recent overhaul of the LSH algorithm modularity, this would now be implemented as a new HashIndex implementation.
I'd be happy to have a crack at this if it's still of interest. I didn't see much on the dev/vptree-nn-index branch, so a few questions:
- Was there an existing implementation of VP-trees that you were hoping to use here?
- If not, is performance enough of a consideration that you'd prefer this be implemented as a C++ extension rather than pure Python?
- In the same vein, are there performance benchmarks I should test this against before submitting a PR?
Hi @wphicks! Thanks for your interest. Incoming wall of text.
- I don't have a specific implementation in mind with regard to backing an algorithm wrapper. I have a local implementation of VP, VP-s, and VP-sb trees as described in the original paper by Yianilos. I think it works given some fairly basic testing, but that code has not been checked in anywhere yet. I need to double-check some things before I can put that code somewhere.
- The basic performance check is to get an improvement in build and/or query time over the KD-tree and ball tree implementations in scikit-learn; the literature claims that VP-trees should be faster and more space-efficient than KD/ball trees. If that can be achieved in pure Python, then hooray. Otherwise a C implementation would be required, though that may complicate model serialization. (See the baseline sketch at the end of this comment.)
- I don't have any standard nearest-neighbor performance benchmarks implemented in SMQTK, but @fishcorn (a coworker of mine) has used cumulative gains as a comparative measure. He can explain that measure better than I can at the moment.
In other details, the test implementation that I wrote, as well as this other implementation, seems to have a deficiency with regard to indexing values using the Hamming distance metric. I observed that both VP-tree implementations broke down into brute-force search, whereas the sklearn ball tree performed as it does with other distance metrics. This may be because both implementations have flaws that cause this, or because VP-trees just break down with this metric; I don't currently remember or know. This should probably be investigated.
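For concreteness, the baseline I have in mind for that performance check is roughly the following (just a sketch; the sizes, dimension, and neighbor count are arbitrary placeholders):

```python
import time
import numpy as np
from sklearn.neighbors import BallTree

# Sketch of the baseline to beat: build and query times for scikit-learn's
# ball tree on random binary vectors.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(10000, 256)).astype(np.float64)

t0 = time.time()
tree = BallTree(X, metric='hamming')
print("build time: %.4f s" % (time.time() - t0))

q = rng.randint(0, 2, size=(1, 256)).astype(np.float64)
t0 = time.time()
dists, idxs = tree.query(q, k=10)
print("query time: %.4f s" % (time.time() - t0))
```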
Okay, great! Responding point-by-point:
- I'll go ahead and start doing my own implementation until/unless you can put your local implementations out in some form. If you do put it out sometime soon, I'd be happy to do testing and/or help flesh out whatever you've got.
- Roger that- I'll start with pure Python and see how it goes.
- Sounds good! @fishcorn, any details are much appreciated.
In terms of your final point, I'll do some research. I have a vague memory about something like this rattling around somewhere in the back of my skull, but I'll have a look at the literature on it.
I should note that the definition of cumulative gains in this context is somewhat loose.
When you retrieve K nearest neighbors to a particular query with a VP tree, you can rank them by actual distance. The idea here is to compare this against the exact nearest neighbors to see how well the query results adhere to ground truth.
So on the X axis you have the RANK of the EXACT neighbor, and on the Y axis you have the RANK of the QUERY RESULT. If the nth-ranked exact neighbor is NOT present in the results, the curve continues horizontally. If it IS present, the curve steps both up and right.
If the results match the exact nearest-neighbor set, the curve climbs to a plateau at K and then continues horizontally. Most of the time it won't, so it will climb more gradually than the exact curve.
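In code, the curve I'm describing is roughly this (a sketch; exact_ids and result_ids are hypothetical names for the two ranked ID lists):

```python
import numpy as np

def cumulative_gain_curve(exact_ids, result_ids):
    """Sketch of the comparative measure described above.

    exact_ids  -- IDs of the true K nearest neighbors, ranked by distance.
    result_ids -- IDs returned by the approximate query, ranked by distance.

    Returns y such that y[n] is how many of the top-(n + 1) exact neighbors
    appear anywhere in the query results: a perfect result climbs the
    diagonal up to K, an imperfect one climbs more gradually.
    """
    result_set = set(result_ids)
    hits = np.array([eid in result_set for eid in exact_ids], dtype=int)
    return np.cumsum(hits)
```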
Makes sense, thanks! I'll get to work.
I should be able to put my tree implementations on a branch sometime today for you to pick apart.
Other implementations I've learned from, which could be compared against the one coming on a branch shortly:
- http://www.logarithmic.net/pfh/blog/01164790008
  - I've been initially comparing against this one.
- https://github.com/huyng/algorithms/tree/master/vptree
VpTree utilities are now on branch dev/vp-tree-utils.
Great, thanks! I'll take a look at it over the weekend and compare against what I've got so far.
Sorry, got my own wall of text for you this time!
After going through the Yianilos paper again, I think the poor performance with the Hamming distance metric is inherent to VP trees and not an implementation flaw. Specifically, consider two random bit vectors of length D. The probability that the (non-normalized) Hamming distance between them is r will be (1/2)^D * binomial(D, r). This violates the ZPS property Yianilos lays out in Definition 2, so the question is whether it's a strong enough violation that we'll consistently revert to a brute-force search.
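To make that concrete, here's a quick sketch of that distribution (D is arbitrary):

```python
import numpy as np
from scipy.special import comb

D = 64
r = np.arange(D + 1)
# P(Hamming(u, v) = r) for two uniform random bit vectors of length D
p = comb(D, r) / 2.0 ** D

print(p[D // 2])                       # mass right at the median distance
print(p[D // 2 - 3:D // 2 + 4].sum())  # most of the mass sits within a few
                                       # distances of D/2
```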
Looking at Algorithm 2, the question is how often x will be an element of both I_l and I_r, which is to say how often the distance from our query (q) to the vantage point (v) of a given node will fall between the lower bound of the node's right child-tree minus the current tolerance and the upper bound of the node's left child-tree plus the current tolerance. Let's make the generous assumption that this only happens when the distance from q to v is exactly the median distance (m) between v and all other vectors indexed in its child trees.

If q is chosen randomly, then the probability of this happening is (1/2)^D * binomial(D, m), and v should be chosen to maximize or minimize m. Without specifying something about the distribution of our input vectors (i.e. that they have a particular kind of clustering), there's no guarantee that we can bring the probability of d(q, v) == m below any given threshold. If q is chosen from a distribution generally similar to the distribution of input vectors that were actually used to build the tree, I think the situation actually gets worse, though I haven't worked it out formally yet. In that case, I'm also reasonably sure that the usual method of choosing v to maximize the spread of distances to the input vectors is still the best strategy.
The only good news here is that in the limit of large D, the probability of d(q, v) == m does go to 0 (for randomly selected q). I'm going to do some numerical testing, both to make sure I'm not fooling myself and to see if there's some dimensionality above which the performance beats out KD and ball trees. The question then is what descriptor vector sizes are typical for SMQTK, to see if it's worth including this.
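The numerical check I have in mind for the worst case is just something like this (sketch):

```python
from scipy.special import comb

# Worst case for the pruning argument above: probability that a random
# query lands exactly at the median distance m = D/2 from the vantage
# point. This decays like sqrt(2 / (pi * D)) as D grows.
for D in (16, 64, 256, 1024, 4096):
    p = comb(D, D // 2, exact=True) / (2 ** D)
    print("D = %4d: P(d(q, v) == m) ~ %.5f" % (D, p))
```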
I think I will need to process your above explanation and maybe read the paper again, but it seems to follow what I've seen experimentally.
In regard to typical descriptor vector sizes, that depends on the content being described and the algorithm being used. Currently we have been focusing on imagery and using CNN-based description approaches, which usually yield 2048- to 4096-element vectors. Depending on how networks evolve going forward, that may fluctuate.
Okay, thanks! In that case, it may still be worth it. I'll run some tests at around those sizes and let you know what I find out.
I ran a bunch of tests over the weekend with your VP tree implementation (integrated into this fork). A few results:
- All variants of VP trees start to outperform the ball tree implementation (in terms of search runtime) for indices of ~1000 binary vectors or more at any dimension above ~16. For indices of 10000 vectors at dimension 4096, VP tree searches are about 25 times faster.
- Consistent with my intuition from before, VP trees perform substantially worse (in terms of nodes visited) when searching with a query that is close to an indexed vector than when searching with a random query. Even at their worst, though, they substantially outperform ball trees for high (> ~16) dimensions and large (> ~1000 vector) indices.
- Building a VP tree is always substantially slower than building a BT. It scales comparably to (and slightly worse than) building a linear hash index.
One huge caveat on these results: In building and searching the VP trees, I converted the bit vectors to an integer representation using the functions in bit_utils.py (consistent with what is done for linear hash indices here). That turns out to be very expensive, accounting for 91% of the runtime in searching a VP tree at dimension 4096 (without numba). I think there might be a few ways to get around that conversion, but the takeaway should be that VP trees perform even better than these results suggest.
Given that, it seems worth using VP trees (for SMQTK's typical domain) after all. Assuming that all sounds good to you, I'm going to see if I can work around the vector-to-int conversion issue and then finish integrating this as a hash index.
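For reference, the kind of workaround I'm picturing is to skip the integer representation entirely and compute Hamming distance on byte-packed vectors (a sketch only, not tied to the existing bit_utils API):

```python
import numpy as np

# 256-entry popcount table: Hamming distance on packed vectors then reduces
# to XOR plus table lookups, with no bit-vector -> int conversion at all.
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def pack(bit_vector):
    """Pack a boolean/0-1 vector into bytes once, at index build time."""
    return np.packbits(np.asarray(bit_vector, dtype=np.uint8))

def hamming_packed(a_packed, b_packed):
    """Hamming distance between two equal-length packed vectors."""
    return int(_POPCOUNT[np.bitwise_xor(a_packed, b_packed)].sum())
```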
Sounds promising. Did you create any scripts, graphs or plots to show your results? Are any of your tests something that could be used for performance unit-tests?
Feel free to optimize the vector/int conversion and create a PR for a VP-tree hash-index and/or nn-index.
Yes, I have a couple of scripts that I used to generate a whole bunch of plots for different dimensions and numbers of indexed vectors. I was planning on putting the plots in a blog post later, but if there's a better/more useful place to stick them, I'll be happy to. I've just pushed the scripts to this branch (here and here) in case you want to look at them, but they're very hacky. I can definitely scrape the good stuff out into some clean performance tests, though.
Sorry for the lack of response. I have been preparing for, and am now in, a field test environment. I will be able to take a look at your linked work in about a week.
No worries and thanks for all the help so far! I'm hoping to get the full PR done this weekend, so you can just take a look at that when you're back.