Radim Řehůřek
Never mind, I see the commit here: https://github.com/RaRe-Technologies/smart_open/commit/909930e9f30ee04a609fc0c8910389b963d7dba3
A benchmark on a real `smart_open` use case (closer to what users see and care about) would be great. I'm curious about the difference myself.
Alright, let's do it. I'm currently benchmarking some ANN libs [1] because I want to add an ANN algo to gensim (gensim only has brute force linear search now). But...
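For readers unfamiliar with the term, "brute force linear search" means scoring the query against every stored vector and keeping the best `k`. The sketch below is only an illustration in plain Python, not gensim's actual implementation; the function names `cosine` and `brute_force_top_k` are made up for this example.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def brute_force_top_k(query, vectors, k):
    """Return the k most similar vectors as (similarity, index) pairs, best first.

    Every vector is compared against the query, so the cost grows linearly
    with the number of stored vectors. ANN libraries trade some accuracy
    to avoid this full scan.
    """
    scored = ((cosine(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nlargest(k, scored)
```

Calling `brute_force_top_k([1.0, 0.0], vectors, 2)` scans all of `vectors` once, which is the cost an approximate index is meant to reduce.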
I published the benchmark numbers (http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/). I had to leave out LSH because of the `k` problems. Do you have a plan for how to mitigate or solve it?...
Thanks a lot for the great write-up! I didn't think of the low density areas; my first, naive idea stopped at what your sketch suggests for "high density" areas :)...
Yes I am :) Is this the same algo as used in Annoy? Though there were multiple trees used in Annoy IIRC, so I guess not. I'll plug this implementation...
You mean in Annoy? There's only a param controlling the number of trees (perf/precision tradeoff) -- I evaluated various choices for this param in the "wikipedia shootout benchmark", from 1...
All, preferably (and the non-English ones are particularly interesting).
I don't understand. What is the problem?
I see what you mean, but don't see it as a problem. Why couldn't the dataset loader just return a dictionary of models?