Radim Řehůřek comments

Results: 318 comments of Radim Řehůřek

Fix integration tests

Never mind, I see the commit here: https://github.com/RaRe-Technologies/smart_open/commit/909930e9f30ee04a609fc0c8910389b963d7dba3

Fix quadratic time ByteBuffer operations

A benchmark on a real `smart_open` use case (closer to what users see and care about) would be great. I'm curious about the difference myself.

Retrieving `k` nearest neighbours?

Alright, let's do it. I'm currently benchmarking some ANN libs [1] because I want to add an ANN algo to gensim (gensim only has brute force linear search now). But...
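For reference, "brute force linear search" here means scoring the query against every stored vector and keeping the best `k`. Below is a minimal pure-Python sketch of that idea, assuming cosine similarity as the measure; the function names and toy vectors are made up for illustration and are not gensim's API.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_knn(query, vectors, k):
    """Score the query against every vector and return the k best (similarity, index) pairs."""
    scored = ((cosine_similarity(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nlargest(k, scored)


# Toy data, for illustration only.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top2 = brute_force_knn([1.0, 0.1], vectors, k=2)
```

The cost grows linearly with the number of stored vectors, but the search returns exactly `k` results whenever at least `k` vectors are stored.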

Retrieving `k` nearest neighbours?

I published the benchmark numbers at http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/ and had to leave out LSH because of the `k` problems. Do you have a plan for how to mitigate or solve them?...

Retrieving `k` nearest neighbours?

Thanks a lot for the great write-up! I didn't think of the low density areas; my first, naive idea stopped at what your sketch suggests for "high density" areas :)...

Retrieving `k` nearest neighbours?

Yes I am :) Is this the same algo as used in Annoy? Though there were multiple trees used in Annoy IIRC, so I guess not. I'll plug this implementation...

Retrieving `k` nearest neighbours?

You mean in Annoy? There's only a param controlling the number of trees (perf/precision tradeoff) -- I evaluated various choices for this param in the "wikipedia shootout benchmark", from 1...

Historical word embeddings

All, preferably (and the non-English ones are particularly interesting).

Historical word embeddings

I don't understand. What is the problem?

Historical word embeddings

I see what you mean, but don't see it as a problem. Why couldn't the dataset loader just return a dictionary of models?
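A rough sketch of what "return a dictionary of models" could look like. Everything here is hypothetical: the function names, the keys, and the stand-in loader are assumptions for illustration, not an existing API.

```python
def load_models(names, load_one):
    """Return a {name: model} dictionary, one entry per requested name.

    `load_one` is whatever callable knows how to load a single model.
    """
    return {name: load_one(name) for name in names}


def fake_load_one(name):
    """Stand-in loader for illustration; a real one would read vectors from disk."""
    return {"name": name, "vectors": {}}


# Hypothetical keys; the real dataset may be organised differently.
models = load_models(["eng-1900", "fra-1900"], fake_load_one)
```

Callers then pick the model they need by key, e.g. `models["fra-1900"]`, and the loader's interface does not change when more models are added to the dataset.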