ann-benchmarks
Support for Jaccard and sparse data
I've been working on benchmarking pynndescent on other metrics such as Jaccard, and have been using ann-benchmarks and the kosarak dataset for that. Some recent PRs (#235 and #238) have changed things a little, and my initial efforts to get the benchmark working with Kosarak were not successful. After a little experimentation it seems that the data is arriving as a dense boolean 2d array (which is, in itself, easy enough to work with, but a little heavy). I adapted the code to handle this (benefits include getting to use pdist for the results distance computations), including getting nmslib up and running with sparse Jaccard (thus BallTree, SW-graph and hnsw all work). I was going to put in a PR for this, but then got reading the earlier PRs. Dense boolean arrays were clearly not the intended format from those PRs. Perhaps the kosarak dataset reference has not been updated on ann-benchmarks.com? Regardless, I realised I should seek some clarification about both the current state of the Jaccard benchmark and the intended data format going forward, to ensure I submit something sensible as a PR.
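For readers following along, here is a minimal sketch of what computing Jaccard distances with pdist on a dense boolean array looks like. The toy array `X` is made up for illustration and is not the kosarak data; this is not the actual ann-benchmarks code, just the `scipy.spatial.distance` calls the post alludes to.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: each row is a set, each column an item.
X = np.array(
    [
        [1, 1, 0, 0],  # {0, 1}
        [1, 0, 1, 0],  # {0, 2}
        [0, 0, 1, 1],  # {2, 3}
    ],
    dtype=bool,
)

# pdist returns the condensed (upper-triangle) pairwise distances;
# squareform expands them into a full symmetric matrix.
D = squareform(pdist(X, metric="jaccard"))

# Rows 0 and 1 share one item out of three distinct items,
# so their Jaccard distance is 1 - 1/3.
print(D[0, 1])
```

The dense layout is what makes this a one-liner, at the cost of storing one boolean per (row, item) pair.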
I recently ran a benchmark, and I wasn't able to get Jaccard working, so I think something is weird with it. If you know how to fix it, I'd love to see a PR!
I'll try to have a PR soon. Hopefully it will cover both versions of what the sparse data could/should look like. I'm trying to get NGT working (the C++ has a sparse Jaccard; not sure if I can make it work with the Python interface yet). Either way, hopefully next week I can have a PR for review.
I just ran puffinn on kosarak and that works fine. It's indeed the shared file that still uses the old format. Could you update it @erikbern by creating locally and copying to ann-benchmarks.com?
I would be interested in hearing your thoughts about the sparse data format, @lmcinnes. I think the sparse format that @GuilhemN suggested works very well.
ok, i can do
I think the new sparse format makes sense. My personal preference would be to use a scipy.sparse matrix format, as that is pretty standard for sparse data. Notably it also allows for easy conversion between a variety of sparse formats. The format @GuilhemN proposed is essentially CSR (compressed sparse row) format: an indptr array providing information about how each row indexes into a flat indices array. CSR also allows for a data array if we wanted to do, for example, sparse cosine as well as Jaccard, and it is the standard for simple compact representations. Other formats that can also be useful include LIL (list of lists), DOK (dictionary of keys) and COO (coordinate format; essentially row, col, value triples). Using scipy.sparse instead of a custom format would allow for easy translation between any of these (a single method call). Of course only a limited range of libraries support sparse formats right now, so that might not be necessary.
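To make that concrete, here is a rough sketch of how an indptr/indices pair maps onto a scipy.sparse CSR matrix, and how the single-call conversions look. The toy sets and the `jaccard_distance` helper are hypothetical, written only for illustration; I'm assuming the custom format boils down to the two arrays described above, which may not match its exact on-disk layout.

```python
import numpy as np
from scipy import sparse

# Three toy sets over a universe of 5 items: {0, 1}, {0, 2}, {2, 3, 4}.
indices = np.array([0, 1, 0, 2, 2, 3, 4])  # flat, concatenated item ids
indptr = np.array([0, 2, 4, 7])            # row i is indices[indptr[i]:indptr[i + 1]]
data = np.ones(len(indices), dtype=np.float32)  # all ones for set data; weights for cosine

X = sparse.csr_matrix((data, indices, indptr), shape=(3, 5))

# Converting to the other formats is a single method call each.
X_lil = X.tolil()
X_dok = X.todok()
X_coo = X.tocoo()


def jaccard_distance(X, i, j):
    """Jaccard distance between rows i and j of a CSR matrix.

    Hypothetical helper: uses only indptr and indices, ignores data,
    and assumes each row's indices are unique.
    """
    a = X.indices[X.indptr[i]:X.indptr[i + 1]]
    b = X.indices[X.indptr[j]:X.indptr[j + 1]]
    inter = len(np.intersect1d(a, b))
    union = len(a) + len(b) - inter
    return 1.0 - inter / union if union else 0.0


print(jaccard_distance(X, 0, 1))
```

Since Jaccard only needs indptr and indices, the data array costs some memory for pure set data, but it is what would let the same container serve sparse cosine.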