
Support for Jaccard and sparse data

Open lmcinnes opened this issue 2 years ago • 6 comments

I've been working on benchmarking pynndescent on other metrics such as Jaccard, and have been using ann-benchmarks and the kosarak dataset for that. Some recent PRs (#235 and #238) have changed things a little, and my initial efforts to get the benchmark working with kosarak were not successful. After a little experimentation it seems that the data is arriving as a dense boolean 2d array (which is, in itself, easy enough to work with, but a little heavy). I adapted the code to handle this (benefits include getting to use pdist for the results distance computations), including getting nmslib up and running with sparse Jaccard (thus BallTree, SW-graph and hnsw all work). I was going to put in a PR for this, but then got to reading the earlier PRs. Dense boolean arrays were clearly not the intended format in those PRs. Perhaps the kosarak dataset reference has not been updated on ann-benchmarks.com? Regardless, I realised I should seek some clarification about both the current state of the Jaccard benchmark and the intended data format going forward, to ensure I submit something sensible as a PR.
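For anyone following along, here is a minimal sketch of what the dense boolean layout and the pdist-based distance computation might look like. The toy sets and the names `sets`, `n_items`, `dense` and `distances` are made up for illustration; this is not the actual ann-benchmarks loader or the kosarak file.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data standing in for a set-valued dataset:
# each row is a set of item ids drawn from a universe of n_items items.
sets = [{0, 1, 2}, {1, 2, 3}, {4}]
n_items = 5

# Dense boolean 2d array: dense[i, j] is True when item j belongs to set i.
dense = np.zeros((len(sets), n_items), dtype=bool)
for row, items in enumerate(sets):
    dense[row, sorted(items)] = True

# Pairwise Jaccard distances, 1 - |A ∩ B| / |A ∪ B|, expanded to a square matrix.
distances = squareform(pdist(dense, metric="jaccard"))
```

The convenience is that `pdist` works directly on the boolean array; the cost is that memory grows with the size of the item universe rather than with the number of items actually present in each set.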

lmcinnes avatar Jan 19 '22 19:01 lmcinnes

I recently ran a benchmark, and I wasn't able to get Jaccard working, so I think something is weird with it. If you know how to fix it, I'd love to see a PR!

erikbern avatar Jan 20 '22 02:01 erikbern

I'll try to have a PR soon. Hopefully it will cover both versions of what the sparse data could/should look like. I'm trying to get NGT working (the C++ has a sparse jaccard; not sure if I can make it work with the python interface yet). Either way hopefully next week I can have a PR for review.

lmcinnes avatar Jan 20 '22 03:01 lmcinnes

I just ran puffinn on kosarak and that works fine. It's indeed the shared file that still uses the old format. Could you update it @erikbern by creating locally and copying to ann-benchmarks.com?

maumueller avatar Jan 20 '22 09:01 maumueller

I would be interested in hearing your thoughts about the sparse data format, @lmcinnes. I think the sparse format that @GuilhemN suggested works very well.

maumueller avatar Jan 20 '22 09:01 maumueller

> I just ran puffinn on kosarak and that works fine. It's indeed the shared file that still uses the old format. Could you update it @erikbern by creating locally and copying to ann-benchmarks.com?

ok, i can do

erikbern avatar Jan 20 '22 13:01 erikbern

I think the new sparse format makes sense. My personal preference would be to use a scipy.sparse matrix format, as that is pretty standard for sparse data. Notably it also allows for easy conversion between a variety of sparse formats. The format @GuilhemN proposed is essentially CSR (compressed sparse row) format, which is the standard for simple compact representations: an indptr array provides information about how each row indexes into a flat indices array, and the format also allows for a data array if one wanted to do, for example, sparse cosine as well as Jaccard. Other formats include LIL (list of lists), DOK (dictionary of keys) and COO (coordinate format; essentially row, col, value triples), which can also be useful. Using scipy.sparse instead of a custom format would allow for easy translation between any of these (a single method call). Of course only a limited range of libraries support sparse formats right now, so that might not be necessary.
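As a rough sketch of the idea (toy data only; I'm assuming nothing about the exact on-disk layout from the PRs), this is how three small sets could be stored as a scipy.sparse CSR matrix, converted to the other formats, and compared with Jaccard using only the index arrays. The helper `jaccard_distance` is a hypothetical name for illustration, not something from scipy or ann-benchmarks.

```python
import numpy as np
from scipy import sparse

# Toy sets {0, 1, 2}, {1, 2, 3} and {4} over 5 items.
# In CSR, row i owns the slice indices[indptr[i]:indptr[i + 1]].
indptr = np.array([0, 3, 6, 7])
indices = np.array([0, 1, 2, 1, 2, 3, 4])
# All-ones data is enough for Jaccard; real weights would allow sparse cosine.
data = np.ones(len(indices), dtype=np.float32)

csr = sparse.csr_matrix((data, indices, indptr), shape=(3, 5))

# Translating to the other formats is a single method call each.
coo = csr.tocoo()
lil = csr.tolil()
dok = csr.todok()


def jaccard_distance(matrix, i, j):
    """Jaccard distance between rows i and j of a CSR matrix, ignoring values."""
    a = matrix.indices[matrix.indptr[i]:matrix.indptr[i + 1]]
    b = matrix.indices[matrix.indptr[j]:matrix.indptr[j + 1]]
    union = np.union1d(a, b).size
    if union == 0:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - np.intersect1d(a, b).size / union
```

Libraries that want the raw arrays can read `csr.indptr` and `csr.indices` directly, so a scipy.sparse container would not prevent anyone from using the flat representation.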

lmcinnes avatar Mar 03 '22 14:03 lmcinnes