big-ann-benchmarks.com
Just saw http://big-ann-benchmarks.com
I'm a bit surprised the authors of that competition didn't mention anything to me or the other people working on this benchmark suite. Feels like there's a lot of overlap, and I wish we could have worked together on it.
cc @maumueller
Actually just realized @maumueller is in the list of organizers... are you involved? Curious about your thoughts on whether it could be merged into this repo. I think there's a lot of value in having reproducible containers and everything open source.
Yes, I'm involved. You were also part of the initial email exchange as far as I remember?
The framework is following ann-benchmarks (and proper credit will be given), but there are some interesting changes with regard to the API and dataset/groundtruth handling. We plan to release the framework in about two weeks, I'll let you know!
Sorry, I must have missed that email or forgotten about it. Looking forward to seeing the code, hopefully there's an opportunity to collaborate!
Hello! Is there any progress on releasing the new framework? Specifically, I'm interested in seeing the I/O libraries that handle loading of 100M or 1B vectors. When using, for instance, HNSW directly (code taken from its creators), data loading becomes a bottleneck -- and this is not an issue with HNSW itself, but pure I/O slowness. I've been using this code implemented by Yandex (as part of open sourcing their 1B dataset): https://github.com/DmitryKey/bert-solr-search/blob/feature/hnswlib/src/util/utils.py#L57
We are in the final phase, so hopefully the framework will be released in a week or two. We memory-map binary files, roughly like https://github.com/facebookresearch/faiss/blob/master/contrib/datasets.py#L194-L199.
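For anyone curious what memory-mapping a big binary vector file looks like in practice, here's a minimal sketch using numpy. It assumes a simple binary layout with an 8-byte header (two int32 values: vector count and dimension) followed by contiguous float32 data; the function name `mmap_fbin` and the exact header layout are illustrative assumptions, not the framework's actual API.

```python
import numpy as np

def mmap_fbin(path):
    # Illustrative reader for an assumed layout: header = (n_vectors, dim)
    # as two int32 values, followed by n_vectors * dim float32 values.
    header = np.fromfile(path, dtype=np.int32, count=2)
    n, d = int(header[0]), int(header[1])
    # Memory-map the payload after the 8-byte header: the OS pages data in
    # on demand, so opening a 1B-vector file is nearly instant and slices
    # like vecs[1000000:1001000] only read the bytes they touch.
    return np.memmap(path, dtype=np.float32, mode="r", offset=8, shape=(n, d))
```

The key point is that `np.memmap` avoids copying the whole file into RAM up front, which is what makes naive `fromfile`-style loading so slow at the 1B scale.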
Sounds great! Looking forward to testing it.