big-ann-benchmarks.com
Just saw http://big-ann-benchmarks.com
I'm a bit surprised the authors of that competition didn't mention anything to me or the other people working on this benchmark suite. Feels like there's a lot of overlap, and I wish we could have worked together on it.
cc @maumueller
Actually just realized @maumueller is in the list of organizers... are you involved? Curious about your thoughts on whether it could be merged into this repo. I think there's a lot of value in having reproducible containers and everything open source.
Yes, I'm involved. You were also part of the initial email exchange as far as I remember?
The framework is following ann-benchmarks (and proper credit will be given), but there are some interesting changes with regard to the API and dataset/groundtruth handling. We plan to release the framework in about two weeks, I'll let you know!
Sorry, I must have missed that email or forgotten about it. Looking forward to seeing the code, hopefully there's an opportunity to collaborate!
Hello! Is there any progress on releasing the new framework? Specifically, I'm interested in seeing the I/O libraries that handle loading of 100M or 1B vectors. When using, for instance, HNSW directly (code taken from its creators), data loading becomes a bottleneck -- and this is not an issue with HNSW itself, but pure I/O slowness. I've been using this code implemented by Yandex (as part of open sourcing their 1B dataset): https://github.com/DmitryKey/bert-solr-search/blob/feature/hnswlib/src/util/utils.py#L57
We are in the final phase, so hopefully the framework will be released in a week or two. We memory-map binary files, roughly like https://github.com/facebookresearch/faiss/blob/master/contrib/datasets.py#L194-L199.
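For anyone curious what memory-mapping a big binary vector file looks like in practice, here's a minimal sketch using numpy. It assumes a simple binary layout with an 8-byte header (two int32 values: vector count and dimension) followed by contiguous float32 data; the function name `mmap_fbin` and the exact header layout are illustrative assumptions, not the framework's actual API.

```python
import numpy as np

def mmap_fbin(path):
    # Illustrative reader for an assumed layout: header = (n_vectors, dim)
    # as two int32 values, followed by n_vectors * dim float32 values.
    header = np.fromfile(path, dtype=np.int32, count=2)
    n, d = int(header[0]), int(header[1])
    # Memory-map the payload after the 8-byte header: the OS pages data in
    # on demand, so opening a 1B-vector file is nearly instant and slices
    # like vecs[1000000:1001000] only read the bytes they touch.
    return np.memmap(path, dtype=np.float32, mode="r", offset=8, shape=(n, d))
```

The key point is that `np.memmap` avoids copying the whole file into RAM up front, which is what makes naive `fromfile`-style loading so slow at the 1B scale.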
Sounds great! Looking forward to testing it.