lucene Fix failure to load larger data sets in KnnGraphTest

When running the reindex task with KnnGraphTest, exceptionally large datasets can be used. Since mmap is used to read the data, we need to know the buffer size. This size is limited to Integer.MAX_VALUE, which is inadequate for larger datasets.

An example data set that the current behavior fails on is: http://sites.skoltech.ru/compvision/noimi/ (deep-image-96-angular in ann-benchmarks).

Specifically deep-image-96-angular dataset in mapped memory has a size of 9990000 * 96 * 4 (docNum * dim * byteSizeOf(Float32)). As an int, this rolls over to -458807296, as a long: 3836160000.

So, this commit adds back the iterative batching, taking batch sizes near Integer.MAX_VALUE that are a multiple of dim * byteSize.

Oct 13 '22 15:10 benwtrent

@jtibshirani or @msokolov care to review? The bug was introduced back in https://github.com/apache/lucene/pull/1054

Oct 13 '22 17:10 benwtrent

Thanks for fixing this @benwtrent ! I wonder if we could take the simpler approach of just opening the file, and iterating through the vectors one by one. I don't think there's a clear performance benefit to mmapping sections of the file, since we are always just iterating through the vectors in order (to then index them, search them one-by-one, etc.)

Oct 13 '22 18:10 jtibshirani

@jtibshirani My goal here was to fix the bug with as much as the original design as possible. I didn't want to spend a bunch of time re-factoring this code.

I am open to simply read bytes from the file directly instead of using mmap. What say you @msokolov ?

Oct 13 '22 19:10 benwtrent

@jtibshirani I confirmed that on my M1 Macbook, there is no significant change in QPS, I tested glove-100-angular and deep-image-96-angular.

Oct 17 '22 22:10 benwtrent