big-ann-benchmarks
Questions about the Dataset class interface and dataset preparation
Question about the Dataset class interface
I wonder whether it is okay to use all of the methods (interfaces) exposed by the `Dataset` class when implementing an algorithm for the benchmark. I am accessing the file directly via the `get_dataset_fn` method instead of the `get_dataset_iterator` method, and I wonder whether that is a problem.
If so, there seems to be a problem with the implementation of the `get_dataset_fn` method for small datasets. In `get_dataset_fn`, if the original (1-billion-vector) file exists, the path of the original file is returned. That is reasonable for `get_dataset_iterator`, because it mmaps the file and uses only part of it. However, if `get_dataset_fn` is an externally exposed interface, it would be more appropriate to return the path of the actual small file. Alternatively, when calling `get_dataset_fn` for a small dataset that has no crop file, am I expected to use only part of the returned file?
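To make the concern concrete, here is a minimal paraphrase of the path-selection behavior described above (this is a sketch, not the repository's actual code; the file names are invented for illustration): whenever the original 1B file is present, its path shadows the crop file's path.

```python
import os

def get_dataset_fn(basedir, original_fn, crop_fn):
    """Simplified sketch of the described behavior (NOT the repo's code).

    original_fn: name of the full 1-billion-vector file (hypothetical name).
    crop_fn:     name of the cropped small-dataset file (hypothetical name).
    """
    original = os.path.join(basedir, original_fn)
    crop = os.path.join(basedir, crop_fn)
    if os.path.exists(original):
        # Even when the caller asked for a small dataset, the 1B file
        # path is returned here as soon as the original file exists.
        return original
    return crop
```

With only the crop file on disk this returns the crop path, but once the 1B file is also downloaded, every call returns the 1B path, which is the situation described in this issue.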
Question about dataset preparation
https://github.com/harsha-simhadri/big-ann-benchmarks/blob/8180e0e6ea5b8e36f76c5e34728116f0de23c05e/benchmark/main.py#L145 I wonder whether it can be assumed that the dataset file has already been downloaded in the actual evaluation.
My original second question concerned the build phase of the benchmark process: do I need to download the data myself before building? (In the benchmark code, skipdata is set to True.) But someone might also wonder whether the dataset vectors can be used at search time.
In evaluation, the dataset is not available. For T2, the index can store a copy of the data (or a compressed version) as part of the 1TB limit on index size. During index build the dataset is available, and it contributes to the total storage limit.
When I look at `get_dataset_fn`, it seems to me that it returns the actual file in case you are working with a cropped version: https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/benchmark/datasets.py#L268-L276. This should be safe to use. Maybe I'm misunderstanding your question, @gony0?
In general, you will have to take care of downloading the base vectors by explicitly running `python create_dataset.py --dataset ...`. However, it would be easy to add an argument to `run.py` so that it takes care of this; I will happily do that if this seems to be a common use case.
@maumueller Sorry for the late reply.

> it seems to me that it returns the actual file in case you are working with a cropped version.
If both a 1B-size file and a crop file exist, the path of the 1B-size file is always returned; see the code below. https://github.com/harsha-simhadri/big-ann-benchmarks/blob/59eab9fc9da096d50ea1f0149a5e0e6f8c141f32/benchmark/datasets.py#L283-L286 `ds_fn` holds the 1B-size file name and switches to the crop file name only when the 1B-size file does not exist.
I was describing a case where this code did not work well in my development environment (not the actual competition evaluation environment), where both the 1B dataset and the small dataset were present.
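One way the shadowing could be avoided, sketched below under the same simplified assumptions as before (this is a hypothetical variant for discussion, not a patch to the repository, and the file names are invented): prefer the crop file whenever it already exists on disk.

```python
import os

def get_dataset_fn_preferring_crop(basedir, original_fn, crop_fn):
    """Hypothetical variant (NOT the repo's code): return the crop file
    path when it exists, so a development machine that holds both the
    1B file and the cropped file resolves a small dataset to the small
    file instead of the full 1B file.
    """
    crop = os.path.join(basedir, crop_fn)
    if os.path.exists(crop):
        return crop
    # Fall back to the original 1B file (callers would then read only
    # the needed prefix, e.g. via mmap, as get_dataset_iterator does).
    return os.path.join(basedir, original_fn)
```

Whether this matters for the competition itself is unclear, since the evaluation environment presumably holds only one of the two files; the mismatch only shows up in a development setup like the one described above.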