
Questions about the Dataset class interface and dataset preparation

Question about the Dataset class interface

I wonder whether it is okay to use all of the methods (interfaces) exposed by the Dataset class when implementing an algorithm for the benchmark.

I am accessing the file directly via the get_dataset_fn method instead of going through the get_dataset_iterator method, and I wonder whether that is a problem. Separately, there seems to be a bug in the get_dataset_fn implementation for small datasets.
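For concreteness, here is a minimal sketch of the two access paths I mean (the dataset name and batch size are placeholders for illustration, and I am assuming the method signatures in benchmark/datasets.py):

```python
from benchmark.datasets import DATASETS

ds = DATASETS['bigann-10M']()  # example dataset name, chosen for illustration

# Path 1: stream the base vectors in blocks through the iterator.
for block in ds.get_dataset_iterator(bs=10**6):
    pass  # e.g., feed each block to an index builder

# Path 2: take the file path and read it directly.
fn = ds.get_dataset_fn()
print(fn)
```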

In the get_dataset_fn method, if the original (1-billion-point) file exists, the path of the original file is returned. That is reasonable inside get_dataset_iterator, because it uses mmap and reads only a prefix of the original file. However, if get_dataset_fn is an externally exposed interface, it would be more appropriate for it to return the path of the actual small file. Alternatively, when calling get_dataset_fn on a small dataset for which no crop file exists, am I expected to read only a prefix of the original file myself?
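For example, this is roughly what reading only a prefix myself would look like, assuming the competition's binary layout (a header of two uint32 values, num_points and dim, followed by the vector data; the float32 dtype below is just a placeholder, since e.g. bigann stores uint8):

```python
import numpy as np

def read_prefix(fn, nb, dtype=np.float32):
    # Read the header: total number of points and the dimension.
    with open(fn, "rb") as f:
        total, dim = np.fromfile(f, dtype=np.uint32, count=2)
    assert nb <= total, "requested prefix is larger than the file"
    # Memory-map only the first nb vectors instead of loading all 1B points.
    return np.memmap(fn, dtype=dtype, mode="r",
                     offset=8, shape=(nb, int(dim)))
```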


Question about dataset preparation

Regarding https://github.com/harsha-simhadri/big-ann-benchmarks/blob/8180e0e6ea5b8e36f76c5e34728116f0de23c05e/benchmark/main.py#L145: can it be assumed that the dataset file has already been downloaded in the actual evaluation?

gony-noreply avatar Sep 24 '21 13:09 gony-noreply

My original second question was about the build phase of the benchmark process: whether I should prepare (download) the data myself for the build (in the benchmark code, skipdata is set to True).

But someone might also wonder whether it is possible to use the dataset vectors at search time.

gony-noreply avatar Sep 24 '21 13:09 gony-noreply

In evaluation, the dataset is not available; in index build it is. For T2, the index can store a copy of the data (or a compressed version) as part of the 1TB limit on index size, and that copy contributes to the total storage limit.

harsha-simhadri avatar Sep 24 '21 18:09 harsha-simhadri

When I look at get_dataset_fn, it seems to me that it returns the actual file in case you are working with a cropped version.

https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/benchmark/datasets.py#L268-L276. This should be safe to use. Maybe I'm misunderstanding your question, @gony0?

In general, you will have to take care of downloading the base vectors yourself by explicitly running python create_dataset.py --dataset .... However, it would be easy to add an argument to run.py so that it takes care of this. I will happily do that if this seems to be a common use case.
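For now, a small wrapper along these lines can make the download step explicit in a local pipeline (the dataset name is just an example):

```python
import subprocess

# Prepare the base vectors before running the benchmark locally.
subprocess.run(
    ["python", "create_dataset.py", "--dataset", "bigann-10M"],
    check=True,
)
```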

maumueller avatar Sep 24 '21 20:09 maumueller

@maumueller Sorry for the late reply

"it seems to me that it returns the actual file in case you are working with a cropped version."

If both a 1B-scale file and a crop file exist, the path of the 1B-scale file is always returned; see https://github.com/harsha-simhadri/big-ann-benchmarks/blob/59eab9fc9da096d50ea1f0149a5e0e6f8c141f32/benchmark/datasets.py#L283-L286. ds_fn holds the name of the 1B-scale file, and the crop file name is only constructed when the 1B-scale file is absent.
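To make the precedence concrete, here is a paraphrase (not the verbatim code) of the behaviour I mean; the crop-file naming is my shorthand for illustration:

```python
import os

def get_dataset_fn(basedir, ds_fn, nb, full_nb=10**9):
    fn = os.path.join(basedir, ds_fn)   # ds_fn names the 1B-scale file
    if os.path.exists(fn):
        return fn                       # returned even when a crop file also exists
    if nb != full_nb:
        return fn + f".crop_nb_{nb}"    # crop name derived only on this path
    raise FileNotFoundError(fn)
```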

I was describing a case where this code did not work well in my development environment (not the actual competition evaluation environment), where both the 1B dataset and a small dataset exist.

gony-noreply avatar Oct 06 '21 01:10 gony-noreply