
Support popular vector search datasets such as sift and gist as corpora that can be downloaded from a public repository

Open VijayanB opened this issue 1 year ago • 6 comments

Is your feature request related to a problem? Please describe.

Similar to the nyctaxi and geonames corpora, OpenSearch Benchmark could support some popular vector search datasets as corpora that are downloaded automatically and used in the vector search workload, instead of users downloading them manually every time for standard use cases. These corpora could also be used in nightly runs.

While setting up a private corpus repository for the vector search workload, I get an exception at https://github.com/opensearch-project/opensearch-benchmark/blob/9ffbec01149b3a75f58c3d49bb2f9ea39ca1fbd8/osbenchmark/utils/io.py#L578. This is expected, since vector search datasets are not standard UTF-8 files. This is a blocker for adding a custom dataset as a corpus to the vectorsearch workload.
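To illustrate why a text-oriented reader fails on such a file, here is a minimal sketch. It is not the code in `osbenchmark/utils/io.py`; the helper `can_read_as_utf8` and the file names are hypothetical. It only shows that a file starting with the HDF5 signature cannot be decoded as UTF-8, while a JSON-lines document file can.

```python
import os
import tempfile

# The 8-byte signature that starts an HDF5 file. The first byte (0x89)
# is not a valid UTF-8 start byte, so text decoding fails immediately.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def can_read_as_utf8(path):
    """Return True if every line of the file decodes as UTF-8 text."""
    try:
        with open(path, "rt", encoding="utf-8") as f:
            for _ in f:
                pass
        return True
    except UnicodeDecodeError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    # A typical text corpus: one JSON document per line.
    text_path = os.path.join(tmp, "documents.json")
    with open(text_path, "wb") as f:
        f.write(b'{"field": "value"}\n')

    # A stand-in for a binary vector dataset (hypothetical contents).
    binary_path = os.path.join(tmp, "vectors.hdf5")
    with open(binary_path, "wb") as f:
        f.write(HDF5_SIGNATURE + bytes(range(256)))

    text_ok = can_read_as_utf8(text_path)
    binary_ok = can_read_as_utf8(binary_path)

print(text_ok, binary_ok)
```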

Describe the solution you'd like

The prepare_file_offset_table method exists to optimize disk reads for large files by creating an offset table that maps line numbers to file offsets. Vector search datasets do not need this table, since bulk ingestion does not currently use it for them. We should make creation of the file offset table optional, which extends the corpus eligibility criteria to support multiple formats.
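A rough sketch of the idea, under stated assumptions: `build_offset_table` and `prepare_corpus_file` are hypothetical names, as is the `needs_offset_table` flag, and none of this is the actual implementation of prepare_file_offset_table. It shows a line-number-to-offset table for a text corpus and a preparation step that skips the table when the corpus does not need it.

```python
import os
import tempfile


def build_offset_table(path, every_n_lines=2):
    """Map sampled line numbers to the byte offset just past that line."""
    table = {}
    offset = 0
    with open(path, "rb") as f:
        for line_number, line in enumerate(f, start=1):
            offset += len(line)
            if line_number % every_n_lines == 0:
                table[line_number] = offset
    return table


def prepare_corpus_file(path, needs_offset_table=True):
    """Hypothetical preparation step: build the table only when requested."""
    if not needs_offset_table:
        # Binary corpora (for example HDF5 vector datasets) are never opened here.
        return None
    return build_offset_table(path)


with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "documents.json")
    with open(text_path, "wb") as f:
        f.write(b"a\nbb\nccc\ndddd\n")  # line lengths: 2, 3, 4, 5 bytes

    table = prepare_corpus_file(text_path)

# The path is never read because the table is not requested.
skipped = prepare_corpus_file("vectors.hdf5", needs_offset_table=False)

print(table, skipped)
```

With a table like this, a reader can seek directly to the offset recorded for a sampled line instead of scanning the file from the start, which is why it helps for large text files and is unnecessary for formats that are not read line by line.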

Describe alternatives you've considered

Manually download the files into a temporary directory and update the input file paths to point to the downloaded location.


VijayanB avatar Jan 19 '24 22:01 VijayanB

@VijayanB Can you show what running the workload with and without the corpus would look like to an end user?

jmazanec15 avatar Feb 08 '24 19:02 jmazanec15

The vector search workload requires three external inputs: train (vectors to index), test (vectors to search), and neighbors (ground truth). This works well for intermediate and advanced users who want to benchmark performance against a specific dataset. However, for new users or nightly runs, we can use some of the popular datasets to measure performance consistently across runs. In that case it would be easier, and a better user experience, if vectorsearch could simply run the workload the way nyctaxi does, instead of requiring those standard inputs to be downloaded every time.

VijayanB avatar Feb 08 '24 19:02 VijayanB

@VijayanB Makes sense. Could you show in the issue what users have to do now vs. what they could do once the change is added, i.e. CLI commands, etc.? Just want to make sure I understand the change in experience.

jmazanec15 avatar Feb 08 '24 19:02 jmazanec15

@jmazanec15 Sure. I was about to raise a PR for the workload. Let me share what the new param file will look like:

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/nmslib-index.json",
    "target_index_primary_shards": 1,
    "target_index_dimension": 128,
    "target_index_space_type": "l2",
    
    "target_index_bulk_size": 100,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_index_data_set_corpus": "sift-128-euclidean-train",
    "target_index_bulk_indexing_clients": 10,
    
    
    "target_index_max_num_segments": 10,
    "target_index_force_merge_timeout": 45.0,
    "hnsw_ef_search": 100,
    "hnsw_ef_construction": 100,
    "query_k": 100,

    "query_data_set_format": "hdf5",
    "query_data_set_corpus":"sift-128-euclidean-test",
    "neighbors_data_set_format": "hdf5",
    "neighbors_data_set_corpus":"sift-128-euclidean-neighbors",
    "query_count": 100
}

As a prerequisite, the corresponding corpus should be added to the workload.json file, as in other workloads. This doesn't break the existing behavior, where users can provide a file path.

VijayanB avatar Feb 08 '24 19:02 VijayanB

@VijayanB Thanks. I see, so a corpus will for the most part just represent a file while abstracting away its location, correct?

jmazanec15 avatar Feb 08 '24 19:02 jmazanec15

@jmazanec15 That's correct. A corpus can represent more than one file, but at the moment we don't support a folder or multiple files as input, so I added a check that rejects a corpus with more than one document (or file).

VijayanB avatar Feb 08 '24 19:02 VijayanB
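The corpus lookup discussed in the exchange above could be sketched as follows. This is an illustration only, not the code from the pull request: `resolve_corpus_file` is a hypothetical helper, the file names are made up, and the dictionaries assume the `name`, `documents`, and `source-file` keys used by corpora entries in workload.json.

```python
def resolve_corpus_file(corpora, corpus_name):
    """Return the single source file registered under corpus_name."""
    for corpus in corpora:
        if corpus["name"] == corpus_name:
            documents = corpus.get("documents", [])
            # Folders and multi-file corpora are not supported as input,
            # so a corpus must list exactly one document.
            if len(documents) != 1:
                raise ValueError(
                    f"corpus '{corpus_name}' must contain exactly one "
                    f"document, found {len(documents)}"
                )
            return documents[0]["source-file"]
    raise KeyError(f"unknown corpus '{corpus_name}'")


# Hypothetical corpora entries; the file names are placeholders.
corpora = [
    {
        "name": "sift-128-euclidean-train",
        "documents": [{"source-file": "sift-128-euclidean-train.hdf5"}],
    },
    {
        "name": "multi-file-corpus",
        "documents": [
            {"source-file": "part-1.hdf5"},
            {"source-file": "part-2.hdf5"},
        ],
    },
]

train_file = resolve_corpus_file(corpora, "sift-128-euclidean-train")

try:
    resolve_corpus_file(corpora, "multi-file-corpus")
    multi_file_rejected = False
except ValueError:
    multi_file_rejected = True

print(train_file, multi_file_rejected)
```

In this sketch a param such as target_index_bulk_index_data_set_corpus would carry the corpus name, and the workload would use the resolved file in place of a user-supplied path.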