opensearch-benchmark
[FEATURE] Softer validation of corpora workload parameters for vectorsearch benchmark
Is your feature request related to a problem? Please describe
Currently each workload corpus requires a target index parameter when there are multiple indices. However, the vector search bulk ingest workload operation (bulk-vector-data-set) does not use this target index when ingesting data. Instead, users specify the target index as a parameter in the custom-bulk ingest operation in their test procedure.
I'm opening this issue because the target-index parameter is required at workload validation time despite being unnecessary for vector search workloads. As a result, the VS workload.json must contain unused parameters.
Describe the solution you'd like
One solution is to make the target-index corpora parameter optional at validation time. Perhaps it's also possible to enforce that either the parameter is specified in the corpora in workload.json or that bulk-vector-data-set is used in a test procedure.
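To make the proposed rule concrete, here is a rough sketch of what the softer check could look like. This is not the existing validation code in opensearch-benchmark: the function names, the exception class, and the simplified workload layout (a parsed workload.json dict with "indices", "corpora" and "test_procedures" keys, and operations inlined in the schedule) are all assumptions made for illustration.

```python
# Hypothetical sketch of the proposed validation rule; names and the
# simplified workload layout are assumptions, not the real loader code.
VECTOR_BULK_OPERATION = "bulk-vector-data-set"


class WorkloadValidationError(Exception):
    """Raised when a corpus is missing a parameter it actually needs."""


def uses_vector_bulk(test_procedures):
    """Return True if any scheduled task uses the vector bulk operation type."""
    for procedure in test_procedures:
        for task in procedure.get("schedule", []):
            operation = task.get("operation", {})
            # Only inline operation definitions are inspected in this sketch.
            if (isinstance(operation, dict)
                    and operation.get("operation-type") == VECTOR_BULK_OPERATION):
                return True
    return False


def validate_corpora(workload):
    """Require target-index on corpora only when something would read it."""
    if len(workload.get("indices", [])) <= 1:
        return  # a single index is an unambiguous target
    if uses_vector_bulk(workload.get("test_procedures", [])):
        return  # the operation carries its own target index
    for corpus in workload.get("corpora", []):
        for document_set in corpus.get("documents", []):
            if "target-index" not in document_set:
                raise WorkloadValidationError(
                    f"corpus '{corpus.get('name')}' needs target-index "
                    f"unless {VECTOR_BULK_OPERATION} is used"
                )
```

With this rule, a VS workload with several indices would pass validation without target-index in its corpora, while a non-VS workload with several indices would still be rejected if the parameter is missing.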
Describe alternatives you've considered
I don't have all the context for why there are two ways of specifying the target index for ingesting data, but I believe it's due to vector datasets being in HDF5 format and the normal bulk operation requiring JSON documents.
Additional context
There is another issue stemming from VS ingestion being different from normal ingestion: Issue 317 in the workloads repo requests a VS feature that's available in non-VS workloads. I think the lack of feature parity is due to the vector-bulk operation being different from the normal bulk operation.