opensearch-benchmark Support hdf5 files in bulk operation

Open finnroblin opened this issue 6 months ago • 0 comments

Description

Adds HDF5 file support for bulk ingestion. HDF5 files store datasets of vectors in a non-JSON format, so @VijayanB wrote separate parameter operations to send vectors to the bulk API. This PR adds vector support directly within OSB's bulk operation. This helps vector search benchmarking in two ways: the bulk operation supports additional features that vector workloads can now use, and OSB needs fewer vector search-specific features.
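To illustrate the idea, here is a minimal sketch of how a batch of vectors could be turned into a newline-delimited `_bulk` request body. It is not the code from this PR: the function names `vectors_to_bulk_body` and `batches` are hypothetical, the vectors are plain Python lists rather than rows read from an HDF5 dataset, and the increasing integer ids only mimic what the `generate-increasing-vector-ids` setting suggests.

```python
import json
from typing import Iterator, List, Sequence, Tuple


def vectors_to_bulk_body(
    vectors: Sequence[Sequence[float]],
    index: str,
    field: str,
    start_id: int = 0,
) -> str:
    """Build an NDJSON bulk body: one action line and one source line per vector."""
    lines: List[str] = []
    for offset, vector in enumerate(vectors):
        # Action line: index the document under an increasing id (assumed scheme).
        lines.append(json.dumps({"index": {"_index": index, "_id": str(start_id + offset)}}))
        # Source line: the vector stored under the target field.
        lines.append(json.dumps({field: list(vector)}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def batches(vectors: Sequence, bulk_size: int) -> Iterator[Tuple[int, Sequence]]:
    """Yield (start offset, slice) pairs of at most bulk_size vectors each."""
    for start in range(0, len(vectors), bulk_size):
        yield start, vectors[start:start + bulk_size]


if __name__ == "__main__":
    # Tiny in-memory stand-in for vectors that would come from an HDF5 file.
    sample = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    for start, batch in batches(sample, bulk_size=2):
        print(vectors_to_bulk_body(batch, "target_index", "target_field", start_id=start))
```

In a real run the rows would come from the HDF5 dataset (for example through a library such as h5py) instead of a hard-coded list, and each body would be sent to the cluster's bulk endpoint.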

Testing

[X] New functionality includes testing

Unit tests and manual verification. For the manual run, I modified the cohere 1,000-document corpus entry to include the information needed for the bulk operation.

Steps taken for manual verification:

Parameter file:

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/faiss-index.json",
    "target_index_primary_shards": 1,
    "target_index_dimension": 768,
    "target_index_space_type": "l2",
    
    "target_index_bulk_size": 5,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_indexing_clients": 10,
    "target_index_bulk_index_data_set_corpus": "cohere",
    
    "target_index_max_num_segments": 1,
    "target_index_force_merge_timeout": 300,
    "hnsw_ef_search": 100,
    "hnsw_ef_construction": 100,

    "query_k": 100,
    "query_body": {
         "docvalue_fields" : ["_id"],
         "stored_fields" : "_none_"
    },

    "query_data_set_format": "hdf5",
    "query_data_set_corpus": "cohere",
    "query_count": 100
}

Bulk schedule:

{
    "operation": {
        "name": "delete-target-index",
        "operation-type": "delete-index",
        "only-if-exists": true,
        "index": "{{ target_index_name | default('target_index') }}"
    }
},
{
    "operation": {
        "name": "create-target-index",
        "operation-type": "create-index",
        "index": "{{ target_index_name | default('target_index') }}"
    }
},
{
    "operation": {
        "name": "bulk",
        "operation-type": "bulk",
        "bulk-size": 5,
        "data_set_format": "{{ target_index_bulk_index_data_set_format | default('hdf5') }}",
        "source_format": "hdf5",
        "index": "target_index",
        "field": "target_field",
        "vector_dataset_context": "index",
        "corpora": ["cohere"]
    },
    "clients": {{ target_index_bulk_indexing_clients | default(1)}}
},
{
    "name" : "refresh-target-index",
    "operation" : "refresh-target-index"
}

Corpus changes:

"corpora": [
    {
      "name": "cohere",
      "base-url": "https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings",
      "target-index": "{{ target_index_name }}",
      "documents": [
        {
          "source-file": "documents-1k.hdf5.bz2",
          "source-format": "hdf5",
          "document-count": 1000,
          "generate-increasing-vector-ids": true,
          "id-field-name": "_id",
          "vector-field-name": "target_field"
        }
      ]
    },

bulk-procedure:

{
    "name": "bulk-procedure",
    "default": false,
    "schedule": [
       {{ benchmark.collect(parts="common/bulk-schedule.json") }},

       {{ benchmark.collect(parts="common/search-only-schedule.json") }}
    ]
},
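As a rough sanity check on the configuration above, the number of bulk requests follows from the corpus `document-count` (1000), the schedule's `bulk-size` (5), and `target_index_bulk_indexing_clients` (10). The sketch below does that arithmetic. The helper name `bulk_request_plan` is hypothetical, and the even split across clients is an assumption, not a statement about how OSB partitions the corpus.

```python
import math
from typing import Tuple


def bulk_request_plan(document_count: int, bulk_size: int, clients: int) -> Tuple[int, int]:
    """Return (total bulk requests, max requests per client) for an even split."""
    # Each request carries up to bulk_size documents; the last one may be smaller.
    total_requests = math.ceil(document_count / bulk_size)
    # Assumption: requests are divided as evenly as possible across clients.
    per_client = math.ceil(total_requests / clients)
    return total_requests, per_client


if __name__ == "__main__":
    total, per_client = bulk_request_plan(document_count=1000, bulk_size=5, clients=10)
    print(f"{total} bulk requests, at most {per_client} per client")
```

With the values used here this works out to 200 bulk requests, or 20 per client under the even-split assumption.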

Result:

(.venv) finnrobl@80a9970f4597 opensearch-benchmark % export PARAMS=/Users/finnrobl/Code/opensearch-benchmark-workloads/vectorsearch/params/bulk-params.json
(.venv) finnrobl@80a9970f4597 opensearch-benchmark % opensearch-benchmark execute-test --target-hosts $ENDPOINT \
    --workload-path /Users/finnrobl/Code/opensearch-benchmark-workloads/vectorsearch --workload-params $PARAMS \
    --pipeline benchmark-only \
    --kill-running-processes \
    --test-procedure bulk-procedure

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: e8307702-7dda-4a30-8b87-6f2fc1834ecb
[INFO] Executing test with workload [vectorsearch], test_procedure [bulk-procedure] and provision_config_instance ['external'] with version [3.0.0-SNAPSHOT].

[WARNING] merges_total_time is 16 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 7 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 63 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 120 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-target-index                                                    [100% done]
Running create-target-index                                                    [100% done]
Running bulk                                                                   [100% done]
Running refresh-target-index                                                   [100% done]
Running warmup-indices                                                         [100% done]
Running prod-queries                                                           [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |           Task |       Value |   Unit |
|---------------------------------------------------------------:|---------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |                |   0.0371833 |    min |
|             Min cumulative indexing time across primary shards |                |           0 |    min |
|          Median cumulative indexing time across primary shards |                | 0.000116667 |    min |
|             Max cumulative indexing time across primary shards |                |   0.0370667 |    min |
|            Cumulative indexing throttle time of primary shards |                |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |                |           0 |    min |
| Median cumulative indexing throttle time across primary shards |                |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |                |           0 |    min |
|                        Cumulative merge time of primary shards |                | 0.000266667 |    min |
|                       Cumulative merge count of primary shards |                |           1 |        |
|                Min cumulative merge time across primary shards |                |           0 |    min |
|             Median cumulative merge time across primary shards |                |           0 |    min |
|                Max cumulative merge time across primary shards |                | 0.000266667 |    min |
|               Cumulative merge throttle time of primary shards |                |           0 |    min |
|       Min cumulative merge throttle time across primary shards |                |           0 |    min |
|    Median cumulative merge throttle time across primary shards |                |           0 |    min |
|       Max cumulative merge throttle time across primary shards |                |           0 |    min |
|                      Cumulative refresh time of primary shards |                |  0.00468333 |    min |
|                     Cumulative refresh count of primary shards |                |          12 |        |
|              Min cumulative refresh time across primary shards |                |           0 |    min |
|           Median cumulative refresh time across primary shards |                |     0.00105 |    min |
|              Max cumulative refresh time across primary shards |                |  0.00363333 |    min |
|                        Cumulative flush time of primary shards |                |       0.002 |    min |
|                       Cumulative flush count of primary shards |                |           2 |        |
|                Min cumulative flush time across primary shards |                |           0 |    min |
|             Median cumulative flush time across primary shards |                |           0 |    min |
|                Max cumulative flush time across primary shards |                |       0.002 |    min |
|                                        Total Young Gen GC time |                |        0.01 |      s |
|                                       Total Young Gen GC count |                |           1 |        |
|                                          Total Old Gen GC time |                |           0 |      s |
|                                         Total Old Gen GC count |                |           0 |        |
|                                                     Store size |                |   0.0173898 |     GB |
|                                                  Translog size |                |   0.0150675 |     GB |
|                                         Heap used for segments |                |           0 |     MB |
|                                       Heap used for doc values |                |           0 |     MB |
|                                            Heap used for terms |                |           0 |     MB |
|                                            Heap used for norms |                |           0 |     MB |
|                                           Heap used for points |                |           0 |     MB |
|                                    Heap used for stored fields |                |           0 |     MB |
|                                                  Segment count |                |          10 |        |
|                                                 Min Throughput |           bulk |     1640.19 | docs/s |
|                                                Mean Throughput |           bulk |     1640.19 | docs/s |
|                                              Median Throughput |           bulk |     1640.19 | docs/s |
|                                                 Max Throughput |           bulk |     1640.19 | docs/s |
|                                        50th percentile latency |           bulk |     17.3579 |     ms |
|                                        90th percentile latency |           bulk |     45.1002 |     ms |
|                                        99th percentile latency |           bulk |     83.7313 |     ms |
|                                       100th percentile latency |           bulk |      88.521 |     ms |
|                                   50th percentile service time |           bulk |     17.3579 |     ms |
|                                   90th percentile service time |           bulk |     45.1002 |     ms |
|                                   99th percentile service time |           bulk |     83.7313 |     ms |
|                                  100th percentile service time |           bulk |      88.521 |     ms |
|                                                     error rate |           bulk |           0 |      % |
|                                                 Min Throughput | warmup-indices |       36.24 |  ops/s |
|                                                Mean Throughput | warmup-indices |       36.24 |  ops/s |
|                                              Median Throughput | warmup-indices |       36.24 |  ops/s |
|                                                 Max Throughput | warmup-indices |       36.24 |  ops/s |
|                                       100th percentile latency | warmup-indices |     27.4253 |     ms |
|                                  100th percentile service time | warmup-indices |     27.4253 |     ms |
|                                                     error rate | warmup-indices |           0 |      % |
|                                                 Min Throughput |   prod-queries |       149.9 |  ops/s |
|                                                Mean Throughput |   prod-queries |       149.9 |  ops/s |
|                                              Median Throughput |   prod-queries |       149.9 |  ops/s |
|                                                 Max Throughput |   prod-queries |       149.9 |  ops/s |
|                                        50th percentile latency |   prod-queries |     3.36225 |     ms |
|                                        90th percentile latency |   prod-queries |      4.6824 |     ms |
|                                        99th percentile latency |   prod-queries |     58.3903 |     ms |
|                                       100th percentile latency |   prod-queries |     109.023 |     ms |
|                                   50th percentile service time |   prod-queries |     3.36225 |     ms |
|                                   90th percentile service time |   prod-queries |      4.6824 |     ms |
|                                   99th percentile service time |   prod-queries |     58.3903 |     ms |
|                                  100th percentile service time |   prod-queries |     109.023 |     ms |
|                                                     error rate |   prod-queries |           0 |      % |
|                                                  Mean recall@k |   prod-queries |        0.37 |        |
|                                                  Mean recall@1 |   prod-queries |        0.07 |        |


--------------------------------
[INFO] SUCCESS (took 63 seconds)
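
For readers unfamiliar with the recall rows at the end of the table, the sketch below shows one common definition of recall@k: the fraction of the true top-k neighbors that appear among the first k returned ids. This is an illustrative definition only; it is not taken from OSB's source, and OSB's own calculation may differ in detail. The function name `recall_at_k` and the sample ids are hypothetical.

```python
from typing import Sequence


def recall_at_k(retrieved_ids: Sequence[str], true_neighbor_ids: Sequence[str], k: int) -> float:
    """Fraction of the true top-k neighbors found in the first k retrieved ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Ground truth is assumed to be sorted nearest-first.
    truth = set(true_neighbor_ids[:k])
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in truth)
    return hits / k


if __name__ == "__main__":
    retrieved = ["7", "2", "9", "4"]
    ground_truth = ["2", "4", "5", "6"]
    print(recall_at_k(retrieved, ground_truth, k=4))
```

A mean recall is then the average of this value over all queries in the query set.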

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Aug 17 '24 00:08 finnroblin
