vector-db-benchmark
A backoff strategy should be used for rate-limited errors on Milvus, or the batch_size config should be reduced
It's common to see the following type of error on non-local setups:
pymilvus.exceptions.MilvusException: <MilvusException: (code=49, message=Retry run out of 10 retry times, message=request is rejected by grpc RateLimiter middleware, please retry later, req: /milvus.proto.milvus.MilvusService/Insert)>
Full traceback:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/ubuntu/vector-db-benchmark/engine/base_client/upload.py", line 90, in _upload_batch
cls.upload_batch(ids, vectors, metadata)
File "/home/ubuntu/vector-db-benchmark/engine/clients/milvus/upload.py", line 68, in upload_batch
cls.upload_with_backoff(field_values, ids, vectors)
File "/usr/local/lib/python3.10/dist-packages/backoff/_sync.py", line 105, in retry
ret = target(*args, **kwargs)
File "/home/ubuntu/vector-db-benchmark/engine/clients/milvus/upload.py", line 75, in upload_with_backoff
cls.collection.insert([ids, vectors] + field_values)
File "/usr/local/lib/python3.10/dist-packages/pymilvus/orm/collection.py", line 443, in insert
res = conn.batch_insert(self._name, entities, partition_name,
File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 109, in handler
raise e
File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 105, in handler
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 136, in handler
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 80, in handler
raise MilvusException(e.code, f"{timeout_msg}, message={e.message}") from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=49, message=Retry run out of 10 retry times, message=request is rejected by grpc RateLimiter middleware, please retry later, req: /milvus.proto.milvus.MilvusService/Insert)>
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/vector-db-benchmark/run.py", line 79, in <module>
app()
File "/home/ubuntu/vector-db-benchmark/run.py", line 74, in run
raise e
File "/home/ubuntu/vector-db-benchmark/run.py", line 52, in run
client.run_experiment(dataset, skip_upload, skip_search)
File "/home/ubuntu/vector-db-benchmark/engine/base_client/client.py", line 70, in run_experiment
upload_stats = self.uploader.upload(
File "/home/ubuntu/vector-db-benchmark/engine/base_client/upload.py", line 56, in upload
latencies = list(
File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
raise value
pymilvus.exceptions.MilvusException: <MilvusException: (code=49, message=Retry run out of 10 retry times, message=request is rejected by grpc RateLimiter middleware, please retry later, req: /milvus.proto.milvus.MilvusService/Insert)>
Given that the Milvus configs don't specify a batch_size, the default of 64 vectors is used, which seems to trigger the error above consistently. I suggest either respecting the API rate limits with a backoff strategy or reducing the batch size.
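For illustration, a minimal sketch of the backoff option, using the backoff library that already appears in the traceback above; the decorator parameters, function name, and argument layout are assumptions, not the actual code in engine/clients/milvus/upload.py:

```python
# Minimal sketch (assumed names/parameters): retry a Milvus batch insert with
# exponential backoff when the gRPC RateLimiter rejects the request.
import backoff
from pymilvus import Collection, MilvusException


@backoff.on_exception(backoff.expo, MilvusException, max_time=300)
def insert_with_backoff(collection: Collection, ids, vectors, field_values):
    # pymilvus surfaces "request is rejected by grpc RateLimiter middleware"
    # as a MilvusException; backoff.expo grows the delay between attempts.
    return collection.insert([ids, vectors] + field_values)
```

Reducing the batch size is the simpler alternative, at the cost of more insert round trips per upload.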
errors on non-local setups
Are you testing Zilliz's cloud offering?
Hello, are you running the test on zilliz cloud? Can you provide the instance specifications you used?
Hello, are you running the test on zilliz cloud?
@wangting0128 yes.
Can you provide the instance specifications you used?
Sure. I used the Dedicated Performance Optimized CU size 1 (the issue happens on larger CUs as well). I confirmed yesterday that it still happens:
MILVUS_USER="db_admin" MILVUS_PASS="<...>" MILVUS_PORT=<...> python3 run.py --engines milvus-m-* --datasets gist-960-euclidean --host <...>
(...)
(...)
Running experiment: milvus-m-16-ef-64 - gist-960-euclidean
established connection
/home/ubuntu/vector-db-benchmark/datasets/gist-960-euclidean/gist-960-euclidean.hdf5 already exists
Experiment stage: Configure
Experiment stage: Upload
644800it [09:51, 1120.07it/s][batch_insert] retry:8, cost: 3.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, Broken pipe>
649664it [09:55, 1204.76it/s][batch_insert] retry:9, cost: 3.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, Broken pipe>
1000000it [15:16, 1090.80it/s]
Upload time: 919.8683542869985
Total import time: 1126.062087302911
Experiment stage: Search
(...)
Notice that after around 10 minutes of ingestion, Zilliz Cloud "breaks" and we need 8 and 9 retries to complete that batch insert. I've preserved the full log of all variations in case we need it in the future.
@wangting0128 note that I've added backoff-strategy support to the tool to ensure we can properly handle these issues and benchmark under the correct conditions. I'll open a PR just for the Zilliz Cloud benchmarking later today.
Hi, sorry for only replying to your message now.
Based on your problem description, I have some information to share with you~:
- Index and search parameters used by Zilliz Cloud: you can refer to the official documentation. Correspondingly, we have provided a PR with a set of configurations; please review it. PR: the provided configuration file only contains one set of milvus-cloud configuration, because I don't know whether you are running the same configuration on all datasets :>
- There are insertion limits for instances of different specifications in Zilliz Cloud, so you may encounter insertion errors. For the specific limit values, please refer to the documentation (a client-side throttling sketch follows after this list).
- Based on the instance specifications described in the blog, we recommend that you use the Dedicated Performance Optimized CU size 4 of Zilliz Cloud :>
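For illustration only, a minimal client-side throttling sketch that spaces out batch inserts so the request rate stays below a chosen insertion limit; the class name and rate value are assumptions and are not taken from the Zilliz Cloud documentation or from vector-db-benchmark:

```python
# Minimal sketch (assumed names/values): keep consecutive batch inserts at or
# below a configured requests-per-second rate.
import time


class InsertThrottle:
    def __init__(self, max_requests_per_second: float):
        self.min_interval = 1.0 / max_requests_per_second
        self.last_call = 0.0

    def wait(self) -> None:
        # Sleep just long enough so that the time between calls never drops
        # below min_interval.
        now = time.monotonic()
        delay = self.min_interval - (now - self.last_call)
        if delay > 0:
            time.sleep(delay)
        self.last_call = time.monotonic()


# Usage (illustrative): call throttle.wait() before each collection.insert(...)
throttle = InsertThrottle(max_requests_per_second=5.0)
```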
If you have any further questions, please feel free to contact us. Thank you very much~