[Bug]: Search raises error `attempt #1:fail to get shard leaders from QueryCoord: no replica available` after upgrade or reinstall when using Kafka as MQ
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: v2.1.0 --> 2.1.0-20220830-a926a7d2
- Deployment mode(standalone or cluster): cluster with Kafka as MQ
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.2.0.dev23
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
[2022-08-31T03:08:06.811Z] + python3 scripts/second_recall_test.py --host 10.101.55.165
[2022-08-31T03:08:18.964Z] RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2022-08-31T03:08:18.964Z] attempt #1:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #2:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #3:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #4:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #5:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #6:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #7:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #8:context deadline exceeded
[2022-08-31T03:08:18.964Z] )>, <Time:{'RPC start': '2022-08-31 03:08:08.083532', 'RPC error': '2022-08-31 03:08:18.308940'}>
[2022-08-31T03:08:18.964Z]
[2022-08-31T03:08:18.964Z] Search...
[2022-08-31T03:08:18.964Z] Traceback (most recent call last):
[2022-08-31T03:08:18.964Z] File "scripts/second_recall_test.py", line 64, in <module>
[2022-08-31T03:08:18.964Z] search_test(host)
[2022-08-31T03:08:18.964Z] File "scripts/second_recall_test.py", line 34, in search_test
[2022-08-31T03:08:18.964Z] test[:nq], "float_vector", search_params, topK, output_fields=["int64"], timeout=TIMEOUT
[2022-08-31T03:08:18.964Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/orm/collection.py", line 718, in search
[2022-08-31T03:08:18.964Z] partition_names, output_fields, round_decimal, timeout=timeout, schema=schema_dict, **kwargs)
[2022-08-31T03:08:18.964Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 101, in handler
[2022-08-31T03:08:18.964Z] raise e
[2022-08-31T03:08:18.964Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 97, in handler
[2022-08-31T03:08:18.964Z] return func(*args, **kwargs)
[2022-08-31T03:08:18.964Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 127, in handler
[2022-08-31T03:08:18.964Z] ret = func(self, *args, **kwargs)
[2022-08-31T03:08:18.964Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 79, in handler
[2022-08-31T03:08:18.964Z] raise e
[2022-08-31T03:08:18.964Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 50, in handler
[2022-08-31T03:08:18.964Z] return func(self, *args, **kwargs)
[2022-08-31T03:08:18.964Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 473, in search
[2022-08-31T03:08:18.964Z] return self._execute_search_requests(requests, timeout, **_kwargs)
[2022-08-31T03:08:18.964Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 437, in _execute_search_requests
[2022-08-31T03:08:18.964Z] raise pre_err
[2022-08-31T03:08:18.964Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 428, in _execute_search_requests
[2022-08-31T03:08:18.964Z] raise MilvusException(response.status.error_code, response.status.reason)
[2022-08-31T03:08:18.964Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2022-08-31T03:08:18.964Z] attempt #1:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #2:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #3:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #4:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #5:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #6:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #7:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #8:context deadline exceeded
[2022-08-31T03:08:18.964Z] )>
script returned exit code 1
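For triage, the per-attempt lines in the error above can be tallied to show how many retries failed for each reason. The following is only a minimal sketch: `summarize_attempts` and `ATTEMPT_RE` are hypothetical names introduced for illustration (not part of pymilvus or the test script), and it assumes the message keeps the `attempt #N:<reason>` layout seen in this log.

```python
import re
from collections import Counter

# Assumed line layout (taken from the log above): "attempt #N:<reason>",
# optionally preceded by a Jenkins timestamp.
ATTEMPT_RE = re.compile(r"attempt #(\d+):\s*(.+?)\s*$")


def summarize_attempts(log_text):
    """Count how many search attempts failed with each reason."""
    reasons = Counter()
    for line in log_text.splitlines():
        match = ATTEMPT_RE.search(line)
        if match:
            reasons[match.group(2)] += 1
    return reasons


# Three lines copied from the log in this report.
sample = """\
[2022-08-31T03:08:18.964Z] attempt #1:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #2:fail to get shard leaders from QueryCoord: no replica available
[2022-08-31T03:08:18.964Z] attempt #8:context deadline exceeded
"""

for reason, count in summarize_attempts(sample).most_common():
    print(count, reason)
```

Run against the full log, this separates the retries that failed with `no replica available` from the final attempt that hit `context deadline exceeded`.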
Expected Behavior
All test cases passed
Steps To Reproduce
No response
Milvus Log
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release/detail/deploy_test_kafka_for_release/9/pipeline
log: artifacts-cluster-upgrade-9-server-logs.tar.gz
artifacts-cluster-upgrade-9-pytest-logs.tar.gz
Anything else?
test script: https://github.com/milvus-io/milvus/blob/master/tests/python_client/deploy/scripts/second_recall_test.py
collection name: sift_128_euclidean
/assign @jiaoew1991 /unassign
@zhuwenxing is it the same root cause as with reinstall/upgrade?
Not sure. The error message is not the same, so it needs further investigation by the devs.
version: 2.1.0-20220902-853793a
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release/detail/deploy_test_for_release/22/pipeline
@zhuwenxing the log for the second deploy is incomplete. Also, Milvus standalone crashed at the very start of the second deploy:
[2022/09/05 03:05:47.281 +00:00] [DEBUG] [server/rocksmq_impl.go:157] ["Start rocksmq "] ["max proc"=64] [parallism=4] ["lru cache"=4294967296]
panic: IO error: While lock file: /var/lib/milvus/rdb_data_meta_kv/LOCK: Resource temporarily unavailable
goroutine 1 [running]:
github.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).Run(0xc000c82090, 0xc000012001, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:373 +0x11b5
github.com/milvus-io/milvus/cmd/milvus.(*run).execute(0xc0001f04e0, 0xc000050090, 0x3, 0x3, 0xc000666120)
/go/src/github.com/milvus-io/milvus/cmd/milvus/run.go:111 +0x496
github.com/milvus-io/milvus/cmd/milvus.RunMilvus(0xc000050090, 0x3, 0x3)
/go/src/github.com/milvus-io/milvus/cmd/milvus/milvus.go:60 +0x162
main.main()
/go/src/github.com/milvus-io/milvus/cmd/main.go:26 +0x45
The next several runs succeeded (runs 27, 28, 30, 32). Run 33 is still running, with two pipelines passed.
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release/detail/deploy_test_for_release/50/pipeline
log: artifacts-cluster-upgrade-50-pytest-logs.tar.gz artifacts-cluster-upgrade-50-server-logs (1).tar.gz
Kafka version, reinstall
image version 2.1.0-20220908-ea0f57e
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release/detail/deploy_test_kafka_for_release/54/pipeline
log:
artifacts-cluster-reinstall-54-server-logs.tar.gz
artifacts-cluster-reinstall-54-pytest-logs.tar.gz
@congqixia @jiaoew1991 Please help take a look at the latest failed job.
Kafka version
cluster reinstall 2.1.0-20220913-3c3ba55
failed job:
https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release/detail/deploy_test_kafka_for_release/96/pipeline/301
log:
artifacts-cluster-reinstall-96-server-logs.tar.gz
artifacts-cluster-reinstall-96-pytest-logs.tar.gz
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release/detail/deploy_test_kafka_for_release/131/pipeline/303
log: artifacts-cluster-reinstall-131-server-logs.tar.gz artifacts-cluster-reinstall-131-pytest-logs.tar.gz
collection name: sift_128_euclidean
[2022-09-20T06:25:07.201Z] + python3 scripts/second_recall_test.py --host 10.101.4.244
[2022-09-20T06:25:19.337Z] RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2022-09-20T06:25:19.337Z] attempt #1:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.337Z] attempt #2:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.337Z] attempt #3:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.337Z] attempt #4:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.337Z] attempt #5:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.338Z] attempt #6:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.338Z] attempt #7:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.338Z] attempt #8:context deadline exceeded
[2022-09-20T06:25:19.338Z] )>, <Time:{'RPC start': '2022-09-20 06:25:08.447917', 'RPC error': '2022-09-20 06:25:18.591776'}>
[2022-09-20T06:25:19.338Z]
[2022-09-20T06:25:19.338Z] Search...
[2022-09-20T06:25:19.338Z] Traceback (most recent call last):
[2022-09-20T06:25:19.338Z] File "scripts/second_recall_test.py", line 64, in <module>
[2022-09-20T06:25:19.338Z] search_test(host)
[2022-09-20T06:25:19.338Z] File "scripts/second_recall_test.py", line 34, in search_test
[2022-09-20T06:25:19.338Z] test[:nq], "float_vector", search_params, topK, output_fields=["int64"], timeout=TIMEOUT
[2022-09-20T06:25:19.338Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/orm/collection.py", line 718, in search
[2022-09-20T06:25:19.338Z] partition_names, output_fields, round_decimal, timeout=timeout, schema=schema_dict, **kwargs)
[2022-09-20T06:25:19.338Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 101, in handler
[2022-09-20T06:25:19.338Z] raise e
[2022-09-20T06:25:19.338Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 97, in handler
[2022-09-20T06:25:19.338Z] return func(*args, **kwargs)
[2022-09-20T06:25:19.338Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 127, in handler
[2022-09-20T06:25:19.338Z] ret = func(self, *args, **kwargs)
[2022-09-20T06:25:19.338Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 79, in handler
[2022-09-20T06:25:19.338Z] raise e
[2022-09-20T06:25:19.338Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 50, in handler
[2022-09-20T06:25:19.338Z] return func(self, *args, **kwargs)
[2022-09-20T06:25:19.338Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 472, in search
[2022-09-20T06:25:19.338Z] return self._execute_search_requests(requests, timeout, **_kwargs)
[2022-09-20T06:25:19.338Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 436, in _execute_search_requests
[2022-09-20T06:25:19.338Z] raise pre_err
[2022-09-20T06:25:19.338Z] File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 427, in _execute_search_requests
[2022-09-20T06:25:19.338Z] raise MilvusException(response.status.error_code, response.status.reason)
[2022-09-20T06:25:19.338Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2022-09-20T06:25:19.338Z] attempt #1:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.338Z] attempt #2:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.338Z] attempt #3:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.338Z] attempt #4:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.338Z] attempt #5:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.338Z] attempt #6:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.338Z] attempt #7:fail to get shard leaders from QueryCoord: no replica available
[2022-09-20T06:25:19.338Z] attempt #8:context deadline exceeded
[2022-09-20T06:25:19.338Z] )>
script returned exit code 1
/assign @zhuwenxing /unassign
It was not reproduced when upgrading from 2.1.4 to master or when reinstalling master, so closing this issue.