[Bug]: Search raise error `attempt #1:fail to get shard leaders from QueryCoord: no replica available` after upgrade or reinstall when using Kafka as MQ

Open zhuwenxing opened this issue 2 years ago • 10 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: v2.1.0 --> 2.1.0-20220830-a926a7d2
- Deployment mode(standalone or cluster): cluster with Kafka as MQ
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.2.0.dev23
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2022-08-31T03:08:06.811Z] + python3 scripts/second_recall_test.py --host 10.101.55.165

[2022-08-31T03:08:18.964Z] RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2022-08-31T03:08:18.964Z] attempt #1:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #2:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #3:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #4:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #5:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #6:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #7:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #8:context deadline exceeded

[2022-08-31T03:08:18.964Z] )>, <Time:{'RPC start': '2022-08-31 03:08:08.083532', 'RPC error': '2022-08-31 03:08:18.308940'}>

[2022-08-31T03:08:18.964Z] 

[2022-08-31T03:08:18.964Z] Search...

[2022-08-31T03:08:18.964Z] Traceback (most recent call last):

[2022-08-31T03:08:18.964Z]   File "scripts/second_recall_test.py", line 64, in <module>

[2022-08-31T03:08:18.964Z]     search_test(host)

[2022-08-31T03:08:18.964Z]   File "scripts/second_recall_test.py", line 34, in search_test

[2022-08-31T03:08:18.964Z]     test[:nq], "float_vector", search_params, topK, output_fields=["int64"], timeout=TIMEOUT

[2022-08-31T03:08:18.964Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/orm/collection.py", line 718, in search

[2022-08-31T03:08:18.964Z]     partition_names, output_fields, round_decimal, timeout=timeout, schema=schema_dict, **kwargs)

[2022-08-31T03:08:18.964Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 101, in handler

[2022-08-31T03:08:18.964Z]     raise e

[2022-08-31T03:08:18.964Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 97, in handler

[2022-08-31T03:08:18.964Z]     return func(*args, **kwargs)

[2022-08-31T03:08:18.964Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 127, in handler

[2022-08-31T03:08:18.964Z]     ret = func(self, *args, **kwargs)

[2022-08-31T03:08:18.964Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 79, in handler

[2022-08-31T03:08:18.964Z]     raise e

[2022-08-31T03:08:18.964Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 50, in handler

[2022-08-31T03:08:18.964Z]     return func(self, *args, **kwargs)

[2022-08-31T03:08:18.964Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 473, in search

[2022-08-31T03:08:18.964Z]     return self._execute_search_requests(requests, timeout, **_kwargs)

[2022-08-31T03:08:18.964Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 437, in _execute_search_requests

[2022-08-31T03:08:18.964Z]     raise pre_err

[2022-08-31T03:08:18.964Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 428, in _execute_search_requests

[2022-08-31T03:08:18.964Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2022-08-31T03:08:18.964Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2022-08-31T03:08:18.964Z] attempt #1:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #2:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #3:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #4:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #5:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #6:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #7:fail to get shard leaders from QueryCoord: no replica available

[2022-08-31T03:08:18.964Z] attempt #8:context deadline exceeded

[2022-08-31T03:08:18.964Z] )>

script returned exit code 1
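
One way to tell a slow replica recovery from a persistent failure is to keep retrying the search for a while after the redeploy. A minimal sketch, assuming the failing call is wrapped in a zero-argument function; `retry_while_no_replica` and its parameters are hypothetical names, not part of pymilvus or the test script:

```python
import time


def retry_while_no_replica(search_fn, retries=5, delay=2.0, sleep=time.sleep):
    """Call search_fn until it succeeds or the retries are used up.

    Only errors whose message contains "no replica available" are retried;
    anything else is re-raised immediately. All names here are illustrative.
    """
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            return search_fn()
        except Exception as err:
            if "no replica available" not in str(err):
                raise
            last_err = err
            if attempt < retries:
                # Give QueryCoord time to bring replicas back before retrying.
                sleep(delay)
    raise last_err
```

In the test script this could wrap the existing search call in a lambda. If every retry still fails, the problem is probably not just startup latency.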

Expected Behavior

All test cases pass.

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release/detail/deploy_test_kafka_for_release/9/pipeline

log: artifacts-cluster-upgrade-9-server-logs.tar.gz

artifacts-cluster-upgrade-9-pytest-logs.tar.gz

Anything else?

test script: https://github.com/milvus-io/milvus/blob/master/tests/python_client/deploy/scripts/second_recall_test.py

collection name: sift_128_euclidean

zhuwenxing avatar Aug 31 '22 06:08 zhuwenxing

/assign @jiaoew1991 /unassign

yanliang567 avatar Aug 31 '22 11:08 yanliang567

@zhuwenxing is it the same root cause with reinstall/upgrade

yanliang567 avatar Aug 31 '22 11:08 yanliang567

@zhuwenxing is it the same root cause with reinstall/upgrade

Not sure. The error message is not the same, so it needs further investigation by the devs.

zhuwenxing avatar Aug 31 '22 14:08 zhuwenxing

version: 2.1.0-20220902-853793a

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release/detail/deploy_test_for_release/22/pipeline

log: artifacts-standalone-reinstall-22-pytest-logs.tar.gz

artifacts-standalone-reinstall-22-server-logs.tar.gz

zhuwenxing avatar Sep 05 '22 03:09 zhuwenxing

@zhuwenxing the log for the second deploy is incomplete. Also, Milvus standalone crashed at the very start of the second deploy:

[2022/09/05 03:05:47.281 +00:00] [DEBUG] [server/rocksmq_impl.go:157] ["Start rocksmq "] ["max proc"=64] [parallism=4] ["lru cache"=4294967296]
panic: IO error: While lock file: /var/lib/milvus/rdb_data_meta_kv/LOCK: Resource temporarily unavailable

goroutine 1 [running]:
github.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).Run(0xc000c82090, 0xc000012001, 0x0, 0x0)
	/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:373 +0x11b5
github.com/milvus-io/milvus/cmd/milvus.(*run).execute(0xc0001f04e0, 0xc000050090, 0x3, 0x3, 0xc000666120)
	/go/src/github.com/milvus-io/milvus/cmd/milvus/run.go:111 +0x496
github.com/milvus-io/milvus/cmd/milvus.RunMilvus(0xc000050090, 0x3, 0x3)
	/go/src/github.com/milvus-io/milvus/cmd/milvus/milvus.go:60 +0x162
main.main()
	/go/src/github.com/milvus-io/milvus/cmd/main.go:26 +0x45

The next several runs succeeded (runs 27, 28, 30, 32). Run 33 is still running, with two pipelines passed so far.
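
"Resource temporarily unavailable" in the panic is the message for errno `EAGAIN`, which a non-blocking file lock attempt typically returns when something else already holds the lock (for example, a previous instance that has not released the data directory yet). The sketch below reproduces that errno with `flock` on a temporary file. It only illustrates the error, on a POSIX system, and is not how RocksDB or Milvus take the lock internally; `try_lock` is a hypothetical helper:

```python
import errno
import fcntl
import os
import tempfile


def try_lock(path):
    """Open path and attempt a non-blocking exclusive lock.

    Returns (fd, None) on success or (None, errno) if the lock is held.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd, None
    except OSError as err:
        os.close(fd)
        return None, err.errno


# Use a throwaway LOCK file rather than a real Milvus data directory.
lock_path = os.path.join(tempfile.mkdtemp(), "LOCK")

first_fd, first_err = try_lock(lock_path)     # first holder gets the lock
second_fd, second_err = try_lock(lock_path)   # second attempt is refused

print("first attempt errno:", first_err)
print("second attempt:", os.strerror(second_err))

os.close(first_fd)  # closing the descriptor releases the lock
```

If the same situation applies here, the second deploy would have started before the first instance let go of `/var/lib/milvus/rdb_data_meta_kv/LOCK`, but that is an assumption the incomplete logs cannot confirm.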

congqixia avatar Sep 05 '22 07:09 congqixia

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release/detail/deploy_test_for_release/50/pipeline

log: artifacts-cluster-upgrade-50-pytest-logs.tar.gz artifacts-cluster-upgrade-50-server-logs (1).tar.gz

zhuwenxing avatar Sep 06 '22 06:09 zhuwenxing

Kafka version, reinstall

image version: 2.1.0-20220908-ea0f57e

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release/detail/deploy_test_kafka_for_release/54/pipeline

log: artifacts-cluster-reinstall-54-server-logs.tar.gz artifacts-cluster-reinstall-54-pytest-logs.tar.gz

zhuwenxing avatar Sep 08 '22 08:09 zhuwenxing

@congqixia @jiaoew1991 Please help take a look at the latest failed job.

zhuwenxing avatar Sep 08 '22 08:09 zhuwenxing

Kafka version, cluster reinstall

version: 2.1.0-20220913-3c3ba55

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release/detail/deploy_test_kafka_for_release/96/pipeline/301

log: artifacts-cluster-reinstall-96-server-logs.tar.gz artifacts-cluster-reinstall-96-pytest-logs.tar.gz

zhuwenxing avatar Sep 15 '22 03:09 zhuwenxing

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release/detail/deploy_test_kafka_for_release/131/pipeline/303

log: artifacts-cluster-reinstall-131-server-logs.tar.gz artifacts-cluster-reinstall-131-pytest-logs.tar.gz

collection name: sift_128_euclidean

[2022-09-20T06:25:07.201Z] + python3 scripts/second_recall_test.py --host 10.101.4.244

[2022-09-20T06:25:19.337Z] RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2022-09-20T06:25:19.337Z] attempt #1:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.337Z] attempt #2:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.337Z] attempt #3:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.337Z] attempt #4:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.337Z] attempt #5:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.338Z] attempt #6:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.338Z] attempt #7:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.338Z] attempt #8:context deadline exceeded

[2022-09-20T06:25:19.338Z] )>, <Time:{'RPC start': '2022-09-20 06:25:08.447917', 'RPC error': '2022-09-20 06:25:18.591776'}>

[2022-09-20T06:25:19.338Z] 

[2022-09-20T06:25:19.338Z] Search...

[2022-09-20T06:25:19.338Z] Traceback (most recent call last):

[2022-09-20T06:25:19.338Z]   File "scripts/second_recall_test.py", line 64, in <module>

[2022-09-20T06:25:19.338Z]     search_test(host)

[2022-09-20T06:25:19.338Z]   File "scripts/second_recall_test.py", line 34, in search_test

[2022-09-20T06:25:19.338Z]     test[:nq], "float_vector", search_params, topK, output_fields=["int64"], timeout=TIMEOUT

[2022-09-20T06:25:19.338Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/orm/collection.py", line 718, in search

[2022-09-20T06:25:19.338Z]     partition_names, output_fields, round_decimal, timeout=timeout, schema=schema_dict, **kwargs)

[2022-09-20T06:25:19.338Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 101, in handler

[2022-09-20T06:25:19.338Z]     raise e

[2022-09-20T06:25:19.338Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 97, in handler

[2022-09-20T06:25:19.338Z]     return func(*args, **kwargs)

[2022-09-20T06:25:19.338Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 127, in handler

[2022-09-20T06:25:19.338Z]     ret = func(self, *args, **kwargs)

[2022-09-20T06:25:19.338Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 79, in handler

[2022-09-20T06:25:19.338Z]     raise e

[2022-09-20T06:25:19.338Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 50, in handler

[2022-09-20T06:25:19.338Z]     return func(self, *args, **kwargs)

[2022-09-20T06:25:19.338Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 472, in search

[2022-09-20T06:25:19.338Z]     return self._execute_search_requests(requests, timeout, **_kwargs)

[2022-09-20T06:25:19.338Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 436, in _execute_search_requests

[2022-09-20T06:25:19.338Z]     raise pre_err

[2022-09-20T06:25:19.338Z]   File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 427, in _execute_search_requests

[2022-09-20T06:25:19.338Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2022-09-20T06:25:19.338Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2022-09-20T06:25:19.338Z] attempt #1:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.338Z] attempt #2:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.338Z] attempt #3:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.338Z] attempt #4:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.338Z] attempt #5:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.338Z] attempt #6:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.338Z] attempt #7:fail to get shard leaders from QueryCoord: no replica available

[2022-09-20T06:25:19.338Z] attempt #8:context deadline exceeded

[2022-09-20T06:25:19.338Z] )>

script returned exit code 1

zhuwenxing avatar Sep 20 '22 08:09 zhuwenxing

/assign @zhuwenxing /unassign

jiaoew1991 avatar Oct 11 '22 11:10 jiaoew1991

It could not be reproduced when upgrading from 2.1.4 to master or when reinstalling master, so I'm closing it.

zhuwenxing avatar Oct 11 '22 12:10 zhuwenxing