milvus
milvus copied to clipboard
[Bug]: Create index timeout after indexcoord pod kill chaos test
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:master-20220624-8388478e
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus==2.1.0.dev87
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
2022-06-24T18:53:34.6622800Z [2022-06-24 18:52:54 - DEBUG - ci_test]: (api_request) : [Collection.create_index] args: ['float_vector', {'index_type': 'IVF_SQ8', 'params': {'nlist': 64}, 'metric_type': 'L2'}], kwargs: {'name': 'test_Lt8h2LmT', 'timeout': 40, 'index_name': '_default_idx'} (api_request.py:55)
2022-06-24T18:53:34.6623833Z [2022-06-24 18:53:34 - ERROR - pymilvus.decorators]: RPC error: [wait_for_creating_index], <MilvusException: (code=1, message=CreateIndex Timeout)>, <Time:{'RPC start': '2022-06-24 18:52:54.243495', 'RPC error': '2022-06-24 18:53:34.345417'}> (decorators.py:94)
2022-06-24T18:53:34.6624804Z [2022-06-24 18:53:34 - ERROR - pymilvus.decorators]: RPC error: [create_index], <MilvusException: (code=1, message=CreateIndex Timeout)>, <Time:{'RPC start': '2022-06-24 18:52:54.170661', 'RPC error': '2022-06-24 18:53:34.345748'}> (decorators.py:94)
2022-06-24T18:53:34.6625413Z [2022-06-24 18:53:34 - ERROR - ci_test]: Traceback (most recent call last):
2022-06-24T18:53:34.6625881Z File "/home/runner/work/milvus/milvus/tests/python_client/utils/api_request.py", line 22, in inner_wrapper
2022-06-24T18:53:34.6626238Z res = func(*args, **kwargs)
2022-06-24T18:53:34.6626626Z File "/home/runner/work/milvus/milvus/tests/python_client/utils/api_request.py", line 56, in api_request
2022-06-24T18:53:34.6626970Z return func(*arg, **kwargs)
2022-06-24T18:53:34.6627538Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 1077, in create_index
2022-06-24T18:53:34.6627944Z return conn.create_index(self._name, field_name, index_params,
2022-06-24T18:53:34.6628523Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 95, in handler
2022-06-24T18:53:34.6628955Z raise e
2022-06-24T18:53:34.6629462Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 91, in handler
2022-06-24T18:53:34.6629799Z return func(*args, **kwargs)
2022-06-24T18:53:34.6630321Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 73, in handler
2022-06-24T18:53:34.6630651Z raise e
2022-06-24T18:53:34.6631152Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 48, in handler
2022-06-24T18:53:34.6631490Z return func(self, *args, **kwargs)
2022-06-24T18:53:34.6632057Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 581, in create_index
2022-06-24T18:53:34.6632529Z index_success, fail_reason = self.wait_for_creating_index(collection_name=collection_name,
2022-06-24T18:53:34.6633132Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 95, in handler
2022-06-24T18:53:34.6633465Z raise e
2022-06-24T18:53:34.6633946Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 91, in handler
2022-06-24T18:53:34.6634306Z return func(*args, **kwargs)
2022-06-24T18:53:34.6634829Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 73, in handler
2022-06-24T18:53:34.6635157Z raise e
2022-06-24T18:53:34.6635717Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 48, in handler
2022-06-24T18:53:34.6636075Z return func(self, *args, **kwargs)
2022-06-24T18:53:34.6636692Z File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 654, in wait_for_creating_index
2022-06-24T18:53:34.6637148Z raise MilvusException(1, "CreateIndex Timeout")
2022-06-24T18:53:34.6637538Z pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=CreateIndex Timeout)>
2022-06-24T18:53:34.6637899Z (api_request.py:35)
2022-06-24T18:53:34.6638428Z [2022-06-24 18:53:34 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=CreateIndex Timeout)> (api_request.py:36)
Expected Behavior
all test cases passed
Steps To Reproduce
see https://github.com/milvus-io/milvus/runs/7046878180?check_suite_focus=true
Milvus Log
failed job: https://github.com/milvus-io/milvus/runs/7046878180?check_suite_focus=true log: https://github.com/milvus-io/milvus/suites/7082807023/artifacts/279862831
Anything else?
No response
It seems a stable issue another failed job: https://github.com/zhuwenxing/milvus/runs/7047015719?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7082936798/artifacts/279864029
/assign @xiaocai2333
After more tests, it seems not a stable issue. the following tests all passed
- http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7204/pipeline
- https://github.com/zhuwenxing/milvus/runs/7050946366?check_suite_focus=true
failed job: https://github.com/zhuwenxing/milvus/runs/7062698145?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7097950042/artifacts/280939021
working on it~
When IndexCoord starts, it first load the meta information from ETCD, and then watch ETCD. As a result, If meta was changed during the period, IndexCoord can not perceive it.
pr #17818 will fix it.
@zhuwenxing please verify it~ /unassign /assign @zhuwenxing
version master-20220628-6b79d4f7
failed job: https://github.com/zhuwenxing/milvus/runs/7085026662?check_suite_focus=true
log: https://github.com/zhuwenxing/milvus/suites/7118659596/artifacts/282285078
/assign
some other pod kill chaos tests also have this issue.
- etcd pod kill
- failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7414/pipeline
- log: artifacts-etcd-pod-kill-7414-server-logs.tar.gz
- querycoord pod kill
- failed job:http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7411/pipeline
- log:artifacts-querycoord-pod-kill-7411-server-logs.tar.gz
we are working on a complete solution to fix
version master-20220707-543197c2
failed job: https://github.com/zhuwenxing/milvus/runs/7230471367?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7254559572/artifacts/291573957
@xiaocai2333 Please take a look!
Milvus version: master-20220707-43ae6791
testcase: indexnode pod kill
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test-kafka/detail/chaos-test-kafka/371/pipeline/61 log: artifacts-indexnode-pod-kill-371-server-logs.tar.gz
"Create index timeout" issue in nightly for branch 2.1.0, @xiaocai2333 Could you help to have a look if it is the similar issue? If not, I will create a new issue to track this, thanks.
https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/140/pipeline/215
-
milvus log: milvus-distributed-pulsar-140-pymilvus-e2e-logs.tar.gz
-
collection name: index_csybvMK5
-
Failed time: [2022-07-07T23:51:23.329Z] [gw2] [ 22%] FAILED testcases/test_index.py::TestNewIndexBase::test_create_index_insert_flush[get_simple_index8]
-
Test log:
[2022-07-08T00:40:35.354Z] [2022-07-07 23:50:42 - DEBUG - ci_test]: (api_request) : [Collection.create_index] args: ['float_vector', {'index_type': 'RHNSW_SQ', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}], kwargs: {'timeout': 40, 'index_name': '_default_idx'} (api_request.py:55)
[2022-07-08T00:40:35.354Z] [2022-07-07 23:51:23 - ERROR - pymilvus.decorators]: RPC error: [wait_for_creating_index], <MilvusException: (code=1, message=CreateIndex Timeout)>, <Time:{'RPC start': '2022-07-07 23:50:43.000104', 'RPC error': '2022-07-07 23:51:23.058635'}> (decorators.py:78)
[2022-07-08T00:40:35.355Z] [2022-07-07 23:51:23 - ERROR - pymilvus.decorators]: RPC error: [create_index], <MilvusException: (code=1, message=CreateIndex Timeout)>, <Time:{'RPC start': '2022-07-07 23:50:42.982275', 'RPC error': '2022-07-07 23:51:23.058946'}> (decorators.py:78)
[2022-07-08T00:40:35.355Z] [2022-07-07 23:51:23 - ERROR - ci_test]: Traceback (most recent call last):
[2022-07-08T00:40:35.355Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 22, in inner_wrapper
[2022-07-08T00:40:35.355Z] res = func(*args, **kwargs)
[2022-07-08T00:40:35.355Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 56, in api_request
[2022-07-08T00:40:35.355Z] return func(*arg, **kwargs)
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/orm/collection.py", line 1078, in create_index
[2022-07-08T00:40:35.355Z] timeout=timeout, **kwargs)
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 61, in handler
[2022-07-08T00:40:35.355Z] raise e
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 44, in handler
[2022-07-08T00:40:35.355Z] return func(self, *args, **kwargs)
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 79, in handler
[2022-07-08T00:40:35.355Z] raise e
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 75, in handler
[2022-07-08T00:40:35.355Z] return func(*args, **kwargs)
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/client/grpc_handler.py", line 604, in create_index
[2022-07-08T00:40:35.355Z] timeout=timeout)
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 61, in handler
[2022-07-08T00:40:35.355Z] raise e
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 44, in handler
[2022-07-08T00:40:35.355Z] return func(self, *args, **kwargs)
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 79, in handler
[2022-07-08T00:40:35.355Z] raise e
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 75, in handler
[2022-07-08T00:40:35.355Z] return func(*args, **kwargs)
[2022-07-08T00:40:35.355Z] File "/usr/local/lib/python3.6/site-packages/pymilvus/client/grpc_handler.py", line 679, in wait_for_creating_index
[2022-07-08T00:40:35.355Z] raise MilvusException(1, "CreateIndex Timeout")
[2022-07-08T00:40:35.355Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=CreateIndex Timeout)>
[2022-07-08T00:40:35.355Z] (api_request.py:35)
[2022-07-08T00:40:35.355Z] [2022-07-07 23:51:23 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=CreateIndex Timeout)> (api_request.py:36)
I check the log, and the reason of this problem is accessing an object with a nil pointer. BTW, what is the timeout of this case? @binbinlv
fix it with #18168
I check the log, and the reason of this problem is accessing an object with a nil pointer. BTW, what is the timeout of this case? @binbinlv 40s
@zhuwenxing this bug has been fixed, please verify it.
version: master-20220715-f0846fb7
failed job: https://github.com/milvus-io/milvus/runs/7379802104?check_suite_focus=true
log: logs-indexnode-pod_kill
/assign @xiaocai2333 /unassign @zhuwenxing
version master-20220718-f7bd054a
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/8433/pipeline
log: artifacts-indexnode-pod-kill-8433-server-logs.tar.gz
[2022/07/17 19:46:56.513 +00:00] [INFO] [indexcoord/node_manager.go:139] ["peek client success"] [nodeID=30]
[2022/07/17 19:46:56.516 +00:00] [INFO] [indexcoord/meta_table.go:309] ["IndexCoord metaTable UpdateVersion success"] [IndexBuildId=434657602713157634] [nodeID=30]
[2022/07/17 19:47:06.534 +00:00] [INFO] [indexcoord/node_manager.go:139] ["peek client success"] [nodeID=30]
[2022/07/17 19:47:06.536 +00:00] [INFO] [indexcoord/meta_table.go:309] ["IndexCoord metaTable UpdateVersion success"] [IndexBuildId=434657552735666178] [nodeID=30]
The scheduler's time is too long(10s), so the tasks not being allocated in time. I will reduce it.
I have tried five times with version master-20220719-db71c74c
, and this issue was not reproduced.
Remove the critical label.
awesome!
I have tried five times with version
master-20220719-db71c74c
, and this issue was not reproduced. Remove the critical label.
There still seems to be a possibility that this issue will happen failed job: https://github.com/zhuwenxing/milvus/runs/7443664307?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7454035644/artifacts/305097701
There still seems to be a possibility that this issue will happen failed job: https://github.com/zhuwenxing/milvus/runs/7443664307?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7454035644/artifacts/305097701
[2022/07/21 06:33:20.745 +00:00] [DEBUG] [indexcoord/index_coord.go:789] ["IndexCoord watchNodeLoop SessionAddEvent"] [serverID=20] [address=10.244.0.51:21121]
[2022/07/21 06:34:41.120 +00:00] [DEBUG] [indexcoord/index_coord.go:789] ["IndexCoord watchNodeLoop SessionAddEvent"] [serverID=21] [address=10.244.0.52:21121]
[2022/07/21 06:37:01.414 +00:00] [DEBUG] [indexcoord/index_coord.go:789] ["IndexCoord watchNodeLoop SessionAddEvent"] [serverID=22] [address=10.244.0.53:21121]
[2022/07/21 06:37:01.414 +00:00] [DEBUG] [indexcoord/index_coord.go:789] ["IndexCoord watchNodeLoop SessionAddEvent"] [serverID=23] [address=10.244.0.54:21121]
[2022/07/21 06:33:20.745 +00:00] [DEBUG] [indexcoord/index_coord.go:801] ["IndexCoord watchNodeLoop SessionDelEvent"] [serverID=20]
[2022/07/21 06:34:41.120 +00:00] [DEBUG] [indexcoord/index_coord.go:801] ["IndexCoord watchNodeLoop SessionDelEvent"] [serverID=21]
[2022/07/21 06:37:01.414 +00:00] [DEBUG] [indexcoord/index_coord.go:801] ["IndexCoord watchNodeLoop SessionDelEvent"] [serverID=22]
[2022/07/21 06:37:01.414 +00:00] [DEBUG] [indexcoord/index_coord.go:801] ["IndexCoord watchNodeLoop SessionDelEvent"] [serverID=23]
It seems that there is no IndexNode available in the final state, there should be a problem with deployment.
But in IndexNode's log, the serverID is 39.
[2022/07/21 06:34:21.541 +00:00] [DEBUG] [sessionutil/session_util.go:200] ["service begin to register to etcd"] [serverName=indexnode] [ServerID=39]
[2022/07/21 06:33:05.588 +00:00] [DEBUG] [sessionutil/session_util.go:384] ["watch services"] ["delete kv"="key:\"by-dev/meta/session/indexnode-23\" create_revision:1709 mod_revision:1709 version:1 value:\"{\\\"ServerID\\\":23,\\\"ServerName\\\":\\\"indexnode\\\",\\\"Address\\\":\\\"10.244.0.54:21121\\\",\\\"TriggerKill\\\":true}\" lease:9100792018510807504 "]
We will check the status of Milvus before running tests.
Indexnode is working well before running test_e2e.py