milvus icon indicating copy to clipboard operation
milvus copied to clipboard

[Bug]: Create index timeout after indexcoord pod kill chaos test

Open zhuwenxing opened this issue 2 years ago • 30 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:master-20220624-8388478e
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus==2.1.0.dev87
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

2022-06-24T18:53:34.6622800Z [2022-06-24 18:52:54 - DEBUG - ci_test]: (api_request)  : [Collection.create_index] args: ['float_vector', {'index_type': 'IVF_SQ8', 'params': {'nlist': 64}, 'metric_type': 'L2'}], kwargs: {'name': 'test_Lt8h2LmT', 'timeout': 40, 'index_name': '_default_idx'} (api_request.py:55)
2022-06-24T18:53:34.6623833Z [2022-06-24 18:53:34 - ERROR - pymilvus.decorators]: RPC error: [wait_for_creating_index], <MilvusException: (code=1, message=CreateIndex Timeout)>, <Time:{'RPC start': '2022-06-24 18:52:54.243495', 'RPC error': '2022-06-24 18:53:34.345417'}> (decorators.py:94)
2022-06-24T18:53:34.6624804Z [2022-06-24 18:53:34 - ERROR - pymilvus.decorators]: RPC error: [create_index], <MilvusException: (code=1, message=CreateIndex Timeout)>, <Time:{'RPC start': '2022-06-24 18:52:54.170661', 'RPC error': '2022-06-24 18:53:34.345748'}> (decorators.py:94)
2022-06-24T18:53:34.6625413Z [2022-06-24 18:53:34 - ERROR - ci_test]: Traceback (most recent call last):
2022-06-24T18:53:34.6625881Z   File "/home/runner/work/milvus/milvus/tests/python_client/utils/api_request.py", line 22, in inner_wrapper
2022-06-24T18:53:34.6626238Z     res = func(*args, **kwargs)
2022-06-24T18:53:34.6626626Z   File "/home/runner/work/milvus/milvus/tests/python_client/utils/api_request.py", line 56, in api_request
2022-06-24T18:53:34.6626970Z     return func(*arg, **kwargs)
2022-06-24T18:53:34.6627538Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 1077, in create_index
2022-06-24T18:53:34.6627944Z     return conn.create_index(self._name, field_name, index_params,
2022-06-24T18:53:34.6628523Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 95, in handler
2022-06-24T18:53:34.6628955Z     raise e
2022-06-24T18:53:34.6629462Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 91, in handler
2022-06-24T18:53:34.6629799Z     return func(*args, **kwargs)
2022-06-24T18:53:34.6630321Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 73, in handler
2022-06-24T18:53:34.6630651Z     raise e
2022-06-24T18:53:34.6631152Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 48, in handler
2022-06-24T18:53:34.6631490Z     return func(self, *args, **kwargs)
2022-06-24T18:53:34.6632057Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 581, in create_index
2022-06-24T18:53:34.6632529Z     index_success, fail_reason = self.wait_for_creating_index(collection_name=collection_name,
2022-06-24T18:53:34.6633132Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 95, in handler
2022-06-24T18:53:34.6633465Z     raise e
2022-06-24T18:53:34.6633946Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 91, in handler
2022-06-24T18:53:34.6634306Z     return func(*args, **kwargs)
2022-06-24T18:53:34.6634829Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 73, in handler
2022-06-24T18:53:34.6635157Z     raise e
2022-06-24T18:53:34.6635717Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 48, in handler
2022-06-24T18:53:34.6636075Z     return func(self, *args, **kwargs)
2022-06-24T18:53:34.6636692Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 654, in wait_for_creating_index
2022-06-24T18:53:34.6637148Z     raise MilvusException(1, "CreateIndex Timeout")
2022-06-24T18:53:34.6637538Z pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=CreateIndex Timeout)>
2022-06-24T18:53:34.6637899Z  (api_request.py:35)
2022-06-24T18:53:34.6638428Z [2022-06-24 18:53:34 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=CreateIndex Timeout)> (api_request.py:36)

Expected Behavior

all test cases passed

Steps To Reproduce

see https://github.com/milvus-io/milvus/runs/7046878180?check_suite_focus=true

Milvus Log

failed job: https://github.com/milvus-io/milvus/runs/7046878180?check_suite_focus=true log: https://github.com/milvus-io/milvus/suites/7082807023/artifacts/279862831

Anything else?

No response

zhuwenxing avatar Jun 25 '22 02:06 zhuwenxing

It seems a stable issue another failed job: https://github.com/zhuwenxing/milvus/runs/7047015719?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7082936798/artifacts/279864029

zhuwenxing avatar Jun 25 '22 02:06 zhuwenxing

/assign @xiaocai2333

xiaofan-luan avatar Jun 25 '22 03:06 xiaofan-luan

After more tests, it seems not a stable issue. the following tests all passed

  • http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7204/pipeline
  • https://github.com/zhuwenxing/milvus/runs/7050946366?check_suite_focus=true

zhuwenxing avatar Jun 25 '22 07:06 zhuwenxing

failed job: https://github.com/zhuwenxing/milvus/runs/7062698145?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7097950042/artifacts/280939021

zhuwenxing avatar Jun 27 '22 02:06 zhuwenxing

working on it~

xiaocai2333 avatar Jun 27 '22 02:06 xiaocai2333

When IndexCoord starts, it first load the meta information from ETCD, and then watch ETCD. As a result, If meta was changed during the period, IndexCoord can not perceive it.

xiaocai2333 avatar Jun 27 '22 03:06 xiaocai2333

pr #17818 will fix it.

xiaocai2333 avatar Jun 27 '22 03:06 xiaocai2333

@zhuwenxing please verify it~ /unassign /assign @zhuwenxing

xiaocai2333 avatar Jun 28 '22 01:06 xiaocai2333

version master-20220628-6b79d4f7 failed job: https://github.com/zhuwenxing/milvus/runs/7085026662?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7118659596/artifacts/282285078

zhuwenxing avatar Jun 28 '22 05:06 zhuwenxing

/assign

xiaocai2333 avatar Jun 28 '22 06:06 xiaocai2333

some other pod kill chaos tests also have this issue.

zhuwenxing avatar Jun 28 '22 07:06 zhuwenxing

we are working on a complete solution to fix

xiaofan-luan avatar Jul 01 '22 02:07 xiaofan-luan

version master-20220707-543197c2

failed job: https://github.com/zhuwenxing/milvus/runs/7230471367?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7254559572/artifacts/291573957

@xiaocai2333 Please take a look!

zhuwenxing avatar Jul 07 '22 12:07 zhuwenxing

Milvus version: master-20220707-43ae6791 testcase: indexnode pod kill

failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test-kafka/detail/chaos-test-kafka/371/pipeline/61 log: artifacts-indexnode-pod-kill-371-server-logs.tar.gz

zhuwenxing avatar Jul 08 '22 02:07 zhuwenxing

"Create index timeout" issue in nightly for branch 2.1.0, @xiaocai2333 Could you help to have a look if it is the similar issue? If not, I will create a new issue to track this, thanks.

https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/140/pipeline/215

  1. milvus log: milvus-distributed-pulsar-140-pymilvus-e2e-logs.tar.gz

  2. collection name: index_csybvMK5

  3. Failed time: [2022-07-07T23:51:23.329Z] [gw2] [ 22%] FAILED testcases/test_index.py::TestNewIndexBase::test_create_index_insert_flush[get_simple_index8]

  4. Test log:

[2022-07-08T00:40:35.354Z] [2022-07-07 23:50:42 - DEBUG - ci_test]: (api_request)  : [Collection.create_index] args: ['float_vector', {'index_type': 'RHNSW_SQ', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}], kwargs: {'timeout': 40, 'index_name': '_default_idx'} (api_request.py:55)

[2022-07-08T00:40:35.354Z] [2022-07-07 23:51:23 - ERROR - pymilvus.decorators]: RPC error: [wait_for_creating_index], <MilvusException: (code=1, message=CreateIndex Timeout)>, <Time:{'RPC start': '2022-07-07 23:50:43.000104', 'RPC error': '2022-07-07 23:51:23.058635'}> (decorators.py:78)

[2022-07-08T00:40:35.355Z] [2022-07-07 23:51:23 - ERROR - pymilvus.decorators]: RPC error: [create_index], <MilvusException: (code=1, message=CreateIndex Timeout)>, <Time:{'RPC start': '2022-07-07 23:50:42.982275', 'RPC error': '2022-07-07 23:51:23.058946'}> (decorators.py:78)

[2022-07-08T00:40:35.355Z] [2022-07-07 23:51:23 - ERROR - ci_test]: Traceback (most recent call last):

[2022-07-08T00:40:35.355Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 22, in inner_wrapper

[2022-07-08T00:40:35.355Z]     res = func(*args, **kwargs)

[2022-07-08T00:40:35.355Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 56, in api_request

[2022-07-08T00:40:35.355Z]     return func(*arg, **kwargs)

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/orm/collection.py", line 1078, in create_index

[2022-07-08T00:40:35.355Z]     timeout=timeout, **kwargs)

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 61, in handler

[2022-07-08T00:40:35.355Z]     raise e

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 44, in handler

[2022-07-08T00:40:35.355Z]     return func(self, *args, **kwargs)

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 79, in handler

[2022-07-08T00:40:35.355Z]     raise e

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 75, in handler

[2022-07-08T00:40:35.355Z]     return func(*args, **kwargs)

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/client/grpc_handler.py", line 604, in create_index

[2022-07-08T00:40:35.355Z]     timeout=timeout)

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 61, in handler

[2022-07-08T00:40:35.355Z]     raise e

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 44, in handler

[2022-07-08T00:40:35.355Z]     return func(self, *args, **kwargs)

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 79, in handler

[2022-07-08T00:40:35.355Z]     raise e

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 75, in handler

[2022-07-08T00:40:35.355Z]     return func(*args, **kwargs)

[2022-07-08T00:40:35.355Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/client/grpc_handler.py", line 679, in wait_for_creating_index

[2022-07-08T00:40:35.355Z]     raise MilvusException(1, "CreateIndex Timeout")

[2022-07-08T00:40:35.355Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=CreateIndex Timeout)>

[2022-07-08T00:40:35.355Z]  (api_request.py:35)

[2022-07-08T00:40:35.355Z] [2022-07-07 23:51:23 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=CreateIndex Timeout)> (api_request.py:36)

binbinlv avatar Jul 08 '22 02:07 binbinlv

I check the log, and the reason of this problem is accessing an object with a nil pointer. BTW, what is the timeout of this case? @binbinlv

xiaocai2333 avatar Jul 08 '22 08:07 xiaocai2333

fix it with #18168

xiaocai2333 avatar Jul 08 '22 08:07 xiaocai2333

I check the log, and the reason of this problem is accessing an object with a nil pointer. BTW, what is the timeout of this case? @binbinlv 40s

binbinlv avatar Jul 08 '22 09:07 binbinlv

@zhuwenxing this bug has been fixed, please verify it.

xiaocai2333 avatar Jul 13 '22 09:07 xiaocai2333

version: master-20220715-f0846fb7 failed job: https://github.com/milvus-io/milvus/runs/7379802104?check_suite_focus=true log: logs-indexnode-pod_kill

zhuwenxing avatar Jul 18 '22 01:07 zhuwenxing

/assign @xiaocai2333 /unassign @zhuwenxing

yanliang567 avatar Jul 19 '22 00:07 yanliang567

version master-20220718-f7bd054a failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/8433/pipeline log: artifacts-indexnode-pod-kill-8433-server-logs.tar.gz

zhuwenxing avatar Jul 19 '22 01:07 zhuwenxing

[2022/07/17 19:46:56.513 +00:00] [INFO] [indexcoord/node_manager.go:139] ["peek client success"] [nodeID=30]
[2022/07/17 19:46:56.516 +00:00] [INFO] [indexcoord/meta_table.go:309] ["IndexCoord metaTable UpdateVersion success"] [IndexBuildId=434657602713157634] [nodeID=30]
[2022/07/17 19:47:06.534 +00:00] [INFO] [indexcoord/node_manager.go:139] ["peek client success"] [nodeID=30]
[2022/07/17 19:47:06.536 +00:00] [INFO] [indexcoord/meta_table.go:309] ["IndexCoord metaTable UpdateVersion success"] [IndexBuildId=434657552735666178] [nodeID=30]

The scheduler's time is too long(10s), so the tasks not being allocated in time. I will reduce it.

xiaocai2333 avatar Jul 19 '22 02:07 xiaocai2333

I have tried five times with version master-20220719-db71c74c, and this issue was not reproduced. Remove the critical label.

zhuwenxing avatar Jul 19 '22 12:07 zhuwenxing

awesome!

I have tried five times with version master-20220719-db71c74c, and this issue was not reproduced. Remove the critical label.

yanliang567 avatar Jul 19 '22 13:07 yanliang567

There still seems to be a possibility that this issue will happen failed job: https://github.com/zhuwenxing/milvus/runs/7443664307?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7454035644/artifacts/305097701

zhuwenxing avatar Jul 21 '22 07:07 zhuwenxing

There still seems to be a possibility that this issue will happen failed job: https://github.com/zhuwenxing/milvus/runs/7443664307?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7454035644/artifacts/305097701

[2022/07/21 06:33:20.745 +00:00] [DEBUG] [indexcoord/index_coord.go:789] ["IndexCoord watchNodeLoop SessionAddEvent"] [serverID=20] [address=10.244.0.51:21121]
[2022/07/21 06:34:41.120 +00:00] [DEBUG] [indexcoord/index_coord.go:789] ["IndexCoord watchNodeLoop SessionAddEvent"] [serverID=21] [address=10.244.0.52:21121]
[2022/07/21 06:37:01.414 +00:00] [DEBUG] [indexcoord/index_coord.go:789] ["IndexCoord watchNodeLoop SessionAddEvent"] [serverID=22] [address=10.244.0.53:21121]
[2022/07/21 06:37:01.414 +00:00] [DEBUG] [indexcoord/index_coord.go:789] ["IndexCoord watchNodeLoop SessionAddEvent"] [serverID=23] [address=10.244.0.54:21121]
[2022/07/21 06:33:20.745 +00:00] [DEBUG] [indexcoord/index_coord.go:801] ["IndexCoord watchNodeLoop SessionDelEvent"] [serverID=20]
[2022/07/21 06:34:41.120 +00:00] [DEBUG] [indexcoord/index_coord.go:801] ["IndexCoord watchNodeLoop SessionDelEvent"] [serverID=21]
[2022/07/21 06:37:01.414 +00:00] [DEBUG] [indexcoord/index_coord.go:801] ["IndexCoord watchNodeLoop SessionDelEvent"] [serverID=22]
[2022/07/21 06:37:01.414 +00:00] [DEBUG] [indexcoord/index_coord.go:801] ["IndexCoord watchNodeLoop SessionDelEvent"] [serverID=23]

It seems that there is no IndexNode available in the final state, there should be a problem with deployment.

xiaocai2333 avatar Jul 21 '22 07:07 xiaocai2333

But in IndexNode's log, the serverID is 39.

[2022/07/21 06:34:21.541 +00:00] [DEBUG] [sessionutil/session_util.go:200] ["service begin to register to etcd"] [serverName=indexnode] [ServerID=39]

xiaocai2333 avatar Jul 21 '22 07:07 xiaocai2333

[2022/07/21 06:33:05.588 +00:00] [DEBUG] [sessionutil/session_util.go:384] ["watch services"] ["delete kv"="key:\"by-dev/meta/session/indexnode-23\" create_revision:1709 mod_revision:1709 version:1 value:\"{\\\"ServerID\\\":23,\\\"ServerName\\\":\\\"indexnode\\\",\\\"Address\\\":\\\"10.244.0.54:21121\\\",\\\"TriggerKill\\\":true}\" lease:9100792018510807504 "]

xiaocai2333 avatar Jul 21 '22 07:07 xiaocai2333

We will check the status of Milvus before running tests. Indexnode is working well before running test_e2e.py image

zhuwenxing avatar Jul 21 '22 08:07 zhuwenxing