[Bug]: Flush failed due to `etcdserver: request timed out` after standalone pod kill chaos test
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.2.0-20230310-b2ece6a5
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka): kafka
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:23 - DEBUG - ci_test]: (api_request) : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:56)
[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:33 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=failed to flush 440006098629104321, etcdserver: request timed out)>, <Time:{'RPC start': '2023-03-10 23:26:23.189213', 'RPC error': '2023-03-10 23:26:33.193914'}> (decorators.py:108)
[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:33 - ERROR - ci_test]: Traceback (most recent call last):
[2023-03-10T23:28:40.470Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2023-03-10T23:28:40.470Z] res = func(*args, **_kwargs)
[2023-03-10T23:28:40.470Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2023-03-10T23:28:40.470Z] return func(*arg, **kwargs)
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 280, in flush
[2023-03-10T23:28:40.470Z] conn.flush([self.name], timeout=timeout, **kwargs)
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2023-03-10T23:28:40.470Z] raise e
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2023-03-10T23:28:40.470Z] return func(*args, **kwargs)
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2023-03-10T23:28:40.470Z] ret = func(self, *args, **kwargs)
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
[2023-03-10T23:28:40.470Z] raise e
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2023-03-10T23:28:40.470Z] return func(self, *args, **kwargs)
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 881, in flush
[2023-03-10T23:28:40.470Z] raise MilvusException(response.status.error_code, response.status.reason)
[2023-03-10T23:28:40.470Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=failed to flush 440006098629104321, etcdserver: request timed out)>
[2023-03-10T23:28:40.470Z] (api_request.py:39)
[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:33 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=failed to flush 440006098629104321, etcdserver: request timed out)> (api_request.py:40)
[2023-03-10T23:28:40.470Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------
[2023-03-10T23:28:40.470Z] =========================== short test summary info ============================
[2023-03-10T23:28:40.471Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__StjqnsUr] - AssertionError
[2023-03-10T23:28:40.471Z] =================== 1 failed, 10 passed in 145.22s (0:02:25) ===================
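For reference, the failing step is just `Collection.flush` with a 180 s timeout, as shown in the traceback above. A minimal pymilvus sketch of that call (host, port, and collection name are illustrative assumptions, not taken from the CI environment):

```python
from pymilvus import Collection, MilvusException, connections

# Connect to the standalone Milvus instance (host/port are assumptions).
connections.connect(alias="default", host="127.0.0.1", port="19530")

# The chaos test iterates over checker-created collections; any existing
# collection name works for this sketch.
collection = Collection("Checker__StjqnsUr")

try:
    # Same call and timeout as the step that failed in the chaos test.
    collection.flush(timeout=180)
except MilvusException as e:
    # Under the pod-failure chaos this surfaced as
    # "failed to flush ..., etcdserver: request timed out".
    print(f"flush failed: {e}")
```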
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
chaos type: pod-failure
image tag: 2.2.0-20230310-b2ece6a5
target pod: standalone
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release-cron/detail/chaos-test-kafka-for-release-cron/2596/pipeline
log:
- artifacts-standalone-pod-failure-2596-server-logs.tar.gz
- artifacts-standalone-pod-failure-2596-pytest-logs.tar.gz
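For context, the experiment disrupts the standalone pod while the test suite runs; in the CI job this is driven by a chaos-mesh pod-failure experiment. A rough pod-kill approximation with the official kubernetes Python client (namespace and label selector are assumptions, adjust to the actual deployment):

```python
from kubernetes import client, config

# Assumes a kubeconfig with access to the cluster running Milvus.
config.load_kube_config()
v1 = client.CoreV1Api()

# Namespace and label selector are assumptions for a helm-deployed standalone.
namespace = "chaos-testing"
pods = v1.list_namespaced_pod(
    namespace=namespace,
    label_selector="app.kubernetes.io/name=milvus",
)

for pod in pods.items:
    # Deleting the pod approximates a "pod kill"; chaos-mesh's pod-failure
    # experiment instead makes the pod unavailable for a fixed duration.
    v1.delete_namespaced_pod(name=pod.metadata.name, namespace=namespace)
    print(f"deleted {pod.metadata.name}")
```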
Anything else?
No response
/assign @jiaoew1991 /unassign
/assign @XuanYang-cn /unassign
The `etcdserver: request timed out` error is raised by the etcd server itself.
From the etcd logs, the most likely cause is that etcd is slow because of poor disk I/O.
etcd is sensitive to disk I/O performance: https://etcd.io/docs/v3.5/op-guide/hardware/#disks
> Fast disks are the most critical factor for etcd deployment performance and stability. A slow disk will increase etcd request latency and potentially hurt cluster stability. Additionally, etcd will also incrementally checkpoint its state to disk so it can truncate this log. If these writes take too long, heartbeats may time out and trigger an election.
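One way to confirm that a slow disk is the culprit is to look at etcd's WAL fsync latency histogram. A hedged sketch that scrapes etcd's Prometheus metrics and estimates the p99 fsync time (the metric names are standard etcd metrics, but the endpoint URL and the healthy-latency rule of thumb are assumptions about this setup):

```python
import urllib.request

# etcd serves Prometheus-format metrics on its client port (URL is an
# assumption for a local standalone deployment).
METRICS_URL = "http://localhost:2379/metrics"

with urllib.request.urlopen(METRICS_URL) as resp:
    text = resp.read().decode()

buckets, total = [], 0.0
for line in text.splitlines():
    if line.startswith("etcd_disk_wal_fsync_duration_seconds_bucket"):
        # A histogram line looks like: name{le="0.004"} 123456
        upper = float(line.split('le="')[1].split('"')[0])
        count = float(line.rsplit(" ", 1)[1])
        buckets.append((upper, count))
    elif line.startswith("etcd_disk_wal_fsync_duration_seconds_count"):
        total = float(line.rsplit(" ", 1)[1])

# Buckets are cumulative, so the p99 is the smallest upper bound that
# already covers 99% of all observed fsyncs.
buckets.sort()
p99 = next((u for u, c in buckets if c >= 0.99 * total), float("inf"))
print(f"approx. WAL fsync p99: {p99:.3f}s over {int(total)} samples")
# On a healthy disk etcd expects fsync p99 in the low milliseconds; values in
# the tens or hundreds of ms line up with "request timed out" under load.
```

The etcd hardware guide linked above also describes fio-based disk benchmarks if the metrics point at the disk.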
@zhuwenxing is this happening again? /unassign /assign @zhuwenxing
Not reproduced in 2.2.0-20230417-52fb48a3