[Bug]: Flush failed due to `etcdserver: request timed out` after standalone pod kill chaos test

Open zhuwenxing opened this issue 1 year ago • 4 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:2.2.0-20230310-b2ece6a5
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka):kafka    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:23 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:56)

[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:33 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=failed to flush 440006098629104321, etcdserver: request timed out)>, <Time:{'RPC start': '2023-03-10 23:26:23.189213', 'RPC error': '2023-03-10 23:26:33.193914'}> (decorators.py:108)

[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:33 - ERROR - ci_test]: Traceback (most recent call last):

[2023-03-10T23:28:40.470Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-03-10T23:28:40.470Z]     res = func(*args, **_kwargs)

[2023-03-10T23:28:40.470Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-03-10T23:28:40.470Z]     return func(*arg, **kwargs)

[2023-03-10T23:28:40.470Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 280, in flush

[2023-03-10T23:28:40.470Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2023-03-10T23:28:40.470Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-03-10T23:28:40.470Z]     raise e

[2023-03-10T23:28:40.470Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-03-10T23:28:40.470Z]     return func(*args, **kwargs)

[2023-03-10T23:28:40.470Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-03-10T23:28:40.470Z]     ret = func(self, *args, **kwargs)

[2023-03-10T23:28:40.470Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-03-10T23:28:40.470Z]     raise e

[2023-03-10T23:28:40.470Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-03-10T23:28:40.470Z]     return func(self, *args, **kwargs)

[2023-03-10T23:28:40.470Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 881, in flush

[2023-03-10T23:28:40.470Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-03-10T23:28:40.470Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=failed to flush 440006098629104321, etcdserver: request timed out)>

[2023-03-10T23:28:40.470Z]  (api_request.py:39)

[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:33 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=failed to flush 440006098629104321, etcdserver: request timed out)> (api_request.py:40)

[2023-03-10T23:28:40.470Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-03-10T23:28:40.470Z] =========================== short test summary info ============================

[2023-03-10T23:28:40.471Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__StjqnsUr] - AssertionError

[2023-03-10T23:28:40.471Z] =================== 1 failed, 10 passed in 145.22s (0:02:25) ===================

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

chaos type: pod-failure
image tag: 2.2.0-20230310-b2ece6a5
target pod: standalone
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release-cron/detail/chaos-test-kafka-for-release-cron/2596/pipeline
log:
artifacts-standalone-pod-failure-2596-server-logs.tar.gz
artifacts-standalone-pod-failure-2596-pytest-logs.tar.gz

Anything else?

No response

zhuwenxing, Mar 13 '23 02:03

/assign @jiaoew1991
/unassign

yanliang567, Mar 13 '23 07:03

/assign @XuanYang-cn
/unassign

jiaoew1991, Mar 23 '23 02:03

`etcdserver: request timed out` is raised by the etcd server.

From the etcd logs, etcd is most likely slow because of slow disk I/O.

etcd is sensitive to disk I/O performance; see https://etcd.io/docs/v3.5/op-guide/hardware/#disks

Fast disks are the most critical factor for etcd deployment performance and stability. A slow disk will increase etcd request latency and potentially hurt cluster stability. Additionally, etcd will also incrementally checkpoint its state to disk so it can truncate this log. If these writes take too long, heartbeats may time out and trigger an election
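To check whether the disk backing etcd is the bottleneck, a rough probe of fsync latency can help. The sketch below is not part of Milvus or etcd: the function names, the 2 KiB block size, and the number of writes are illustrative assumptions, and a dedicated tool such as fio gives more reliable numbers. It appends small blocks to a temporary file in a chosen directory, calls `os.fsync` after each write, and reports a nearest-rank percentile of the observed latencies.

```python
import math
import os
import tempfile
import time


def measure_fsync_latencies(directory, writes=200, block_size=2048):
    """Write small blocks to a temp file in `directory`, fsync after each one,
    and return the fsync latencies in seconds.

    `writes` and `block_size` are illustrative defaults, not etcd settings.
    """
    latencies = []
    payload = b"\0" * block_size
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        fd = f.fileno()
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the write to stable storage
            latencies.append(time.perf_counter() - start)
    return latencies


def percentile(values, pct):
    """Return the nearest-rank percentile of `values` (pct in 0..100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical target: point this at the directory on the etcd data disk.
    target_dir = tempfile.gettempdir()
    samples = measure_fsync_latencies(target_dir)
    print(f"p50 fsync: {percentile(samples, 50) * 1000:.2f} ms")
    print(f"p99 fsync: {percentile(samples, 99) * 1000:.2f} ms")
```

Run it on the node hosting etcd, with `target_dir` on the same volume as the etcd data directory. A p99 in the tens of milliseconds or higher would be consistent with the slow-disk explanation above, though it does not prove it.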

XuanYang-cn, Mar 24 '23 03:03

@zhuwenxing, is this still happening?
/unassign
/assign @zhuwenxing

XuanYang-cn, Apr 17 '23 02:04

Not reproduced in 2.2.0-20230417-52fb48a3

zhuwenxing, Apr 18 '23 02:04