[Bug]: Flush failed due to `etcdserver: request timed out` after standalone pod kill chaos test
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.2.0-20230310-b2ece6a5
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka): kafka
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:23 - DEBUG - ci_test]: (api_request) : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:56)
[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:33 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=failed to flush 440006098629104321, etcdserver: request timed out)>, <Time:{'RPC start': '2023-03-10 23:26:23.189213', 'RPC error': '2023-03-10 23:26:33.193914'}> (decorators.py:108)
[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:33 - ERROR - ci_test]: Traceback (most recent call last):
[2023-03-10T23:28:40.470Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2023-03-10T23:28:40.470Z] res = func(*args, **_kwargs)
[2023-03-10T23:28:40.470Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2023-03-10T23:28:40.470Z] return func(*arg, **kwargs)
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 280, in flush
[2023-03-10T23:28:40.470Z] conn.flush([self.name], timeout=timeout, **kwargs)
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2023-03-10T23:28:40.470Z] raise e
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2023-03-10T23:28:40.470Z] return func(*args, **kwargs)
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2023-03-10T23:28:40.470Z] ret = func(self, *args, **kwargs)
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
[2023-03-10T23:28:40.470Z] raise e
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2023-03-10T23:28:40.470Z] return func(self, *args, **kwargs)
[2023-03-10T23:28:40.470Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 881, in flush
[2023-03-10T23:28:40.470Z] raise MilvusException(response.status.error_code, response.status.reason)
[2023-03-10T23:28:40.470Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=failed to flush 440006098629104321, etcdserver: request timed out)>
[2023-03-10T23:28:40.470Z] (api_request.py:39)
[2023-03-10T23:28:40.470Z] [2023-03-10 23:26:33 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=failed to flush 440006098629104321, etcdserver: request timed out)> (api_request.py:40)
[2023-03-10T23:28:40.470Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------
[2023-03-10T23:28:40.470Z] =========================== short test summary info ============================
[2023-03-10T23:28:40.471Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__StjqnsUr] - AssertionError
[2023-03-10T23:28:40.471Z] =================== 1 failed, 10 passed in 145.22s (0:02:25) ===================
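For reference, the failing step is just `Collection.flush` with a 180 s timeout, as shown in the traceback above. A minimal pymilvus sketch of that call (host, port, and collection name are illustrative assumptions, not taken from the CI environment):

```python
from pymilvus import Collection, MilvusException, connections

# Connect to the standalone Milvus instance (host/port are assumptions).
connections.connect(alias="default", host="127.0.0.1", port="19530")

# The chaos test iterates over checker-created collections; any existing
# collection name works for this sketch.
collection = Collection("Checker__StjqnsUr")

try:
    # Same call and timeout as the step that failed in the chaos test.
    collection.flush(timeout=180)
except MilvusException as e:
    # Under the pod-failure chaos this surfaced as
    # "failed to flush ..., etcdserver: request timed out".
    print(f"flush failed: {e}")
```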
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
chaos type: pod-failure
image tag: 2.2.0-20230310-b2ece6a5
target pod: standalone
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release-cron/detail/chaos-test-kafka-for-release-cron/2596/pipeline
log:
- artifacts-standalone-pod-failure-2596-server-logs.tar.gz
- artifacts-standalone-pod-failure-2596-pytest-logs.tar.gz
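For context, the experiment disrupts the standalone pod while the test suite runs; in the CI job this is driven by a chaos-mesh pod-failure experiment. A rough pod-kill approximation with the official kubernetes Python client (namespace and label selector are assumptions, adjust to the actual deployment):

```python
from kubernetes import client, config

# Assumes a kubeconfig with access to the cluster running Milvus.
config.load_kube_config()
v1 = client.CoreV1Api()

# Namespace and label selector are assumptions for a helm-deployed standalone.
namespace = "chaos-testing"
pods = v1.list_namespaced_pod(
    namespace=namespace,
    label_selector="app.kubernetes.io/name=milvus",
)

for pod in pods.items:
    # Deleting the pod approximates a "pod kill"; chaos-mesh's pod-failure
    # experiment instead makes the pod unavailable for a fixed duration.
    v1.delete_namespaced_pod(name=pod.metadata.name, namespace=namespace)
    print(f"deleted {pod.metadata.name}")
```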
Anything else?
No response
/assign @jiaoew1991 /unassign
/assign @XuanYang-cn /unassign
The `etcdserver: request timed out` error is raised by the etcd server itself.
From the etcd logs, the most likely cause is that etcd is slow because of poor disk I/O.
etcd is sensitive to disk I/O performance: https://etcd.io/docs/v3.5/op-guide/hardware/#disks
> Fast disks are the most critical factor for etcd deployment performance and stability. A slow disk will increase etcd request latency and potentially hurt cluster stability. Additionally, etcd will also incrementally checkpoint its state to disk so it can truncate this log. If these writes take too long, heartbeats may time out and trigger an election.
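One way to confirm that a slow disk is the culprit is to look at etcd's WAL fsync latency histogram. A hedged sketch that scrapes etcd's Prometheus metrics and estimates the p99 fsync time (the metric names are standard etcd metrics, but the endpoint URL and the healthy-latency rule of thumb are assumptions about this setup):

```python
import urllib.request

# etcd serves Prometheus-format metrics on its client port (URL is an
# assumption for a local standalone deployment).
METRICS_URL = "http://localhost:2379/metrics"

with urllib.request.urlopen(METRICS_URL) as resp:
    text = resp.read().decode()

buckets, total = [], 0.0
for line in text.splitlines():
    if line.startswith("etcd_disk_wal_fsync_duration_seconds_bucket"):
        # A histogram line looks like: name{le="0.004"} 123456
        upper = float(line.split('le="')[1].split('"')[0])
        count = float(line.rsplit(" ", 1)[1])
        buckets.append((upper, count))
    elif line.startswith("etcd_disk_wal_fsync_duration_seconds_count"):
        total = float(line.rsplit(" ", 1)[1])

# Buckets are cumulative, so the p99 is the smallest upper bound that
# already covers 99% of all observed fsyncs.
buckets.sort()
p99 = next((u for u, c in buckets if c >= 0.99 * total), float("inf"))
print(f"approx. WAL fsync p99: {p99:.3f}s over {int(total)} samples")
# On a healthy disk etcd expects fsync p99 in the low milliseconds; values in
# the tens or hundreds of ms line up with "request timed out" under load.
```

The etcd hardware guide linked above also describes fio-based disk benchmarks if the metrics point at the disk.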
@zhuwenxing is this happening again? /unassign /assign @zhuwenxing
Not reproduced in 2.2.0-20230417-52fb48a3