
Attempt to kill a single instance of etcd on SNO results in crash since the cluster is non-responsive while the pod reconciles.

Open achuzhoy opened this issue 1 year ago • 1 comments

How to reproduce:

Have a config with:

    chaos_scenarios:                      # List of policies/chaos scenarios to load
        - container_scenarios:            # List of chaos pod scenarios to load
            - - scenarios/openshift/container_etcd.yml

```
cat scenarios/openshift/container_etcd.yml
scenarios:
- name: "kill etcd container"
  namespace: "openshift-etcd"
  label_selector: "k8s-app=etcd"
  container_name: "etcd"
  action: "kill 1"
  count: 1
  expected_recovery_time: 60
```

Run `python3.9 run_kraken.py --config config/kill-etcd.yaml`

result:

```
 _              _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
|   <| | | (_| |   <  __/ | | |
|_|\_\_|  \__,_|_|\_\___|_| |_|

2023-05-25 12:28:26,437 [INFO] Starting kraken
2023-05-25 12:28:26,449 [INFO] Initializing client to talk to the Kubernetes cluster
2023-05-25 12:28:29,884 [INFO] Publishing kraken status at http://0.0.0.0:8085
2023-05-25 12:28:29,885 [INFO] Publishing kraken status at http://0.0.0.0:8085
2023-05-25 12:28:29,886 [INFO] Starting http server at http://0.0.0.0:8085

2023-05-25 12:28:29,886 [INFO] Fetching cluster info
2023-05-25 12:28:29,894 [INFO] Cluster version is 4.13.0
2023-05-25 12:28:29,895 [INFO] Server URL: https://api.sno-0.qe.lab.redhat.com:6443
2023-05-25 12:28:29,895 [INFO] Generated a uuid for the run: 4c51a145-9664-4339-8735-a4a09da5d43f
2023-05-25 12:28:29,895 [INFO] Daemon mode not enabled, will run through 1 iterations

2023-05-25 12:28:29,895 [INFO] Executing scenarios for iteration 0
2023-05-25 12:28:29,895 [INFO] connection set up
127.0.0.1 - - [25/May/2023 12:28:29] "GET / HTTP/1.1" 200 -
2023-05-25 12:28:29,896 [INFO] response RUN
2023-05-25 12:28:29,897 [INFO] Running container scenarios
2023-05-25 12:28:30,798 [INFO] Killing container etcd in pod etcd-sno-0-0 (ns openshift-etcd)
2023-05-25 12:28:30,953 [INFO] Scenario kill etcd container successfully injected
2023-05-25 12:29:11,186 [WARNING] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /api/v1/namespaces/openshift-etcd/pods?pretty=True
2023-05-25 12:29:11,234 [WARNING] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /api/v1/namespaces/openshift-etcd/pods?pretty=True
2023-05-25 12:29:11,236 [WARNING] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb9b062ce20>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/openshift-etcd/pods?pretty=True
Traceback (most recent call last):
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb9b062cc70>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/krkn/krkn/run_kraken.py", line 421, in <module>
    main(options.cfg)
  File "/root/krkn/krkn/run_kraken.py", line 218, in main
    failed_post_scenarios = pod_scenarios.container_run(
  File "/root/krkn/krkn/kraken/pod_scenarios/setup.py", line 92, in container_run
    failed_post_scenarios = check_failed_containers(
  File "/root/krkn/krkn/kraken/pod_scenarios/setup.py", line 191, in check_failed_containers
    pod_output = kubecli.get_pod_info(killed_container[0], killed_container[1])
  File "/root/krkn/krkn/kraken/kubernetes/client.py", line 544, in get_pod_info
    pod_exists = check_if_pod_exists(name=name, namespace=namespace)
  File "/root/krkn/krkn/kraken/kubernetes/client.py", line 721, in check_if_pod_exists
    pod_list = list_pods(namespace=namespace)
  File "/root/krkn/krkn/kraken/kubernetes/client.py", line 209, in list_pods
    ret = cli.list_namespaced_pod(namespace, pretty=True)
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod
    return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/rest.py", line 241, in GET
    return self.request("GET", url,
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/rest.py", line 214, in request
    r = self.pool_manager.request(method, url,
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/request.py", line 74, in request
    return self.request_encode_url(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.sno-0.qe.lab.redhat.com', port=6443): Max retries exceeded with url: /api/v1/namespaces/openshift-etcd/pods?pretty=True (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb9b062cc70>: Failed to establish a new connection: [Errno 111] Connection refused'))
```

This is probably because the cluster can't be contacted while etcd restarts, but the app shouldn't crash.
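To illustrate the kind of handling I mean, here is a minimal sketch, not krkn code. The helper `call_until_reachable` and its parameters are hypothetical names made up for this example. It keeps retrying a call while the API server is unreachable and gives up quietly after a deadline instead of letting the exception propagate. In krkn the `retriable` tuple would presumably need to include `urllib3.exceptions.MaxRetryError` (the exception at the bottom of the traceback above), and the timeout could be taken from `expected_recovery_time`; both are assumptions on my part.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def call_until_reachable(
    fn: Callable[[], T],
    timeout: float,
    interval: float = 5.0,
    retriable: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[T]:
    """Call fn() until it succeeds or `timeout` seconds have elapsed.

    Returns fn()'s result, or None if every attempt raised a retriable
    error before the deadline. Other exceptions propagate unchanged.
    """
    deadline = clock() + timeout
    while True:
        try:
            return fn()
        except retriable:
            # The API server is still down; stop once the deadline has passed.
            if clock() >= deadline:
                return None
            sleep(interval)


# Stand-in for a pod listing call: refuses two connections, then answers.
attempts = {"count": 0}


def flaky_list_pods():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionRefusedError(111, "Connection refused")
    return ["etcd-sno-0-0"]


pods = call_until_reachable(flaky_list_pods, timeout=60, interval=0.01)
print(pods)
```

With something like this, a `None` result could be recorded as a failed post-scenario check rather than ending the whole run with a traceback.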

achuzhoy, May 25 '23 16:05

cc @tsebastiani

This one still reproduces in OCP 4.13.6

achuzhoy, Aug 01 '23 16:08