krkn
Attempting to kill the single etcd instance on SNO crashes krkn, because the cluster is unresponsive while the pod reconciles.
How to reproduce:
Have a config with
    chaos_scenarios:                  # List of policies/chaos scenarios to load
        - container_scenarios:        # List of chaos pod scenarios to load
            - - scenarios/openshift/container_etcd.yml
`
cat scenarios/openshift/container_etcd.yml
scenarios:
- name: "kill etcd container"
  namespace: "openshift-etcd"
  label_selector: "k8s-app=etcd"
  container_name: "etcd"
  action: "kill 1"
  count: 1
  expected_recovery_time: 60
`
Run `python3.9 run_kraken.py --config config/kill-etcd.yaml`
Result:
`
 _              _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
|   <| | | (_| |   <  __/ | | |
|_|\_\_|  \__,_|_|\_\___|_| |_|
2023-05-25 12:28:26,437 [INFO] Starting kraken
2023-05-25 12:28:26,449 [INFO] Initializing client to talk to the Kubernetes cluster
2023-05-25 12:28:29,884 [INFO] Publishing kraken status at http://0.0.0.0:8085
2023-05-25 12:28:29,885 [INFO] Publishing kraken status at http://0.0.0.0:8085
2023-05-25 12:28:29,886 [INFO] Starting http server at http://0.0.0.0:8085
2023-05-25 12:28:29,886 [INFO] Fetching cluster info
2023-05-25 12:28:29,894 [INFO] Cluster version is 4.13.0
2023-05-25 12:28:29,895 [INFO] Server URL: https://api.sno-0.qe.lab.redhat.com:6443
2023-05-25 12:28:29,895 [INFO] Generated a uuid for the run: 4c51a145-9664-4339-8735-a4a09da5d43f
2023-05-25 12:28:29,895 [INFO] Daemon mode not enabled, will run through 1 iterations
2023-05-25 12:28:29,895 [INFO] Executing scenarios for iteration 0
2023-05-25 12:28:29,895 [INFO] connection set up
127.0.0.1 - - [25/May/2023 12:28:29] "GET / HTTP/1.1" 200 -
2023-05-25 12:28:29,896 [INFO] response RUN
2023-05-25 12:28:29,897 [INFO] Running container scenarios
2023-05-25 12:28:30,798 [INFO] Killing container etcd in pod etcd-sno-0-0 (ns openshift-etcd)
2023-05-25 12:28:30,953 [INFO] Scenario kill etcd container successfully injected
2023-05-25 12:29:11,186 [WARNING] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /api/v1/namespaces/openshift-etcd/pods?pretty=True
2023-05-25 12:29:11,234 [WARNING] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /api/v1/namespaces/openshift-etcd/pods?pretty=True
2023-05-25 12:29:11,236 [WARNING] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb9b062ce20>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/openshift-etcd/pods?pretty=True
Traceback (most recent call last):
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb9b062cc70>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/krkn/krkn/run_kraken.py", line 421, in
`
This is probably because the cluster can't be contacted while etcd is restarting, but krkn shouldn't crash.
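One possible way to tolerate the outage, as a rough sketch only: keep retrying the failing API call until the scenario's `expected_recovery_time` has elapsed, and only then let the error propagate. The helper name `call_with_recovery_window` and its parameters are hypothetical and not part of krkn. The traceback above is truncated, so the exact exception that reaches `run_kraken.py` is not shown; the default `transient` tuple below covers the built-in `ConnectionError` family (which includes `ConnectionRefusedError` and `ConnectionResetError`), and any exception raised by the Kubernetes client's HTTP layer would likely need to be added to it.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_recovery_window(
    func: Callable[[], T],
    recovery_time: float,
    interval: float = 5.0,
    transient: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call func repeatedly until it succeeds or recovery_time seconds pass.

    Hypothetical helper, not krkn code. Exceptions listed in `transient`
    are swallowed while the deadline has not been reached; anything else,
    or a transient error after the deadline, is re-raised to the caller.
    """
    deadline = clock() + recovery_time
    while True:
        try:
            return func()
        except transient:
            # The API server is expected to be unreachable while etcd restarts.
            if clock() >= deadline:
                raise
            sleep(interval)


if __name__ == "__main__":
    # Stand-in for the pod-listing call: fails twice, then succeeds.
    attempts = {"n": 0}

    def flaky_list_pods():
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionRefusedError(111, "Connection refused")
        return ["etcd-sno-0-0"]

    print(call_with_recovery_window(flaky_list_pods, recovery_time=60, interval=0.01))
```

With a wrapper like this, the recovery check would report a failed scenario once the window expires instead of ending the whole run with a stack trace.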
cc @tsebastiani
This one still reproduces in OCP 4.13.6