Unexpected WebSocketConnectionClosedException when streaming long-running commands; missing pings?
What happened (please include outputs or screenshots):

After exactly 300 seconds of idle time, exec commands streamed with `stream` lose their connection, even though:
- the pod is still running,
- the process that was exec'd in the container is still running (checked with `kubectl exec POD_NAME ps aux`),
- the network otherwise appears to be healthy,
- no warnings are logged by kube-apiserver.
The following stack trace is given:

```
  File "kubernetes/stream/ws_client.py", line 141, in readline_stderr
    return self.readline_channel(STDERR_CHANNEL, timeout=timeout)
  File "kubernetes/stream/ws_client.py", line 104, in readline_channel
    self.update(timeout=(timeout - time.time() + start))
  File "kubernetes/stream/ws_client.py", line 192, in update
    op_code, frame = self.sock.recv_data_frame(True)
  File "websocket/_core.py", line 401, in recv_data_frame
    frame = self.recv_frame()
  File "websocket/_core.py", line 440, in recv_frame
    return self.frame_buffer.recv_frame()
  File "websocket/_abnf.py", line 338, in recv_frame
    self.recv_header()
  File "websocket/_abnf.py", line 294, in recv_header
    header = self.recv_strict(2)
  File "websocket/_abnf.py", line 373, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "websocket/_core.py", line 524, in _recv
    return recv(self.sock, bufsize)
  File "websocket/_socket.py", line 122, in recv
    raise WebSocketConnectionClosedException(
websocket._exceptions.WebSocketConnectionClosedException: Connection to remote host was lost.
```
What you expected to happen:
- The connection is not lost.
- If it is lost, the connection is automatically reconnected (up to some configured timeout limit).

Reading https://github.com/websocket-client/websocket-client#long-lived-connection, there is a description of a built-in way in the websocket library to do this. It appears that the `stream` command is not using this method.
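For reference, the long-lived pattern that the linked README describes amounts to letting `WebSocketApp.run_forever()` send pings from its own internal thread. A minimal sketch of that pattern (the URL and handlers are placeholders, not the exec endpoint the kubernetes client uses):

```python
import websocket

def on_message(ws, message):
    print(message)

def on_error(ws, error):
    print(error)

# ws://echo.example/ is a placeholder URL for illustration only.
ws = websocket.WebSocketApp(
    "ws://echo.example/",
    on_message=on_message,
    on_error=on_error,
)
# run_forever() spawns an internal thread that sends a ping frame every
# ping_interval seconds, which keeps an idle connection from timing out.
ws.run_forever(ping_interval=60, ping_timeout=10)
```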
How to reproduce it (as minimally and precisely as possible): (Mostly inspired by https://github.com/kubernetes-client/python/blob/master/examples/pod_exec.py)
```python
from kubernetes.stream import stream

...
...

s = stream(
    _client.connect_get_namespaced_pod_exec,
    _pod.metadata.name,
    _pod.metadata.namespace,
    command=["sleep", "100000"],
    async_req=False,
    stderr=True,
    stdin=False,
    stdout=True,
    tty=False,
    _preload_content=False,
)

def tail_logs(s):
    if (ret_stdout := s.readline_stdout(timeout=1)) is not None:
        print(ret_stdout)

while (retcode := s.returncode) is None:
    tail_logs(s)
tail_logs(s)  # drain any remaining output after the command exits
```
Anything else we need to know?:

I've experimented with adding `s.sock.ping()` in each iteration of the `tail_logs` function, and it appears to improve the situation, but sometimes it still gets stuck inside `readline_stdout`, despite the `timeout=1` parameter being set. :raised_eyebrow: It gets stuck for so long that the ping never has a chance to run within the next 300-second timeout.

Does one need to run `ping` in a separate thread for this to work? That's how it is done inside `WebSocketApp`, as used by the long-lived connection example from the websocket library. Can one use `WebSocketApp` together with `stream`, or do I need to implement a similar ping feature myself when using `stream`?

I'm also curious where the 300 seconds come from. Is there a way to discover this time through the connection handshake, or is it a setting in my cluster/Azure?
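One way to decouple the pings from the (possibly blocking) read loop is a background thread. A rough sketch, assuming the `s` WSClient from the reproduction above; `keepalive_pings` and the 60-second interval are my own choices, not part of the client API:

```python
import threading

def keepalive_pings(ws_client, stop_event, interval=60):
    # Send a WebSocket ping frame every `interval` seconds so the
    # connection never sits idle for the full 300 s window.
    while not stop_event.wait(interval):
        try:
            ws_client.sock.ping()
        except Exception:
            break  # connection already gone; let the main loop handle it

stop = threading.Event()
threading.Thread(target=keepalive_pings, args=(s, stop), daemon=True).start()

# ... run the tail_logs loop as before ...

stop.set()  # stop pinging once the command has finished
```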
Environment:
- Kubernetes version: Azure AKS v1.23.5
- Python version: 3.9.7
- Python client version: 23.6.0
You need to have TCP keep-alives or some small amount of TCP traffic to keep the connection alive.

However, it's also worth noting that `exec` was never really meant for long-running connections; there are much better ways to achieve the same thing via other types of APIs.

More discussion here: https://stackoverflow.com/questions/50706483/what-azure-kubernetes-aks-time-out-happens-to-disconnect-connections-in-out
Thank you.
Where does one enable TCP keep-alive? This is normally a configuration on the socket, as far as I know, but such options are not exposed all the way up to the kubernetes_client APIs. I can see that the `WebSocket` constructor takes a `sockopt` argument as input, but `kubernetes.stream.ws_client.create_websocket` does not set this argument.

I'm also curious which alternative APIs you suggest that are better than `exec`? (I understand it's better for pods to isolate their own workloads, but we have some scenarios where we have to initiate long operations via external triggers on already-running containers, which otherwise take very long to load their dataset and re-initialize. Building our own application-level API in the container is something we want to avoid, and it is not really possible since the application is from a third party.)

If `exec` is not supported for this, it would be good if its limitations were explained in the docs.
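For illustration, this is roughly what wiring keep-alive options through websocket-client's `sockopt` parameter looks like on a standalone connection. The URL is a placeholder, the timing values are arbitrary, and the `TCP_KEEP*` constants are Linux-specific; nothing here is currently plumbed through `kubernetes.stream`:

```python
import socket
import websocket

# TCP keep-alive socket options; TCP_KEEPIDLE/KEEPINTVL/KEEPCNT exist on Linux.
sockopt = [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),    # enable keep-alive probes
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),  # idle time before first probe
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30), # interval between probes
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),    # failed probes before giving up
]

# Placeholder URL; the point is only where the sockopt list would be passed.
ws = websocket.create_connection("wss://example.invalid/exec", sockopt=sockopt)
```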
I can add that we managed to get a stable solution by adding pings. We also had to replace `readline_stdout(timeout=1)` with `update()` + `peek_stdout()` + `read_stdout()` instead, exactly as in the example at https://github.com/kubernetes-client/python/blob/master/examples/pod_exec.py, plus pings. While this works, it's very intricate and easy to get wrong; an easier solution would be good.
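Concretely, the stable loop described above looks roughly like this (a sketch, assuming the `s` WSClient from the reproduction earlier; the 60-second ping interval is an arbitrary value kept safely under the observed 300-second cutoff):

```python
import time

PING_INTERVAL = 60  # arbitrary, but well under the observed 300 s idle cutoff
last_ping = time.monotonic()

while s.is_open():
    s.update(timeout=1)              # pump the websocket for at most ~1 s
    if s.peek_stdout():
        print(s.read_stdout(), end="")
    if s.peek_stderr():
        print(s.read_stderr(), end="")
    if time.monotonic() - last_ping >= PING_INTERVAL:
        s.sock.ping()                # keep-alive traffic during quiet periods
        last_ping = time.monotonic()

s.close()
```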
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
@hterik Are you able to share your solution? I am facing the same issue.