python icon indicating copy to clipboard operation
python copied to clipboard

Unexpected WebSocketConnectionClosedException when streaming long running commands, missing pings?

Open hterik opened this issue 2 years ago • 25 comments

What happened (please include outputs or screenshots): After exactly 300sec of idle, exec commands streamed with stream loose their connection, even though

  • pod is still running,
  • The process that was execd in container is still running (checked with kubectl exec POD_NAME ps aux)
  • Network otherwise appears to be healthy.
  • No warnings logged by kube-apiserver

Following stack trace is given

  File "kubernetes/stream/ws_client.py", line 141, in readline_stderr
    return self.readline_channel(STDERR_CHANNEL, timeout=timeout)
  File "kubernetes/stream/ws_client.py", line 104, in readline_channel
    self.update(timeout=(timeout - time.time() + start))
  File "kubernetes/stream/ws_client.py", line 192, in update
    op_code, frame = self.sock.recv_data_frame(True)
  File "websocket/_core.py", line 401, in recv_data_frame
    frame = self.recv_frame()
  File "websocket/_core.py", line 440, in recv_frame
    return self.frame_buffer.recv_frame()
  File "websocket/_abnf.py", line 338, in recv_frame
    self.recv_header()
  File "websocket/_abnf.py", line 294, in recv_header
    header = self.recv_strict(2)
  File "websocket/_abnf.py", line 373, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "websocket/_core.py", line 524, in _recv
    return recv(self.sock, bufsize)
  File "websocket/_socket.py", line 122, in recv
    raise WebSocketConnectionClosedException(
websocket._exceptions.WebSocketConnectionClosedException: Connection to remote host was lost. 

What you expected to happen:

  • Connection is not lost
  • If it is lost, connection is automatically reconnected (up to some configured timeout limit) Reading https://github.com/websocket-client/websocket-client#long-lived-connection, there is a description on how to use a built in way from the websocket library to do this. It appears that the stream command is not using this method.

How to reproduce it (as minimally and precisely as possible): (Mostly inspired by https://github.com/kubernetes-client/python/blob/master/examples/pod_exec.py)

from kubernetes.stream import stream
...
...
s = stream(
    _client.connect_get_namespaced_pod_exec,
    _pod.metadata.name,
    _pod.metadata.namespace,
    command=["sleep", "100000"],           
    async_req=False,
    stderr=True,
    stdin=False,
    stdout=True,
    tty=False,
    _preload_content=False,
)

def tail_logs():
    if (ret_stdout := s.readline_stdout(timeout=1)) is not None:
        print(ret_stdout)

while (retcode := s.returncode) is None:
    tail_logs(s)
tail_logs(s)

Anything else we need to know?: I've experimented by adding s.sock.ping() in each iteration of the tail_logs function and it appears to improve the situation but some times it still get stuck inside readline_stdout, despite the timeout=1 parameter being set. :raised_eyebrow: It get stuck so long that ping never has chance to run within the next 300sec timeout. Does one need to run ping in a separate thread for this to work? That's how it is done inside the WebSocketApp, as used by the long lived connection example from websocket library. Can one use WebSocketapp together with stream or do i need to implement similar ping-feature myself when using stream?

I'm also curious where 300sec come from, is there a way to find this time through connection handshake or is it a setting in my cluster/azure?

Environment:

  • Kubernetes version: Azure AKS v1.23.5
  • Python version 3.9.7
  • Python client version 23.6.0

hterik avatar Jun 22 '22 09:06 hterik

You need to have TCP keep alives or some small amount of TCP traffic to keep the connection alive.

However, it's also worth noting that exec was never really meant for long-running connections, there are much better ways to achieve the same thing via other types of APIs.

brendandburns avatar Jul 11 '22 16:07 brendandburns

More discussion here: https://stackoverflow.com/questions/50706483/what-azure-kubernetes-aks-time-out-happens-to-disconnect-connections-in-out

brendandburns avatar Jul 11 '22 16:07 brendandburns

Thank you. Where does one enable TCP keep alive? This is normally a configuration on the socket from what i know, but such options are not exposed all the way up to the kubernetes_client apis. I can see that the WebSocket constructor takes sockopt argument as input, but the kubernetes.stream.ws_client.create_websocket does not set this argument.

Also curios about which alternative APIs you suggest that are better than exec? (I understand it's better that pods isolate their own workloads, but we have some scenarios where where we have to initiate longer operations via external triggers on already running containers, that take very long to load their dataset and re-initialize otherwise. Building our own application-level API in the container is something we want to avoid, and not really possible since the application is from third-party.) If exec is not supported, it would be good if the limitations using it was explained in docs.


Can add that we managed to get a stable solution by adding pings. We also had to replace the readline_stdout(timeout=1) with update() + peek_stdout() + read_stdout() instead, so exactly as the example in https://github.com/kubernetes-client/python/blob/master/examples/pod_exec.py, plus pings. While this works, it's very intricate and easy to get wrong, an easier solution would be good.

hterik avatar Jul 13 '22 06:07 hterik

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 11 '22 07:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 10 '22 07:11 k8s-triage-robot

/remove-lifecycle rotten

hterik avatar Nov 10 '22 12:11 hterik

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 08 '23 13:02 k8s-triage-robot

/remove-lifecycle rotten

hterik avatar Feb 24 '23 15:02 hterik

@hterik Are you able to share your solution? I am facing the same issue.

jaredbischof avatar Mar 09 '23 16:03 jaredbischof

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Apr 08 '23 17:04 k8s-triage-robot

/remove-lifecycle rotten

hterik avatar Apr 11 '23 05:04 hterik

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 10 '23 06:07 k8s-triage-robot

/remove-lifecycle stale

hterik avatar Dec 01 '23 14:12 hterik

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 29 '24 15:02 k8s-triage-robot

/remove-lifecycle stale

hterik avatar Mar 04 '24 14:03 hterik