Improve Reliability and Error Handling of remotecommand

Open learnitall opened this issue 10 months ago • 0 comments

The cilium-cli relies on the remotecommand library to interact with Cilium Agents running inside a cluster. As a result, the cilium-cli connectivity tests essentially perform a stress test on the remotecommand library and streaming functionality of the Kubernetes API Server, which has helped identify race conditions upstream. @bimmlerd did a fantastic analysis and opened the following PRs:

  • https://github.com/kubernetes/kubernetes/pull/124335
  • https://github.com/kubernetes/kubernetes/pull/123705

While these are being worked on, it makes sense to change the cilium-cli so that its use of remotecommand is more robust.

This issue tracks defining and implementing a protocol (i.e., a set of rules) that the cilium-cli will use to detect unexpected errors or race conditions when using the remote executor. To start, https://github.com/kubernetes/kubernetes/pull/124335 can be addressed by something like the following:

  1. Before executing the remote command, we generate a random integer.
  2. Each command that is executed is wrapped in a tiny script which executes the given command and then echoes the aforementioned random integer on a new line to both stderr and stdout.
  3. Assert that the full integer is received in both stdout and stderr. If the integer was not received in full, some kind of race condition or flake occurred. From here, we can determine if we need to retry.
  4. Strip the integer out of stdout and stderr and then return the output to the user.
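
The four steps above could be sketched roughly as follows. This is only an illustration, not existing cilium-cli code: the names `newSentinel`, `wrapCommand`, `stripSentinel`, `checkStreams`, and `errIncomplete` are hypothetical, and the wrapper assumes a POSIX `sh` is available in the target container. The call into remotecommand itself is left out; the caller would run the wrapped command and pass the captured stdout and stderr to `checkStreams`.

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
	"math/big"
	"strings"
)

// errIncomplete (hypothetical name) reports that a stream ended without the
// expected sentinel, i.e. the output may have been truncated.
var errIncomplete = errors.New("sentinel missing: output may be truncated")

// sentinelScript runs the command passed as "$@", then prints the sentinel
// (passed as $0) on its own line to both stdout and stderr, and finally
// exits with the original command's exit code.
// Assumption: the container image ships a POSIX sh.
const sentinelScript = `"$@"; rc=$?; printf '\n%s\n' "$0"; printf '\n%s\n' "$0" >&2; exit "$rc"`

// newSentinel generates the random integer from step 1, as a decimal string.
func newSentinel() (string, error) {
	limit := new(big.Int).Lsh(big.NewInt(1), 62) // upper bound 2^62
	n, err := rand.Int(rand.Reader, limit)
	if err != nil {
		return "", err
	}
	return n.String(), nil
}

// wrapCommand implements step 2. The sentinel and the original command are
// passed as positional arguments, so no shell quoting of cmd is needed.
func wrapCommand(sentinel string, cmd []string) []string {
	return append([]string{"sh", "-c", sentinelScript, sentinel}, cmd...)
}

// stripSentinel implements steps 3 and 4 for one stream: it checks that the
// stream ends with the full sentinel line and removes it.
func stripSentinel(out, sentinel string) (string, error) {
	suffix := "\n" + sentinel + "\n"
	if !strings.HasSuffix(out, suffix) {
		return out, errIncomplete
	}
	return strings.TrimSuffix(out, suffix), nil
}

// checkStreams applies stripSentinel to both streams. A non-nil error means
// a race or flake probably occurred and the caller may decide to retry.
func checkStreams(stdout, stderr, sentinel string) (string, string, error) {
	cleanOut, err := stripSentinel(stdout, sentinel)
	if err != nil {
		return stdout, stderr, fmt.Errorf("stdout: %w", err)
	}
	cleanErr, err := stripSentinel(stderr, sentinel)
	if err != nil {
		return stdout, stderr, fmt.Errorf("stderr: %w", err)
	}
	return cleanOut, cleanErr, nil
}
```

Because the script prints a newline before the sentinel, the sentinel always sits on its own line even when the command's output does not end with one, and stripping the `"\n" + sentinel + "\n"` suffix restores the original output byte for byte. A caller can test for truncation with `errors.Is(err, errIncomplete)` before deciding whether to retry.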

Additionally, we want to put the transition from the SPDYExecutor to the WebSocketExecutor on our radar. The WebSocketExecutor is described in KEP-4006. Some work has landed in v0.30.0 and v0.29.4 of k8s.io/client-go.

learnitall, Apr 22 '24 21:04