Argoworkflow failed to kill linkerd proxy container
What problem are you trying to solve?
- We are using argoworkflow with linkerd-proxy.
- Information about how argoworkflow kills sidecar: https://argoproj.github.io/argo-workflows/sidecar-injection/#how-we-kill-sidecars-using-kubectl-exec
- We tried updating linkerd from version 2.10.x to the latest version, 2.11.x.
- Since version 2.11.x uses a distroless base image that has neither /bin/sh nor the kill command, argoworkflow's wait container fails to kill linkerd-proxy, and the workflow step is marked as failed because linkerd-proxy is still running.
How should the problem be solved?
- It would be nice if there were a command available within the linkerd-proxy container to kill or shut down itself, like "linkerd-shutdown --graceful"
- Similar to how istio provides the pilot agent request command: https://istio.io/latest/docs/reference/commands/pilot-agent/#pilot-agent-request
- The kill command annotation that argoworkflow uses for sidecars can be set as a default workflow value, so this would be nice to have.
Any alternatives you've considered?
- Using linkerd-await in the application container, but the application docker image should not care whether it is using linkerd or not; it should be agnostic.
- Using an argoworkflow container set, with a curl container at the end that calls the linkerd-proxy shutdown path. We don't like this because it forces all workflows to use container sets, and some developers may forget to add it.
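For illustration, the container set alternative could look roughly like the sketch below. It assumes the proxy's admin server listens on its default port 4191 and accepts a POST to /shutdown from inside the pod. The workflow name, container names, and images are hypothetical placeholders, and behavior when the main container fails has not been verified here.

```yaml
# Sketch only: names and images are placeholders, not taken from this issue.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: linkerd-containerset-
spec:
  entrypoint: main
  podMetadata:
    annotations:
      linkerd.io/inject: enabled
  templates:
    - name: main
      containerSet:
        containers:
          # The actual workload (placeholder image and command).
          - name: main
            image: busybox
            command: ["echo", "job done"]
          # Runs after "main" and asks linkerd-proxy to shut down.
          # Assumes the default admin port 4191 on localhost.
          - name: shutdown-proxy
            image: curlimages/curl
            command: ["curl", "-X", "POST", "http://localhost:4191/shutdown"]
            dependencies:
              - main
```

Every workflow template would need the extra shutdown-proxy container, which is the maintenance burden described above.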
How would users interact with this feature?
As simple as running linkerd-shutdown --args <>
Would you like to work on this feature?
No response
Sending a SIGTERM to a linkerd-proxy sidecar should trigger a graceful shutdown process where the proxy begins refusing all new connections, and waits for any currently-open connections to close prior to terminating.
However, in the current 2.11.4 release, the proxy will wait indefinitely for all open connections to close. In some cases, it seems that conntrack can occasionally lose track of half-closed TCP connections, and the proxy may not notice when they close (see https://github.com/linkerd/linkerd2/issues/8033#issuecomment-1122977190). This results in issues where the proxy fails to terminate.
To resolve this, we've added a timeout for the maximum grace period the proxy itself will wait for connections to close gracefully (see #8923). By default, the proxy will now wait for connections to close gracefully for a maximum of two minutes, and will then shut down regardless of any remaining "open" TCP connections. The graceful shutdown timeout will be available in the next edge release, and you'll be able to override the default value of two minutes with an annotation like this:
annotations:
  config.linkerd.io/shutdown-grace-period: "30s"
The shutdown grace period timeout should ensure that your linkerd-proxy containers always terminate in a timely manner after argoworkflow sends them SIGTERMs.
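As a rough sketch of how this might be applied from the argoworkflow side, the annotation could be placed on the workflow's pod metadata so that every pod in the workflow picks it up. The workflow name, image, and command below are hypothetical placeholders, and this assumes a linkerd version that supports the annotation.

```yaml
# Sketch only: workflow name, image, and command are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: linkerd-grace-period-
spec:
  entrypoint: main
  # podMetadata is applied to every pod the workflow creates.
  podMetadata:
    annotations:
      linkerd.io/inject: enabled
      # Proxy gives up waiting for open connections after 30 seconds.
      config.linkerd.io/shutdown-grace-period: "30s"
  templates:
    - name: main
      container:
        image: busybox
        command: ["echo", "hello"]
```

A shorter grace period lets workflow steps finish sooner, at the cost of cutting off connections that are still draining when the timeout expires.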
Hope that's helpful!
Okay, I will try this once the version is available. Can we keep this issue open in the meantime?
@kaiyuanlim the shutdown-grace-period annotation is available in edge-22.7.2 and later. It will be available on stable in the next stable release.
Thank you