sidecar nop image replacement cannot seem to exit with code 0
Expected Behavior
We expect our sidecar to exit with 0 when Tekton terminates it by replacing its image with the default nop image.
Our sidecar runs the following code.
if [ -e <path to our custom binary> ]
then
  <run the custom binary>
  sleep infinity
fi
The path to our custom binary exists only in the original image, not in the nop image, so when the nop image runs we expect the if condition to be false and the script to exit with 0.
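For clarity, the expected behavior can be sketched as a runnable version of the script above. The real binary path is private, so the hypothetical path /nonexistent/custom-binary stands in for it; since it does not exist (as in the nop image), the if-branch is skipped and the script falls through and exits 0.

```shell
# Hedged sketch of the sidecar logic; /nonexistent/custom-binary is a
# stand-in for the private binary path referenced in the issue.
CUSTOM_BINARY="/nonexistent/custom-binary"
if [ -e "$CUSTOM_BINARY" ]
then
  # Only reached in the original image, where the binary exists.
  "$CUSTOM_BINARY"
  sleep infinity
fi
# In the nop image the branch is skipped and we fall through here.
echo "binary absent; exiting 0"
```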
Actual Behavior
It exits with 1 and prints the following error message.
standard_init_linux.go:228: exec user process caused: no such file or directory
Steps to Reproduce the Problem
N/A. Sorry, the original sidecar image and nop image are not public. However, our nop image should be the same as the default Tekton nop image.
Additional Info
- Kubernetes version (output of kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.1", GitCommit:"206bcadf021e76c27513500ca24182692aabd17e", GitTreeState:"clean", BuildDate:"2020-09-09T19:10:58Z", GoVersion:"go1.15.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.9-eks-0d102a7", GitCommit:"eb09fc479c1b2bfcc35c47416efb36f1b9052d58", GitTreeState:"clean", BuildDate:"2022-02-17T16:36:28Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
- Tekton Pipeline version:
v0.30.0
@riceluxs1t thanks for the issue. This is by design: the current way we "handle" stopping sidecars in a Task is to replace the initial sidecar image with a nop image that has literally nothing in it, so that it just "dies".
Is the code running in the sidecar specified as a script in the Task, or is it baked into the image?
If it is using script, it probably exits with 1 because /bin/sh is not available in the nop image. If you switch the nop image to one that has /bin/sh, it would probably work: it would still execute the content of script, but would exit with 0 per the script's own logic (because the binary doesn't exist).
You could mutate https://storage.googleapis.com/tekton-releases/pipeline/previous/v0.35.1/release.yaml with your own nop image to try that out, replacing "-nop-image", "gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/nop:v0.35.1@sha256:d45b4397536ee390078fd4da1074dd4cae76b6e418786c73f9aed43b9bc66545" with "-nop-image", "docker.io/library/bash:latest"
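That mutation can be sketched with sed. This is a hedged illustration, not the exact workflow: the heredoc stands in for the relevant fragment of the downloaded release.yaml, and the image reference and digest are the ones quoted above.

```shell
# Write a stand-in fragment of release.yaml containing the -nop-image arg.
cat > /tmp/release-fragment.yaml <<'EOF'
        - "-nop-image"
        - "gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/nop:v0.35.1@sha256:d45b4397536ee390078fd4da1074dd4cae76b6e418786c73f9aed43b9bc66545"
EOF

# Swap the default nop image for one that ships /bin/sh.
sed -i 's|gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/nop:v0.35.1@sha256:[0-9a-f]*|docker.io/library/bash:latest|' /tmp/release-fragment.yaml

cat /tmp/release-fragment.yaml
```

After applying the modified manifest, sidecars using script would tear down against an image that can still run the shebang line.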
@vdemeester Thanks for this explanation! Would it help if in this case the nop image had some basic command at /bin/sh that just exited regardless of args, the way the default entrypoint does? We could also base nop on an image that contains a real shell, but that seems like a lot of surface to open up just so it can cleanly exit when invoked using script-mode.
It would in the case of a script with the default shebang 🙃 but as the user can define any shebang in their scripts, we would have to cover a bunch of them at least to be sure: /bin/sh, /usr/bin/bash, /bin/bash, /usr/bin/env, … Or we could try to handle all of them "dynamically" with something similar to https://github.com/mic92/envfs 🙃 (meaning that anything would be dynamically linked to our own /bin/sh stand-in that always does exit 0).
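The "stub at every common shebang path" idea can be sketched as follows. This is only an illustration of the mechanism under stated assumptions: in a real nop image the stub would be a static binary and the paths would sit at the image root, while here a tiny shell script stands in for the stub and a temp dir ($PREFIX) stands in for the root filesystem.

```shell
# Create a sandbox standing in for the nop image's root filesystem.
PREFIX="$(mktemp -d)"

# The stub: ignores all arguments and always exits 0.
# (In a real nop image this would be a static binary, not a script.)
printf '#!/bin/sh\nexit 0\n' > "$PREFIX/nop"
chmod +x "$PREFIX/nop"

# Link it at the common interpreter paths listed above.
for p in bin/sh bin/bash usr/bin/bash usr/bin/env; do
  mkdir -p "$PREFIX/$(dirname "$p")"
  ln -s "$PREFIX/nop" "$PREFIX/$p"
done

# Any "interpreter" now exits 0 no matter what it is asked to run.
"$PREFIX/bin/bash" -c 'echo this is never printed'
echo "stub exit code: $?"
```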
Oh my. This got complicated quickly! 😅
I'd be fine with providing common shebangs in the image. Sidecars that exit 1 upon teardown shouldn't fail the TaskRun anyway right? This is just so the underlying Pods don't look like they've failed, as I understand it.
Yeah, that would work 👍🏼 we just need to list the X most used ones and iterate on it over time, I think. We may also want to dig into using the entrypoint and a file (via the downward API?) to signal to the sidecar that the TaskRun is done, to stop the process more cleanly... (but I think we already considered that, and there may have been a good reason not to do it?)
The image OP is using above does have a /bin/sh, so I don't believe that is the issue they were seeing. I think there is still merit to the idea we explored a while ago of attempting a graceful shutdown via a kill signal, and only after some timeout (e.g. if the container doesn't honor the signal) falling back to the nop replace; that should generally solve this case.
As it stands, it can lead to confusing pod statuses where the sidecar fails even if the Tekton status is correct.
I didn't think we were wrapping sidecar invocations in the entrypointer at all, but if we are (or if we decide to) then signaling via the downward API sounds simplest to me. I don't think we'd even need to send signals to the sidecar process, we can just have the entrypoint process exit.
This is true, we are not wrapping it, just relying on the image swap. Using the entrypointer is actually a keen idea, and I think it means we could avoid the nop image swap altogether (which indeed wouldn't work here anyway, since the nop image wouldn't have a shell for the entrypointer to execute).
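The downward-API signalling idea discussed above can be sketched like this. It is a hedged illustration only: in a real TaskRun the wrapper would watch a projected downward-API file inside the Pod, whereas here a temp dir stands in for the volume and a background touch stands in for the controller updating the annotation.

```shell
# Stand-in for the projected downward-API file path.
SIGNAL_FILE="$(mktemp -d)/sidecar-stop"

# Stand-in for the controller marking the TaskRun done.
( sleep 1; touch "$SIGNAL_FILE" ) &

# The wrapper polls for the file and exits cleanly once it appears,
# with no image swap and no signal sent to the sidecar process.
while [ ! -f "$SIGNAL_FILE" ]; do
  sleep 0.2
done
echo "stop file appeared; wrapper exits 0"
```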
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle rotten
Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/close
Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification. /close
Send feedback to tektoncd/plumbing.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.