pipeline icon indicating copy to clipboard operation
pipeline copied to clipboard

sidecar nop image replacement cannot seem to exit with code 0

Open riceluxs1t opened this issue 3 years ago • 10 comments

Expected Behavior

We expect our sidecar to exit with 0 when Tekton terminates it by replacing its image to be the default nop image.

Our sidecar runs the following code.

if [ -e <path to our custom binary> ]
then
  <run the custom binary>
  sleep infinity
fi

The path to our custom binary only exists in the original image and it does not exist in the nop image. and when the nop image runs, we expect the if condition to be false and exit with 0.

Actual Behavior

It exits with 1 and prints the following error message.

standard_init_linux.go:228: exec user process caused: no such file or directory

Steps to Reproduce the Problem

N/A. Sorry, the original sidecar image and nop image are not public. However, our nop image should be the same as the default Tekton nop image.

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.1", GitCommit:"206bcadf021e76c27513500ca24182692aabd17e", GitTreeState:"clean", BuildDate:"2020-09-09T19:10:58Z", GoVersion:"go1.15.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.9-eks-0d102a7", GitCommit:"eb09fc479c1b2bfcc35c47416efb36f1b9052d58", GitTreeState:"clean", BuildDate:"2022-02-17T16:36:28Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"} 
  • Tekton Pipeline version: v0.30.0

riceluxs1t avatar May 19 '22 00:05 riceluxs1t

@riceluxs1t thanks for the issue. This is by design, the current way we "handle" stopping sidecars in a Task is to replace the initial sidecar image with a nop image that should have litteraly nothing in, so that it just "dies".

Is the "code running" in the sidecar specified as a script or is it a script in the image ? If it is using script, it probably exits with 1 because /bin/sh is not available in the nop image. If you switch the nop image with one that has /bin/sh, it would probably work as it would still execute the content of script but would exit with 0 as per the script content (because the binary doesn't exist).

you could mutate https://storage.googleapis.com/tekton-releases/pipeline/previous/v0.35.1/release.yaml with you own nop image to try that out. Replacing "-nop-image", "gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/nop:v0.35.1@sha256:d45b4397536ee390078fd4da1074dd4cae76b6e418786c73f9aed43b9bc66545" with "-nop-image", "docker.io/library/bash:latest"

vdemeester avatar May 19 '22 11:05 vdemeester

@vdemeester Thanks for this explanation! Would it help if in this case the nop image had some basic command at /bin/sh that just exited regardless of args, the way the default entrypoint does? We could also base nop on an image that contains a real shell, but that seems like a lot of surface to open up just so it can cleanly exit when invoked using script-mode.

imjasonh avatar May 19 '22 13:05 imjasonh

@vdemeester Thanks for this explanation! Would it help if in this case the nop image had some basic command at /bin/sh that just exited regardless of args, the way the default entrypoint does? We could also base nop on an image that contains a real shell, but that seems like a lot of surface to open up just so it can cleanly exit when invoked using script-mode.

It would it the case of a script with the default shebang 🙃 but as the user can define any shebang in the scripts, we would have to cover a bunch of them at least to be sure /bin/sh, /usr/bin/bash, /bin/bash, /usr/bin/env, … or we could try to "dynamically" handle all with something similar to https://github.com/mic92/envfs 🙃 (meaning that it dynamically links anything to our own /bin/sh thingy that does exit 0 all the time).

vdemeester avatar May 19 '22 16:05 vdemeester

Oh my. This got complicated quickly! 😅

I'd be fine with providing common shebangs in the image. Sidecars that exit 1 upon teardown shouldn't fail the TaskRun anyway right? This is just so the underlying Pods don't look like they've failed, as I understand it.

imjasonh avatar May 19 '22 16:05 imjasonh

I'd be fine with providing common shebangs in the image. Sidecars that exit 1 upon teardown shouldn't fail the TaskRun anyway right? This is just so the underlying Pods don't look like they've failed, as I understand it.

Yeah that would work 👍🏼 we just need to list the X most used ones and iterate over it as time goes I think. We may also want to dig into using the entrypoint and a file (using downward api ?) to signal that the taskrun is done to stop the process more cleanly.. (but I think we already thought of that.. and there might have been good reason not to do it ?)

vdemeester avatar May 19 '22 16:05 vdemeester

This image OP is using above does have a /bin/sh so I don't believe that is the issue they we seeing. I think there is still merit to the idea we explored a while ago to attempting to gracefully shutdown via a kill signal and after some timeout (either if the container doesn't have a kill or what) we do the nop replace and that should generally solve this case.

As it stands, it can lead to confusing pod statuses where the sidecar fails even if the Tekton status is correct.

pierretasci avatar May 19 '22 16:05 pierretasci

I didn't think we were wrapping sidecar invocations in the entrypointer at all, but if we are (or if we decide to) then signaling via the downward API sounds simplest to me. I don't think we'd even need to send signals to the sidecar process, we can just have the entrypoint process exit.

imjasonh avatar May 19 '22 17:05 imjasonh

This is true, we are not wrapping it, just relying on the image swap. Using the entrypointer is actually a keen idea and I think it means that we can avoid the nop image swap altogether (and indeed wouldn't work since the nop wouldn't have a shell for the entrypointer to execute in anyways).

pierretasci avatar May 19 '22 18:05 pierretasci

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Aug 17 '22 19:08 tekton-robot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot avatar Sep 16 '22 19:09 tekton-robot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot avatar Oct 16 '22 19:10 tekton-robot

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Oct 16 '22 19:10 tekton-robot