
With Tekton, many dead pods can accumulate on a node, which can cause a ContainerGCFailed error in Kubernetes

Open oldthreefeng opened this issue 3 years ago • 2 comments

Expected Behavior

Tekton runs normally.

Actual Behavior

After we have run Tekton for a while, the node sometimes becomes NotReady because of the error "grpc: trying to send message larger than max":

Events:
  Type     Reason             Age                      From     Message
  ----     ------             ----                     ----     -------
  Warning  ContainerGCFailed  3m32s (x8567 over 6d3h)  kubelet  rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16798538 vs. 16777216)

We checked the pods on the node and found many dead containers there, which appears to be the cause of the problem.
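To make the numbers in the event concrete, here is a minimal sketch of the check described above. The two byte counts are copied from the event message; 16777216 is exactly 16 MiB. The function name count_exited_containers is hypothetical, and the sketch assumes crictl is installed on the node and is run there with sufficient privileges.

```shell
# gRPC message limit reported in the event: 16 MiB.
max_bytes=$((16 * 1024 * 1024))
# Message size reported in the event above.
sent_bytes=16798538
echo "over limit by $((sent_bytes - max_bytes)) bytes"

# Hypothetical helper: print how many exited containers the runtime still tracks.
# Assumes crictl is available on the node.
count_exited_containers() {
  crictl ps -a -s Exited -q | wc -l | tr -d ' '
}
```

Calling count_exited_containers on an affected node gives a rough idea of how many dead containers have piled up; a very large count is consistent with the oversized message in the event.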

Steps to Reproduce the Problem

  1. Deploy Tekton.
  2. Use it for CI/CD for many days.
  3. The problem appears.

Additional Info

  • Kubernetes version:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.8-aliyun.1", GitCommit:"27f24d2", GitTreeState:"", BuildDate:"2021-08-19T10:00:16Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
$ kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'
v0.25.0

A temporary workaround for the problem:

https://github.com/kubernetes/kubernetes/issues/63858#issuecomment-1234153901

I think this should be added to your docs. Thanks, Tekton team.

oldthreefeng commented on Sep 01 '22 11:09

@oldthreefeng Is this happening for pods created by a TaskRun/PipelineRun? Or is it happening at the operator level?

piyush-garg commented on Sep 08 '22 12:09

@piyush-garg TaskRun/PipelineRun. I don't think this is an operator issue. :)

oldthreefeng commented on Sep 20 '22 02:09

It's not clear why the dead containers aren't garbage collected, but it does seem that Tekton containers can build up fast and cause the ContainerGCFailed problem.

For OpenShift we run this in a CronJob to clean up the Tekton helper containers...

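# For each build node, open a debug pod and remove the exited Tekton helper containers.
oc get nodes -o=name --no-headers -l node-role.kubernetes.io/build | while read -r host _; do
  # chroot into the host filesystem so the node's own crictl is used.
  oc debug "$host" -- chroot /host sh -c \
    "crictl ps -a -s Exited --name '^(place-tools|place-scripts)$' -q | xargs -r crictl rm"
done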

ryanotella commented on Dec 19 '22 05:12

Ideally Pods will be removed when the object that owns them (TaskRun, …) is removed as well.
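As a rough way to see which pods are candidates for that kind of cleanup, the sketch below lists finished pods that carry the tekton.dev/taskRun label. It assumes kubectl access to the cluster and that the pods were created by TaskRuns with that label; the function name list_finished_tekton_pods is hypothetical. It only lists pods and does not delete anything.

```shell
# Hypothetical helper: list finished pods created by TaskRuns, as "namespace name" pairs.
# Assumes the pods carry the tekton.dev/taskRun label.
list_finished_tekton_pods() {
  for phase in Succeeded Failed; do
    kubectl get pods --all-namespaces \
      -l tekton.dev/taskRun \
      --field-selector "status.phase=${phase}" \
      -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name \
      --no-headers
  done
}
```

Deleting the TaskRun or PipelineRun that owns a listed pod should remove the pod as well, which in turn lets the node's dead containers be cleaned up.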

vdemeester commented on Dec 19 '22 09:12

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented on Mar 19 '23 09:03

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot commented on Apr 18 '23 09:04

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot commented on May 18 '23 10:05

@tekton-robot: Closing this issue.

tekton-robot commented on May 18 '23 10:05