test-infra

Hundreds of finished test pods in Terminating state

Open sdif opened this issue 2 years ago • 6 comments

What happened:

We noticed that hundreds of pods created by Prow jobs that have completed (successfully or not) stay stuck in the Terminating state until Sinker kicks in to remove them:

kgp | grep Terminating | wc -l
     826

The oldest one is 47h old. Here is our sinker configuration:

sinker:
  resync_period: 1m
  max_prowjob_age: 48h
  max_pod_age: 48h
  terminated_pod_ttl: 30m
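
To make the two age thresholds in this config concrete, here is a rough model of how they could interact. It is a simplified sketch, not sinker's actual implementation: the function name `cleanup_reason` and the label "aged" are illustrative assumptions, while "ttled" matches the reason that appears in the sinker logs posted later in this thread.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the sinker config above.
MAX_POD_AGE = timedelta(hours=48)
TERMINATED_POD_TTL = timedelta(minutes=30)


def cleanup_reason(created, finished, now):
    """Return a label for why a pod would be cleaned up, or None to keep it.

    created:  when the pod was created
    finished: when the pod completed, or None if it has not finished
    now:      the current time
    """
    # A finished pod becomes eligible once it has been done for longer
    # than terminated_pod_ttl.
    if finished is not None and now - finished > TERMINATED_POD_TTL:
        return "ttled"
    # Any pod becomes eligible once it is older than max_pod_age.
    # "aged" is an assumed label used only for this sketch.
    if now - created > MAX_POD_AGE:
        return "aged"
    return None


now = datetime(2022, 4, 12, 10, 0, tzinfo=timezone.utc)
# A pod that finished 45 minutes ago is past the 30m TTL.
print(cleanup_reason(now - timedelta(hours=1), now - timedelta(minutes=45), now))
```

Under this model, a completed pod would be expected to disappear roughly 30 minutes after finishing, which is why pods lingering for up to 47h look surprising.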

We did not notice any service degradation or other issue related to the number of pods, but we found this behavior really strange. It is not described anywhere in the documentation, nor were we able to find an existing issue similar to this one.

Also, the jobs and images are different on each pod, so it doesn't seem to be related to a particular job or image.

Prow version: v20220206-e1cc2403ac
Kubernetes version: v1.21.6-gke.1503

What you expected to happen:

We were expecting completed pods to be in the Terminated state, as the note from the Kubernetes documentation says:

Note: When a Pod is being deleted, it is shown as Terminating by some kubectl commands. This Terminating status is not one of the Pod phases. A Pod is granted a term to terminate gracefully, which defaults to 30 seconds. You can use the flag --force to terminate a Pod by force.
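
Since Terminating is not a pod phase, one way to see what is going on is to look at the fields directly: kubectl shows Terminating for a pod whose `metadata.deletionTimestamp` is set, while `status.phase` still holds the real phase. The sketch below assumes the input is the parsed output of `kubectl get pods -o json`; the helper name `terminating_pods` and the sample data are made up for illustration.

```python
import json


def terminating_pods(pod_list):
    """Return name and phase for every pod that has a deletionTimestamp.

    pod_list is the parsed JSON document from `kubectl get pods -o json`.
    """
    result = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # A set deletionTimestamp is what kubectl renders as "Terminating".
        if meta.get("deletionTimestamp"):
            result.append({
                "name": meta.get("name"),
                "phase": pod.get("status", {}).get("phase"),
                "deletionTimestamp": meta["deletionTimestamp"],
            })
    return result


# Hypothetical sample data, shaped like kubectl's JSON output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "pod-a", "deletionTimestamp": "2022-04-12T08:00:00Z"},
   "status": {"phase": "Succeeded"}},
  {"metadata": {"name": "pod-b"},
   "status": {"phase": "Running"}}
]}
""")

for pod in terminating_pods(sample):
    print(pod["name"], pod["phase"], pod["deletionTimestamp"])
```

A pod listed here with phase Succeeded or Failed has finished its work and is only waiting for the delete to complete.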

Is this normal, or is it a bug or a misconfiguration?

How to reproduce it (as minimally and precisely as possible):

Run any job and check the state of the resulting pod.

Please provide links to example occurrences, if any:

n/a

Anything else we need to know?:

n/a

Thanks in advance

sdif • Apr 12 '22 10:04

Can you post the yaml of such a pod, please?

alvaroaleman • Apr 13 '22 13:04

Hi @alvaroaleman, here is the link to the pod yaml (pw is: HS2022): https://privatebin.net/?a68885cceb4e8009#3cAA25Bp7NaTi5mDQFuwFqXxkgh2NZuQE17iQJQ9NZ21

FYI, we tried to:

  • Force delete the pod, without success. The command hangs and the pod is still Terminating:
k delete pod cb76e4b4-bad6-11ec-a69a-f20a8b4f4d2e --force --grace-period=0
  • Delete the LimitRanger annotation (suggested by GCP Support with this source) and then force delete the pod, also without success
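
When a force delete leaves the object in place, one common cause in Kubernetes is a non-empty `metadata.finalizers` list, because the API server keeps an object until its finalizers are cleared. That is not confirmed for this pod; the sketch below only shows how one might check. It assumes the input is the parsed output of `kubectl get pod <name> -o json`, and both the helper name `blocking_finalizers` and the finalizer in the sample are hypothetical.

```python
def blocking_finalizers(pod):
    """Return the finalizers still present on a pod that is being deleted.

    Returns an empty list if the pod is not being deleted or has no
    finalizers, in which case something else is holding the delete up.
    """
    meta = pod.get("metadata", {})
    if not meta.get("deletionTimestamp"):
        return []
    return list(meta.get("finalizers") or [])


# Hypothetical sample pod; the finalizer name is invented for illustration.
sample_pod = {
    "metadata": {
        "name": "cb76e4b4-bad6-11ec-a69a-f20a8b4f4d2e",
        "deletionTimestamp": "2022-04-14T09:00:00Z",
        "finalizers": ["example.com/hypothetical-finalizer"],
    }
}

print(blocking_finalizers(sample_pod))
```

An empty result for a stuck pod would point away from finalizers and toward the node or kubelet side instead.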

sdif • Apr 14 '22 10:04

@sdif that is very odd. And with sinker it works? Sinker also just does a delete: https://github.com/kubernetes/test-infra/blob/5ed7f971e5e68db90d749f9acd2f40e012792932/prow/cmd/sinker/main.go#L484

alvaroaleman • Apr 14 '22 12:04

It seems so: there is no pod older than 48h, which matches our sinker configuration, and sinker appears to be working fine:

{"cluster":"default","component":"sinker","file":"prow/cmd/sinker/main.go:480","func":"main.(*controller).deletePod","level":"info","msg":"Deleted old completed pod.","pj":"5ca61bed-bbc5-11ec-a0b7-5a1fbb201c8c","pod":"5ca61bed-bbc5-11ec-a0b7-5a1fbb201c8c","reason":"ttled","severity":"info","time":"2022-04-15T12:06:27Z"}
{"cluster":"default","component":"sinker","file":"prow/cmd/sinker/main.go:480","func":"main.(*controller).deletePod","level":"info","msg":"Deleted old completed pod.","pj":"dffe85db-bb6b-11ec-ade7-76b0cdd0d8ee","pod":"dffe85db-bb6b-11ec-ade7-76b0cdd0d8ee","reason":"ttled","severity":"info","time":"2022-04-15T12:06:28Z"}
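
To see how sinker's deletions break down over a longer log window, the entries can be tallied by their `reason` field. This is a minimal sketch that assumes one JSON object per line with the same fields as the two entries above; the helper name `deletion_reasons` is made up for illustration.

```python
import json
from collections import Counter


def deletion_reasons(log_lines):
    """Count sinker pod deletions per reason from JSON-formatted log lines."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip anything that is not a JSON log entry.
            continue
        if (entry.get("component") == "sinker"
                and entry.get("msg") == "Deleted old completed pod."):
            counts[entry.get("reason", "unknown")] += 1
    return counts


# Shortened versions of the two log entries above.
sample_logs = [
    '{"component":"sinker","msg":"Deleted old completed pod.",'
    '"pod":"5ca61bed-bbc5-11ec-a0b7-5a1fbb201c8c","reason":"ttled"}',
    '{"component":"sinker","msg":"Deleted old completed pod.",'
    '"pod":"dffe85db-bb6b-11ec-ade7-76b0cdd0d8ee","reason":"ttled"}',
]

print(deletion_reasons(sample_logs))
```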

sdif • Apr 15 '22 12:04

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot • Jul 18 '22 03:07

/lifecycle rotten

k8s-triage-robot • Aug 17 '22 04:08

/close

k8s-triage-robot • Sep 16 '22 04:09

@k8s-triage-robot: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot • Sep 16 '22 04:09