
TaskRun stays Running when pod goes to `FailedMount`

Open RafaeLeal opened this issue 2 years ago • 5 comments

Expected Behavior

When a TaskRun can't correctly mount the volume, it should fail.

Actual Behavior

The TaskRun remains "Running" with the message Pending until it eventually times out. The events with the actual error only show up at the pod level.
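
For reference, while the pod is stuck the TaskRun's Succeeded condition looks roughly like this (values are illustrative; the message is whatever the pod status reports, as in the findings further down):

```yaml
# Illustrative TaskRun status while the pod cannot mount the volume:
# the condition stays Unknown/Pending instead of degrading to False.
status:
  conditions:
    - type: Succeeded
      status: "Unknown"
      reason: Pending
      message: 'pod status "Initialized":"False"; message: "containers with incomplete status: [prepare place-scripts]"'
```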

Steps to Reproduce the Problem

  1. Create a TaskRun with a nonexistent ConfigMap or Secret mount (see the sketch below)
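
A minimal sketch of such a TaskRun, assuming a Secret named `missing-secret` that does not exist in the namespace (all names here are placeholders):

```yaml
# Reproduction sketch: the volume references a nonexistent Secret, so the pod
# stays in ContainerCreating and emits FailedMount events, while the TaskRun
# keeps reporting Running/Pending.
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: failedmount-repro
spec:
  taskSpec:
    steps:
      - name: echo
        image: alpine
        script: |
          echo "this step never starts"
        volumeMounts:
          - name: creds
            mountPath: /creds
    volumes:
      - name: creds
        secret:
          secretName: missing-secret  # intentionally nonexistent
```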

Additional Info

Very similar issues:

  • https://github.com/tektoncd/pipeline/issues/4895
  • https://github.com/tektoncd/pipeline/issues/4890

RafaeLeal avatar Jul 11 '22 14:07 RafaeLeal

/assign

JeromeJu avatar Jul 19 '22 19:07 JeromeJu

This comment summarizes the findings from reproducing the FailedMount scenario for a TaskRun and suggests closing this issue.

  • Pod-level status:
      pod status "Initialized":"False"; message: "containers with incomplete status: [prepare place-scripts]"
  • Container state in the pod status output:
    • FailedMount:
        state:
          waiting:
            reason: ContainerCreating
    • ImagePullBackOff:
        state:
          waiting:
            message: Back-off pulling image "does-not-exist:latest"
            reason: ImagePullBackOff

Compared with #4895, where the ImagePullBackOff error surfaces in the container's waiting state and can be captured, there is no equivalent signal here: from the pod status alone we cannot tell whether the volume is still being mounted or the TaskRun should fail. Thus, this shall be closed for now.

Thanks for the guidance from @dibyom!

JeromeJu avatar Jul 22 '22 18:07 JeromeJu

Maybe we should consider having a timeout specific to the "ContainerCreating" status? Or maybe we can check whether there are any ongoing discussions about this in Kubernetes itself. I just don't feel this issue should be closed just because we don't have this info in the pod's state. There are still alternatives to consider, I think.

RafaeLeal avatar Jul 30 '22 23:07 RafaeLeal

A timeout specifically for the containerCreating status sounds too low level for most CI/CD use cases.

I get that the current situation isn't ideal but I think the best course right now is to have this fixed in Kubernetes (see https://github.com/kubernetes/kubernetes/issues/88193) and have this information surfaced somewhere in the Pod's status.

If you have a specific alternative to consider, let us know!

dibyom avatar Aug 03 '22 21:08 dibyom

A timeout specifically for the containerCreating status sounds too low level for most CI/CD use cases.

I'm not sure what you mean by "too low level". I understand that maybe it's not something you want to customize at the Task/TaskRun level, but maybe at the controller level it would be fine. Tekton is very "low-level" in that sense, I think.

I get that the current situation isn't ideal but I think the best course right now is to have this fixed in Kubernetes (see https://github.com/kubernetes/kubernetes/issues/88193) and have this information surfaced somewhere in the Pod's status.

It's good to know that Kubernetes has an issue similar to this, but it's not clear to me whether they will add this information to the state.waiting fields.

If you have a specific alternative to consider, let us know!

Can't we look at the events directly then? I think that if we had a watcher over events we could use it for more than one scenario. Alternatively, we could only call the Events API when the controller notices the task is stuck on ContainerCreating for too long.
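
For illustration, the pod-level event such a watcher (or a targeted Events API query) would need to match looks roughly like this; the object names follow the reproduction sketch above, and the exact message wording depends on the kubelet:

```yaml
# Illustrative pod event for a missing Secret volume. A consumer could key off
# involvedObject and reason rather than parsing the free-form message.
apiVersion: v1
kind: Event
type: Warning
reason: FailedMount
involvedObject:
  kind: Pod
  name: failedmount-repro-pod  # placeholder pod name
message: 'MountVolume.SetUp failed for volume "creds" : secret "missing-secret" not found'
```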

RafaeLeal avatar Aug 11 '22 02:08 RafaeLeal

I'm not sure what you mean by "too low level". I understand that maybe it's not something you want to customize at the Task/TaskRun level, but maybe at the controller level it would be fine. Tekton is very "low-level" in that sense, I think.

A controller-level timeout would make more sense, but I think it would be hard to come up with a timeout value here that works for all runs in a cluster.

Can't we look at the events directly then? I think that if we had a watcher over events we could use it for more than one scenario. Alternatively, we could only call the Events API when the controller notices the task is stuck on ContainerCreating for too long.

This might work - but I'm not sure if we can rely on it all the time: From https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/:

"Event consumers should not rely on the timing of an event with a given Reason reflecting a consistent underlying trigger, or the continued existence of events with that Reason. Events should be treated as informative, best-effort, supplemental data."

dibyom avatar Aug 30 '22 20:08 dibyom

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Nov 28 '22 21:11 tekton-robot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot avatar Dec 28 '22 21:12 tekton-robot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot avatar Jan 27 '23 21:01 tekton-robot

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Jan 27 '23 21:01 tekton-robot