TaskRun stays Running when pod goes to `FailedMount`
Expected Behavior
When a TaskRun's pod can't mount one of its volumes, the TaskRun should fail.
Actual Behavior
The TaskRun remains "Running" with the message "Pending" until it eventually times out. The events that carry the actual error only exist at the pod level.
Steps to Reproduce the Problem
- Create a TaskRun that mounts a nonexistent ConfigMap or Secret (see the sketch below)
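For illustration, a minimal reproduction might look like the following sketch, assuming a v1beta1 TaskRun with an inline taskSpec; the resource names (`failed-mount-repro`, `does-not-exist`) are placeholders:

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: failed-mount-repro   # placeholder name
spec:
  taskSpec:
    steps:
      - name: never-runs
        image: alpine
        script: |
          echo "the pod never gets past ContainerCreating"
        volumeMounts:
          - name: missing-config
            mountPath: /etc/missing
    volumes:
      # This ConfigMap is intentionally never created, so the kubelet
      # emits FailedMount events and the pod stays in ContainerCreating.
      - name: missing-config
        configMap:
          name: does-not-exist
```

Applying this leaves the TaskRun "Running"/"Pending" until its timeout, while `kubectl describe pod` shows the FailedMount events.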
Additional Info
Very similar issues:
- https://github.com/tektoncd/pipeline/issues/4895
- https://github.com/tektoncd/pipeline/issues/4890
/assign
This comment summarizes the findings from reproducing the `FailedMount` scenario for a TaskRun and suggests closing this issue.
- Pod-level status: condition "Initialized": "False"; message: "containers with incomplete status: [prepare place-scripts]"
- Container state output:
  - FailedMount:
        state:
          waiting:
            reason: ContainerCreating
  - ImagePullBackOff:
        state:
          waiting:
            message: Back-off pulling image "does-not-exist:latest"
            reason: ImagePullBackOff
Compared with #4895, where the ImagePullBackOff error can be captured from the container's waiting state, there is no appropriate way to resolve this issue: in the FailedMount case the container state only shows ContainerCreating, so we cannot tell whether the volume will eventually mount correctly or whether the run should fail. Thus, this shall be closed for now.
Thanks for the guidance from @dibyom!
Maybe we should consider having a timeout specific to the "ContainerCreating" status? Or maybe we can check whether there are any ongoing discussions about this in Kubernetes itself. I just don't feel this issue should be closed only because the information isn't surfaced in the pod's state; there are still alternatives to consider, I think.
A timeout specifically for the containerCreating status sounds too low level for most CI/CD use cases.
I get that the current situation isn't ideal but I think the best course right now is to have this fixed in Kubernetes (see https://github.com/kubernetes/kubernetes/issues/88193) and have this information surfaced somewhere in the Pod's status.
If you have a specific alternative to consider, let us know!
> A timeout specifically for the containerCreating status sounds too low level for most CI/CD use cases.
I'm not sure what you mean by "too low level". I understand that maybe it's not something you want to customize at the Task/TaskRun level, but maybe at the controller level it would be fine. Tekton is very "low-level" in that sense, I think.
> I get that the current situation isn't ideal but I think the best course right now is to have this fixed in Kubernetes (see https://github.com/kubernetes/kubernetes/issues/88193) and have this information surfaced somewhere in the Pod's status.
It's good to know that Kubernetes has a similar issue, but it's not clear to me whether they will add this information to the `state.waiting` fields.
> If you have a specific alternative to consider, let us know!
Can't we look at the events directly, then? I think that if we had a watcher over events, we could use it for more than one scenario. Alternatively, we could call the Events API only when the controller notices the task is stuck in ContainerCreating for too long (a sketch of what such an event looks like is below).
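For concreteness, this is roughly the pod-level event that such a watcher (or a one-off Events API call) would pick up in the FailedMount scenario. The message follows the usual kubelet wording, and the event/pod names, namespace, and count are placeholders:

```yaml
apiVersion: v1
kind: Event
metadata:
  name: failed-mount-repro-pod.17a2b3c4d5e6f7a8   # placeholder
  namespace: default
type: Warning
reason: FailedMount
involvedObject:
  kind: Pod
  name: failed-mount-repro-pod                    # placeholder pod name
  namespace: default
source:
  component: kubelet
# The reason/message here carry exactly the information that is missing
# from the container's waiting state (which only says ContainerCreating).
message: >-
  MountVolume.SetUp failed for volume "missing-config" :
  configmap "does-not-exist" not found
count: 7
```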
> I'm not sure what you mean by "too low level". I understand that maybe it's not something you want to customize at the Task/TaskRun level, but maybe at the controller level it would be fine. Tekton is very "low-level" in that sense, I think.
A controller-level timeout would make more sense, but I think it would be hard to come up with a timeout value that works for all runs in a cluster.
> Can't we look at the events directly, then? I think that if we had a watcher over events, we could use it for more than one scenario. Alternatively, we could call the Events API only when the controller notices the task is stuck in ContainerCreating for too long.
This might work, but I'm not sure we can rely on it all the time. From https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/:
"Event consumers should not rely on the timing of an event with a given Reason reflecting a consistent underlying trigger, or the continued existence of events with that Reason. Events should be treated as informative, best-effort, supplemental data."
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle rotten
Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/close
Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/close
Send feedback to tektoncd/plumbing.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.