TaskRun fails with recoverable mount error
Expected Behavior
TaskRun's pods should be able to recover from transient mount errors
Actual Behavior
When such an error occurs, the pod enters the CreateContainerConfigError state and the TaskRun is marked as failed.
The pod often recovers afterwards, but by then it's too late.
This behavior was introduced in https://github.com/tektoncd/pipeline/pull/1907
Steps to Reproduce the Problem
Not sure exactly how to reproduce this, but we have a fairly big Tekton cluster and it happens quite often with a volume that uses AWS EFS. What happens is that we notice a pod status like this:
status:
  conditions:
  - ...
  - type: "ContainersReady"
    status: "False"
    lastProbeTime: null
    lastTransitionTime: "2023-05-12T14:00:14Z"
    reason: "ContainersNotReady"
    message: "containers with unready status: [step-checkout]"
  containerStatuses:
  - name: "step-checkout"
    state:
      waiting:
        reason: "CreateContainerConfigError"
        message: "failed to create subPath directory for volumeMount \"ws-dmnjx\" of container \"step-checkout\""
Additional Info
- Kubernetes version (output of kubectl version):
  Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.3-eks-a5565ad", GitCommit:"78c8293d1c65e8a153bf3c03802ab9358c0e1a14", GitTreeState:"clean", BuildDate:"2023-06-16T17:32:40Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
- Tekton Pipeline version (output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'):
  v0.48.0
I can help with the fix... I was considering having a grace period before setting the TaskRun status to error. I'm not sure if we should hard-code this grace period or make it configurable via the Tekton controller's config maps. WDYT? Do we need a TEP for this?
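To make the idea concrete, here is a rough, self-contained sketch of the kind of check I have in mind. This is not actual Tekton controller code; shouldFailTaskRun, recoverableWaitingReasons, and the grace-period value are made-up names for illustration only.

// Sketch: only fail the TaskRun if a container has been stuck in a
// "recoverable" waiting state for longer than a grace period; otherwise
// the reconciler would requeue and re-check later.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// recoverableWaitingReasons lists waiting reasons that are often transient.
// (Assumption: in a real implementation this set might itself be configurable.)
var recoverableWaitingReasons = map[string]bool{
	"CreateContainerConfigError": true,
}

// shouldFailTaskRun returns true only when no container is within the grace
// period for a recoverable waiting reason.
func shouldFailTaskRun(pod *corev1.Pod, now time.Time, gracePeriod time.Duration) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		w := cs.State.Waiting
		if w == nil || !recoverableWaitingReasons[w.Reason] {
			continue
		}
		// ContainerStateWaiting has no timestamp of its own, so approximate
		// "how long has it been waiting" from the pod's start time.
		started := pod.Status.StartTime
		if started == nil || now.Sub(started.Time) < gracePeriod {
			return false // still within the grace period: keep waiting
		}
	}
	return true
}

func main() {
	// A pod that started 30s ago and is waiting with CreateContainerConfigError.
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			StartTime: &metav1.Time{Time: time.Now().Add(-30 * time.Second)},
			ContainerStatuses: []corev1.ContainerStatus{{
				Name: "step-checkout",
				State: corev1.ContainerState{
					Waiting: &corev1.ContainerStateWaiting{
						Reason:  "CreateContainerConfigError",
						Message: "failed to create subPath directory",
					},
				},
			}},
		},
	}
	// With a 2m grace period this prints "false": don't fail the TaskRun yet.
	fmt.Println(shouldFailTaskRun(pod, time.Now(), 2*time.Minute))
}

The grace period here is a plain duration; whether it should be hard-coded or read from the controller's config maps is exactly the open question above.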
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/remove-lifecycle stale
@RafaeLeal I think that can make sense (having a grace period for this). I feel we might not necessarily need a TEP for this. cc @afrittoli
My team has tried to recover from a CreateContainerConfigError, because the TaskRun hasn't really failed: notice there is no completionTime, and the single status.steps item is waiting, not terminated.
TaskRun
status:
  conditions:
  - lastTransitionTime: "2024-03-22T18:09:56Z"
    message: Failed to create pod due to config error
    reason: CreateContainerConfigError
    status: "False"
    type: Succeeded
  startTime: "2024-03-22T18:09:40Z"
  steps:
  - container: step-check-step
    name: check-step
    waiting:
      message: secret "oci-store" not found
      reason: CreateContainerConfigError
In that waiting (but failed) state, we tried to provide the correct configuration to pull the image, but the task never recovered. The pipeline that spawned the task was already in a terminated/failed state, not waiting, so it was not recoverable either.
We also went the other way and waited for the pod to time out while waiting, but the Task never switches to timed out. The PipelineRun, of course, is still failed, and the pod hangs around and is never deleted.
I wonder, @RafaeLeal: you mentioned that the TaskRun fails and the pod recovers, but too late. At that point, is the TaskRun terminated, with a completionTime, or is it still waiting? I wonder if your problem is the same as ours, or if we need to open a separate issue.
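For comparison, a quick way to check would be something like the following (illustrative only; the TaskRun name and namespace are placeholders):
kubectl get taskrun <taskrun-name> -n <namespace> -o jsonpath='{.status.completionTime}{"\n"}{.status.steps[*].waiting.reason}{"\n"}'
An empty completionTime together with a waiting reason is what we see on our side.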
There are a few other similar issues, some closed due to inactivity, but this issue (#6960) seems closest to what my team is seeing.
- https://github.com/tektoncd/pipeline/issues/3897
- https://github.com/tektoncd/pipeline/issues/2268
- https://github.com/tektoncd/pipeline/issues/7573