
TaskRun fails with recoverable mount error

Open RafaeLeal opened this issue 2 years ago • 5 comments

Expected Behavior

TaskRun's pods should be able to recover from transient mount errors

Actual Behavior

When such an error occurs, the pod enters the CreateContainerConfigError state and then the TaskRun fails. Often the pod recovers, but by then it is too late. This behavior was introduced in https://github.com/tektoncd/pipeline/pull/1907

Steps to Reproduce the Problem

Not sure exactly how to reproduce this, but we have a fairly big Tekton cluster and it happens quite often with a volume that uses AWS EFS. What happens is that we notice a pod status like this:

status:
  conditions:
    - ...
    - type: "ContainersReady"
      status: "False"
      lastProbeTime: null
      lastTransitionTime: "2023-05-12T14:00:14Z"
      reason: "ContainersNotReady"
      message: "containers with unready status: [step-checkout]"
  containerStatuses:
    - name: "step-checkout"
      state:
        waiting:
          reason: "CreateContainerConfigError"
          message: "failed to create subPath directory for volumeMount \"ws-dmnjx\" of container \"step-checkout\""

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.3-eks-a5565ad", GitCommit:"78c8293d1c65e8a153bf3c03802ab9358c0e1a14", GitTreeState:"clean", BuildDate:"2023-06-16T17:32:40Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.48.0

RafaeLeal avatar Jul 21 '23 19:07 RafaeLeal

I can help with the fix... I was considering adding a grace period before setting the TaskRun status to error. I'm not sure if we should hard-code this grace period or make it configurable via the Tekton controller's config maps. WDYT? Do we need a TEP for this?
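
To make the configurable variant concrete, it could be a new entry in the config-defaults config map, something like this (the key name and default below are just a strawman, not an existing option):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  # strawman key: how long to tolerate CreateContainerConfigError before failing the TaskRun
  default-container-config-error-grace-period: "30s"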

RafaeLeal avatar Jul 21 '23 19:07 RafaeLeal

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Oct 19 '23 20:10 tekton-robot

/remove-lifecycle stale

@RafaeLeal I think that can make sense (having a grace period for this). I feel we might not necessarily need a TEP for this. cc @afrittoli

vdemeester avatar Oct 23 '23 10:10 vdemeester

My team has tried to recover from a CreateContainerConfigError because the TaskRun hasn't really failed: notice that there is no completionTime, and the single status.steps item is waiting, not terminated.

TaskRun

status:
  conditions:
  - lastTransitionTime: "2024-03-22T18:09:56Z"
    message: Failed to create pod due to config error
    reason: CreateContainerConfigError
    status: "False"
    type: Succeeded
  startTime: "2024-03-22T18:09:40Z"
  steps:
  - container: step-check-step
    name: check-step
    waiting:
      message: secret "oci-store" not found
      reason: CreateContainerConfigError

In that waiting (but failed) state, we tried to provide the correct configuration to pull an image, but the TaskRun never recovered. The PipelineRun tied to it (the one that spawned the TaskRun) was in a terminated/failed state: not waiting, not recoverable.
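
Providing that configuration was essentially recreating the secret the step was asking for while it was still waiting, something along these lines (namespace and keys elided, real values filled in):

kubectl -n <namespace> create secret generic oci-store --from-literal=<key>=<value>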

We also went the other way and waited for the timeout to expire while the pod was waiting, but the TaskRun never switches to timed out. And of course the PipelineRun is still failed, and the pod hangs around, never getting deleted.

I wonder, @RafaeLeal: you mentioned that the TaskRun fails and the pod recovers, but too late. Is the TaskRun terminated at that point, with a completionTime, or is it still waiting? I wonder whether your problem is the same as ours, or whether we need to open a separate issue.
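
Something like this would show both whether a completionTime was set and what the Succeeded condition reports (name and namespace are placeholders):

kubectl -n <namespace> get taskrun <taskrun-name> -o jsonpath='{.status.completionTime}{"\n"}{.status.conditions[?(@.type=="Succeeded")].reason}{"\n"}'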

codegold79 avatar Mar 25 '24 20:03 codegold79

There are a few other similar issues, some closed due to inactivity, but this issue (#6960) seems closest to what my team is seeing.

  • https://github.com/tektoncd/pipeline/issues/3897
  • https://github.com/tektoncd/pipeline/issues/2268
  • https://github.com/tektoncd/pipeline/issues/7573

codegold79 avatar Mar 25 '24 20:03 codegold79