pipeline icon indicating copy to clipboard operation
pipeline copied to clipboard

TestPipelineRunTimeout is flaky

Open mattmoor opened this issue 5 years ago • 5 comments

Expected Behavior

TestPipelineRunTimeout consistently passes with the properly observed failure mode.

Actual Behavior

TestPipelineRunTimeout often observes a "Failed" status on the PipelineRun(!!!).

Steps to Reproduce the Problem

Run the test a lot.

I will post my analysis here shortly, as I believe I know WHY this is happening.

mattmoor avatar Oct 26 '20 18:10 mattmoor

Every so often this test will fail with:

timeout_test.go:102: Error waiting for PipelineRun pipeline-run-timeout-qaakhkfe to finish: "pipeline-run-timeout-qaakhkfe" completed with the wrong reason: Failed

The setup is a task that sleeps for 10s and a pipeline that has a 5s timeout: https://github.com/tektoncd/pipeline/blob/adc4cfa733af86874017d07459cc5ad62bf82bd3/test/v1alpha1/timeout_test.go#L49-L61

This check expects the "Reason" to be "PipelineRunTimeout": https://github.com/tektoncd/pipeline/blob/adc4cfa733af86874017d07459cc5ad62bf82bd3/test/v1alpha1/timeout_test.go#L100-L103

However, when the test intermittently fails the status shows the TaskRun timed out, but the PipelineRun Failed:

        spec:
          pipelineRef:
            name: pipeline-run-timeout-kjlunngp
          serviceAccountName: default
          timeout: 5s     <-- Expected to timeout (the task sleeps for 10s)
        status:
          completionTime: "2020-10-24T12:37:13Z"
          conditions:
          - lastTransitionTime: "2020-10-24T12:37:13Z"
            message: 'Tasks Completed: 1 (Failed: 1, Cancelled 0), Skipped: 0'
            reason: Failed    <-- We get Failed instead of PipelineRunTimeout
            status: "False"
            type: Succeeded
          pipelineSpec:
            tasks:
            - name: foo
              taskRef:
                kind: Task
                name: pipeline-run-timeout-lfsmutbs
          startTime: "2020-10-24T12:37:09Z"
          taskRuns:
            pipeline-run-timeout-mncrmugc-foo-74kf8:
              pipelineTaskName: foo
              status:
                completionTime: "2020-10-24T12:37:13Z"
                conditions:
                - lastTransitionTime: "2020-10-24T12:37:13Z"
                  message: TaskRun "pipeline-run-timeout-mncrmugc-foo-74kf8" failed to finish within "5s"
                  reason: TaskRunTimeout   <-- The task run timed out (as expected) though
                  status: "False"
                  type: Succeeded

The irony is that the TaskRun gets its timeout from the PipelineRun here: https://github.com/tektoncd/pipeline/blob/6d8f4513213a0a85f5164e0ed13b8a15c1839f89/pkg/reconciler/pipelinerun/pipelinerun.go#L807-L816

When we transition the PipelineRun to "Failed" we skip this check: https://github.com/tektoncd/pipeline/blob/3185d055fa0959830d3c986407a1e6e58f074003/pkg/reconciler/pipelinerun/resources/pipelinerunstate.go#L206-L218

... digging into why, it looks at the StartTime in status, and in the cases where we fail the StartTime is less than 5s from the time of completion(!):

          	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
          		StartTime:      s"2020-10-25 20:35:07 +0000 UTC",
        - 		CompletionTime: nil,
        + 		CompletionTime: s"2020-10-25 20:35:11.874768406 +0000 UTC m=+601.330431507",
          		TaskRuns: map[string]*v1beta1.PipelineRunTaskRunStatus{

The same status update shows us propagating the following for the TaskRun:

          					TaskRunStatusFields: v1beta1.TaskRunStatusFields{
          						PodName:        "pipeline-run-timeout-qaakhkfe-foo-zzlbf-pod-577ph",
          						StartTime:      s"2020-10-25 20:35:06 +0000 UTC",
        - 						CompletionTime: nil,
        + 						CompletionTime: s"2020-10-25 20:35:11 +0000 UTC",

mattmoor avatar Oct 26 '20 19:10 mattmoor

Generally this means that any PipelineRun specifying a Timeout may simple show up as Failed due to this race.

I think that my $0.02 on the appropriate "fix" would be that the PipelineRun's "StartTime" should never be greater than the "StartTime" of its constituent TaskRuns. This should guarantee that any copied-down timeout never manifests this way.

cc @vdemeester @afrittoli @ImJasonH @bobcatfish for thoughts 🙏

mattmoor avatar Oct 26 '20 19:10 mattmoor

Sorry, I should mention that the first bit of yaml is actually a dump from a prior run that I lifted from my older bug that this replaces. So don't try to rationalize it with the later "diffs".

The complete run for the latest failure (that I debugged) was here: https://github.com/mattmoor/mink/runs/1305814401?check_suite_focus=true#step:17:58

mattmoor avatar Oct 26 '20 19:10 mattmoor

I think that this is derived from state informer caches and the retry logic for updating status that we have.

On a subsequent reconciliation of the PipelineRun (+1s) from the stale uninitialized version, we re-InitializeConditions, which resets StartTime, but a prior reconciliation had already created the TaskRuns (likely what triggered us to be reconciled).

The diff for this status update looks like:

    kubelogs.go:197: D 20:35:07.329 controlplane-2 [pipelinerun/reconciler.go:322] [arendelle-px7ff/pipeline-run-timeout-qaakhkfe] Updating status with:   v1beta1.PipelineRunStatus{
          	Status: v1beta1.Status{
          		ObservedGeneration: 0,
        - 		Conditions:         nil,
        + 		Conditions: v1beta1.Conditions{
        + 			{
        + 				Type:               "Succeeded",
        + 				Status:             "Unknown",
        + 				LastTransitionTime: apis.VolatileTime{Inner: s"2020-10-25 20:35:07.188165477 +0"...},
        + 				Reason:             "Running",
        + 				Message:            "Tasks Completed: 0 (Failed: 0, Cancelled 0), Incomplete: 1, Skip"...,
        + 			},
        + 		},
          		Annotations: nil,
          	},
          	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
        - 		StartTime:      nil,
        + 		StartTime:      s"2020-10-25 20:35:07.055779791 +0000 UTC m=+596.511442892",
          		CompletionTime: nil,
        - 		TaskRuns:       nil,
        + 		TaskRuns: map[string]*v1beta1.PipelineRunTaskRunStatus{
        + 			"pipeline-run-timeout-qaakhkfe-foo-zzlbf": &{PipelineTaskName: "foo", Status: &v1beta1.TaskRunStatus{}},
        + 		},
          		PipelineResults: nil,
        - 		PipelineSpec:    nil,
        + 		PipelineSpec:    &v1beta1.PipelineSpec{Tasks: []v1beta1.PipelineTask{{Name: "foo", TaskRef: &v1beta1.TaskRef{...}}}},
          		SkippedTasks:    nil,
          	},
          }

Note the empty StartTime in the base, despite this in a prior update:

          	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
        - 		StartTime:      nil,
        + 		StartTime:      s"2020-10-25 20:35:06.759884985 +0000 UTC m=+596.215548086",

mattmoor avatar Oct 26 '20 19:10 mattmoor

TestPipelineRunTimeout test is still flaky

jerop avatar Aug 08 '22 14:08 jerop

fixed in https://github.com/tektoncd/pipeline/pull/5134 🤞🏾

thanks @abayer 🎉

jerop avatar Aug 19 '22 20:08 jerop