
Tekton Controller panics while using `enable-api-fields: alpha`

Open RafaeLeal opened this issue 2 years ago • 12 comments

Expected Behavior

The alpha features become available

Actual Behavior

The tekton-pipelines-controller crashed with this panic message: `fatal error: stack overflow`.

I created a gist with the full stack trace: https://gist.github.com/RafaeLeal/00c944a670bdba9a178b23da9eeb0e7c

It seems to have entered a three-function infinite recursion loop.

But I don't know how we got to this loop, or if this logic is indeed meant to be recursive.
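The kind of unbounded mutual recursion described above can be sketched with a minimal, self-contained Go program. All names here are illustrative, not the real Tekton code; the point is that a task whose parameter ultimately references its own result makes the skip evaluation recurse forever, and a visited set is one way to turn the stack overflow into a detectable error.

```go
package main

import (
	"errors"
	"fmt"
)

// task is a simplified stand-in for a resolved pipeline task.
type task struct {
	// resultRefs holds the names of tasks whose results this task's params reference.
	resultRefs []string
}

var errCycle = errors.New("cycle in result references")

// skip reports whether t must be skipped because a referenced result is
// missing; the visited set guards against the unbounded recursion the
// controller hit.
func skip(t string, tasks map[string]task, visited map[string]bool) (bool, error) {
	if visited[t] {
		return false, errCycle
	}
	visited[t] = true
	for _, ref := range tasks[t].resultRefs {
		if _, ok := tasks[ref]; !ok {
			// Referenced task doesn't exist, so its result is missing.
			return true, nil
		}
		skipped, err := skip(ref, tasks, visited)
		if err != nil || skipped {
			return skipped, err
		}
	}
	return false, nil
}

func main() {
	// a-task's parameter references a-task's own result.
	tasks := map[string]task{
		"a-task": {resultRefs: []string{"a-task"}},
	}
	_, err := skip("a-task", tasks, map[string]bool{})
	fmt.Println(err) // without the visited guard this would recurse forever
}
```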

Steps to Reproduce the Problem

I'm not sure exactly how to reproduce this problem, but I had a cluster with around ~400 PipelineRuns running while I updated the feature-flags ConfigMap to set enable-api-fields: alpha. Even after rolling back this change (setting enable-api-fields back to stable), the problem remained. Only after we deleted all PipelineRuns were we able to restore the tekton-pipelines-controller's health.

Additional Info

  • Kubernetes version:

    Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.9-eks-a64ea69", GitCommit:"540410f9a2e24b7a2a870ebfacb3212744b5f878", GitTreeState:"clean", BuildDate:"2022-05-12T19:15:31Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
    
  • Tekton Pipeline version: v0.35.1

RafaeLeal avatar Aug 04 '22 13:08 RafaeLeal

This code path still exists at HEAD, which makes me think this is likely not something that's been fixed since 0.35.1.

abayer avatar Aug 04 '22 13:08 abayer

/assign

My first thought is that `rpt.Skip(facts).SkippingReason == v1beta1.WhenExpressionsSkip` should be replaced with `rpt.skipBecauseWhenExpressionsEvaluatedToFalse(facts)`, since the former is basically a roundabout way of determining whether the latter is true.

I'm going to work on writing a test to reproduce this now.

abayer avatar Aug 04 '22 13:08 abayer

@RafaeLeal Any chance you could provide a PipelineRun that hit this problem? Or the Pipeline for one of those PipelineRuns? I think this is somehow related to parameter behavior behind the alpha flag, either array indexing or parameter propagation, so it'd be great to see what the actual Pipeline looks like in terms of parameters and results...

abayer avatar Aug 04 '22 14:08 abayer

> @RafaeLeal Any chance you could provide a PipelineRun that hit this problem? Or the Pipeline for one of those PipelineRuns? I think this is somehow related to parameter behavior behind the alpha flag, either array indexing or parameter propagation, so it'd be great to see what the actual Pipeline looks like in terms of parameters and results...

So I did a backup of what was running on the cluster at the time, but we didn't manage to work out which PipelineRun was causing this issue: the panic message and stack trace don't give any clue as to which of them was failing, so we had to delete all of them :sadpanda: Let me check if it's safe to share this backup, then I can get back to you.

But what was clear to us is that, behind the alpha flag, the controller must have changed the status of some PipelineRun/TaskRun to something invalid, because even after we rolled back we still had this issue.

RafaeLeal avatar Aug 04 '22 15:08 RafaeLeal

Thanks, I appreciate it - feel free to scrub any information you need to, of course. I think a Pipeline is more of what we need than a PipelineRun. It would definitely be one where $(tasks.someTask.results...) is referenced in another task, if that helps limit things at all.
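For reference, the task-to-task result reference I mean looks like this (a minimal, made-up Pipeline; the names `produce`, `consume`, and `someResult` are just placeholders):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: example-result-reference
spec:
  tasks:
  - name: produce
    taskRef:
      name: produce-task
  - name: consume
    params:
    - name: input
      # Result reference into another task's result
      value: $(tasks.produce.results.someResult)
    taskRef:
      name: consume-task
```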

abayer avatar Aug 04 '22 15:08 abayer

Ok, @RafaeLeal sent me a dump of the PipelineRuns, and so far I'm not seeing anything obvious. Right now, my theory is that it's something specific to changing to alpha mid-execution...which is a difficult thing to reproduce in a test. =) If I can't get any further with a reproduction, I'll just open a PR making the code change I mentioned above without a reproduction test - I'm very confident that this particular bug will be fixed by that, though there may well be another bug exposed once that one's gone, etc.

abayer avatar Aug 04 '22 17:08 abayer

> I'm very confident that this particular bug will be fixed by that, though there may well be another bug exposed once that one's gone, etc.

I am very nervous about this kind of fix, as it might introduce more bugs.

@RafaeLeal how useful is it for you to know why a task was skipped? If the skip reason is useful at all, you could provide a great testimonial for keeping it.

pritidesai avatar Aug 04 '22 21:08 pritidesai

> > I'm very confident that this particular bug will be fixed by that, though there may well be another bug exposed once that one's gone, etc.
>
> I am very nervous about this kind of fix, as it might introduce more bugs.
>
> @RafaeLeal how useful is it for you to know why a task was skipped? If the skip reason is useful at all, you could provide a great testimonial for keeping it.

Well, I'm not actually interested in the SkippingReason, but this is creating an infinite recursion loop, which is making the controller unhealthy.

RafaeLeal avatar Aug 04 '22 21:08 RafaeLeal

Yeah, the root problem here is some behavior going weird when changing enable-api-fields to alpha while a PipelineRun is running, with the result that a possible infinite loop when calling skipBecauseResultReferencesAreMissing gets hit. The fix I’m proposing isn’t a fix for whatever is going weird when changing the flag, but very specifically cutting out an unnecessary call to Skip, thereby getting rid of the possible infinite loop without changing any actual behavior.

abayer avatar Aug 04 '22 21:08 abayer

related: https://github.com/tektoncd/pipeline/pull/3524

lbernick avatar Aug 10 '22 13:08 lbernick

@RafaeLeal this issue is reported for v0.35.* and since we are not able to reproduce it, can you please test this with the latest pipeline release v0.38.*?

pritidesai avatar Aug 10 '22 18:08 pritidesai

> @RafaeLeal this issue is reported for v0.35.* and since we are not able to reproduce it, can you please test this with the latest pipeline release v0.38.*?

To be honest I don't have a way to reproduce it myself. At least not a way without causing an incident.

RafaeLeal avatar Aug 11 '22 02:08 RafaeLeal

Ok, I've gotten a bit closer to reproducing this. I dove into the dump @RafaeLeal gave me, which is all PipelineRuns with embedded pipeline and task specs, and found a number of PipelineRuns with spec.params containing things like

- name: artifact-helpers
  value: $(tasks.setup.results.artifact-helpers)

...and then the setup task has

params:
- name: artifact-helpers
  value: $(params.artifact-helpers)

This PipelineRun does not have a .status field, which makes sense, since this should hit infinite recursion trying to resolve a task with a param referencing its own result.

I wrote up a test case matching that circular behavior, and it did in fact hit infinite recursion, with either alpha or stable. Ok, that's what one would expect to happen - in practice, we should probably add a validation check to make sure that you're not doing a recursive param/results reference.

Things get really odd when I check the full dump for other cases like this. I found a total of 12 PipelineRuns out of 278 with no .status, where the PipelineRun's params include at least one value of $(tasks.[some task].results.[some result]), and a task named [some task] has a param with a value of $(params.[some PR param]), where [some PR param] is one of the PipelineRun params with a results-containing value. There are plenty of almost-identical PipelineRuns without that self-referential param/results loop, and all of those have .statuses.

I'm currently waiting for @RafaeLeal to let me know if the tooling they're using to create their PipelineRuns often adds those self-referential params/results, or if this was some aberration that only happened once they switched to alpha.

abayer avatar Aug 17 '22 17:08 abayer

Also worth mentioning that each of the "failed" PipelineRuns was created after they switched to alpha, though not every "valid" one was created before then. Valid PipelineRuns were created both before and after the alpha switch; invalid ones only after.

abayer avatar Aug 17 '22 17:08 abayer

A little more detail on that: the enable-api-fields flag was set to alpha at 19:21:45 UTC, and all the PipelineRuns with self-referential params/results were created over the next ~13 minutes, then all subsequent PipelineRuns of the relevant jobs didn't have the self-referential params/results. So either something went weird in @RafaeLeal's tooling in that time period, or something went weird in the actual creation of PipelineRuns during that period. It could be that ~19:35/36 is when @RafaeLeal changed the flag back to stable? Feels like it could be a webhook thing?

abayer avatar Aug 17 '22 17:08 abayer

Ah-ha! I've got it! With 0.35.1, this test fails with the stack overflow:

func TestInfiniteRecursion(t *testing.T) {
	names.TestingSeed()
	prs := []*v1beta1.PipelineRun{parse.MustParsePipelineRun(t, `
metadata:
  name: test-pipeline-run-different-service-accs
  namespace: foo
spec:
  params:
  - name: foo-value
    value: $(tasks.a-task.results.aResult)
  pipelineSpec:
    tasks:
    # a-task is executed and produces a result aResult with value aResultValue
    - name: a-task
      taskSpec:
        results:
        - description: a result
          name: aResult
        steps:
        - name: mystep
          image: myimage
    # b-task is skipped because it has when expressions, with result reference to a-task, that evaluate to false
    - name: b-task
      taskRef:
        name: b-task
      when:
      - input: $(tasks.a-task.results.aResult)
        operator: in
        values:
        - notResultValue
    # c-task is executed regardless of running after skipped b-task because when expressions are scoped to task
    - name: c-task
      runAfter:
      - b-task
      taskRef:
        name: c-task
  serviceAccountName: test-sa-0
`)}
	ts := []*v1beta1.Task{
		{ObjectMeta: baseObjectMeta("b-task", "foo")},
		{ObjectMeta: baseObjectMeta("c-task", "foo")},
	}
	cms := []*corev1.ConfigMap{withEnabledAlphaAPIFields(newFeatureFlagsConfigMap())}

	d := test.Data{
		PipelineRuns: prs,
		Tasks:        ts,
		ConfigMaps:   cms,
	}
	prt := newPipelineRunTest(d, t)
	defer prt.Cancel()

	wantEvents := []string{
		"Normal Started",
		"Normal Running Tasks Completed: 1 \\(Failed: 0, Cancelled 0\\), Incomplete: 1, Skipped: 1",
	}
	_, _ = prt.reconcileRun("foo", "test-pipeline-run-different-service-accs", wantEvents, false)
}

It doesn't hit infinite recursion without alpha set, and it doesn't hit infinite recursion on main. I'm pretty sure this is because of https://github.com/tektoncd/pipeline/commit/e24df8379e7f85e9305f2710e77ba2190c722532#diff-977842cbf8989cf951bfe55adc45bf63ac2ea8579d1c2c0fb8dddb127858d9f1 being removed in 0.36.0 - the original implicit parameters behavior would add params to the pipelineSpec and taskSpecs to match the pipelineRun.params, and since the pipelineRun.params contain $(tasks.a-task.results.aResult), we would end up with the self-referential hell that causes the infinite recursion!

So! I believe this means that Pipeline v0.36.0 and greater shouldn't hit this error. Though we probably should seriously consider failing validation if pipelineRun.params[0].value is something like $(tasks.a-task.results.aResult).
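A sketch of what that validation could look like. This is a hypothetical, self-contained check (the `param` type, function name, and regex are illustrative, not the actual Tekton webhook code):

```go
package main

import (
	"fmt"
	"regexp"
)

// resultRefPattern matches values like $(tasks.a-task.results.aResult).
var resultRefPattern = regexp.MustCompile(`\$\(tasks\.[^.)]+\.results\.[^)]+\)`)

// param is a simplified stand-in for a PipelineRun parameter.
type param struct {
	Name  string
	Value string
}

// validatePipelineRunParams rejects PipelineRun params whose values contain
// task result references, which can only be resolved inside a Pipeline.
func validatePipelineRunParams(params []param) error {
	for _, p := range params {
		if resultRefPattern.MatchString(p.Value) {
			return fmt.Errorf("param %q: result references are not allowed in PipelineRun params", p.Name)
		}
	}
	return nil
}

func main() {
	params := []param{
		{Name: "foo-value", Value: "$(tasks.a-task.results.aResult)"},
	}
	// → param "foo-value": result references are not allowed in PipelineRun params
	fmt.Println(validatePipelineRunParams(params))
}
```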

abayer avatar Aug 17 '22 18:08 abayer

yes, certainly 👍

pipelineRun.params should be validated to restrict task/pipeline result references.

pritidesai avatar Aug 17 '22 18:08 pritidesai

@RafaeLeal confirmed it - enable-api-fields was set back to stable at 19:36. We've definitely found what the problem was, and it doesn't exist in v0.36.0 and later. I'm not going to close this, though - I'm trying to figure out the right way to validate PipelineRunSpec.Params to reject result references, and once I've nailed that, I'll open a PR.

abayer avatar Aug 18 '22 12:08 abayer

thank you @abayer, I have updated the issue title and reprioritized it.

pritidesai avatar Aug 19 '22 06:08 pritidesai

@abayer should we also validate that the parameters are not duplicated? This was also happening because of a misconfiguration, and no error was raised.
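The duplicate check being asked about is straightforward to sketch. This is illustrative only (the real validation would live in Tekton's admission webhook, and these names are made up):

```go
package main

import "fmt"

// duplicateParamNames reports which parameter names appear more than once,
// in first-duplicate order.
func duplicateParamNames(names []string) []string {
	seen := map[string]int{}
	var dups []string
	for _, n := range names {
		seen[n]++
		if seen[n] == 2 {
			dups = append(dups, n)
		}
	}
	return dups
}

func main() {
	// "artifact-helpers" is declared twice, so it should be reported.
	fmt.Println(duplicateParamNames([]string{"artifact-helpers", "foo", "artifact-helpers"})) // → [artifact-helpers]
}
```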

RafaeLeal avatar Aug 19 '22 16:08 RafaeLeal

@RafaeLeal Good question. I'm honestly not sure why we weren't already failing for that...

abayer avatar Aug 19 '22 16:08 abayer