pipeline
pipeline copied to clipboard
Tekton Controller panics while using `enable-api-fields: alpha`
Expected Behavior
The alpha
features become available
Actual Behavior
The tekton-pipelines-controller crashed with this panic message: fatal error: stack overflow
.
I created a gist with the full stack trace https://gist.github.com/RafaeLeal/00c944a670bdba9a178b23da9eeb0e7c
It seems to have entered a 3-function infinite recursion loop:
- On function skipBecauseResultReferencesAreMissing we call Skip
- then on function Skip we call skip
- then on function skip we call back to skipBecauseResultReferencesAreMissing
But I don't know how we got to this loop, or if this logic is indeed meant to be recursive.
Steps to Reproduce the Problem
I'm not sure exactly how to reproduce this problem.. but I had a cluster with around ~400 pipelineruns running while I updated the config map feature-flags
to set enable-api-fields: alpha
.
Even after rolling back this change (setting enable-api-fields
back to stable
), the problem remained. Only after we deleted all pipelineruns we were able to restore the tekton-pipelines-controller
health.
Additional Info
-
Kubernetes version:
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.9-eks-a64ea69", GitCommit:"540410f9a2e24b7a2a870ebfacb3212744b5f878", GitTreeState:"clean", BuildDate:"2022-05-12T19:15:31Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
-
Tekton Pipeline version:
v0.35.1
This code path still exists in HEAD - makes me think this is likely not something fixed since 0.35.1.
/assign
My first thought is that rpt.Skip(facts).SkippingReason == v1beta1.WhenExpressionsSkip
should be replaced with rpt.skipBecauseWhenExpressionsEvaluatedToFalse(facts)
, since the former is basically a roundabout way of determining if the latter is true.
I'm going to work on writing a test to reproduce this now.
@RafaeLeal Any chance you could provide a PipelineRun
that hit this problem? Or the Pipeline
for one of those PipelineRun
s? I think this is somehow related to parameter behavior behind the alpha
flag, either array indexing or parameter propagation, so it'd be great to see what the actual Pipeline
looks like in terms of parameters and results...
@RafaeLeal Any chance you could provide a PipelineRun that hit this problem? Or the Pipeline for one of those PipelineRuns? I think this is somehow related to parameter behavior behind the alpha flag, either array indexing or parameter propagation, so it'd be great to see what the actual Pipeline looks like in terms of parameters and results...
So I did a backup from what was running on the cluster at that time.. but we didn't manage to understand which pipeline run was causing this issue... the panic message and stack trace doesn't give us any clue of which of them was failing.. so we had to delete all of them :sadpanda: Let me check if it's safe to share this backup, then I can get back to you.
But what was clear for us is that behind the alpha
flag the controller must have changed the status
of some PipelineRun/TaskRun to something invalid, because even when we rollbacked we still had this issue.
Thanks, I appreciate it - feel free to scrub any information you need to, of course. I think a Pipeline
is more of what we need than a PipelineRun
. It would definitely be one where $(tasks.someTask.results...)
is referenced in another task, if that helps limit things at all.
Ok, @RafaeLeal sent me a dump of the PipelineRun
s, and so far I'm not seeing anything obvious. Right now, my theory is that it's something specific to changing to alpha
mid-execution...which is a difficult thing to reproduce in a test. =) If I can't get any further with a reproduction, I'll just open a PR making the code change I mentioned above without a reproduction test - I'm very confident that this particular bug will be fixed by that, though there may well be another bug exposed once that one's gone, etc.
I'm very confident that this particular bug will be fixed by that, though there may well be another bug exposed once that one's gone, etc.
I am very nervous with this kind fix as it might introduce more bugs.
@RafaeLeal how useful is it for you to know why a task was skipped? You might provide a great testimony if the why skipped is useful at all.
I'm very confident that this particular bug will be fixed by that, though there may well be another bug exposed once that one's gone, etc.
I am very nervous with this kind fix as it might introduce more bugs.
@RafaeLeal how useful is it for you to know why a task was skipped? You might provide a great testimony if the why skipped is useful at all.
Well, I'm not actually interested in SkippingReason
, but this is creating an infinite recursion loop, which is making the controller unhealthy
Yeah, the root problem here is some behavior going weird when changing enable-api-fields
to alpha
while a PipelineRun
is running, with the result that a possible infinite loop when calling skipBecauseResultReferencesAreMissing
gets hit. The fix I’m proposing isn’t a fix for whatever is going weird when changing the flag, but very specifically cutting out an unnecessary call to Skip
, thereby getting rid of the possible infinite loop without changing any actual behavior.
related: https://github.com/tektoncd/pipeline/pull/3524
@RafaeLeal this issue is reported for v0.35.*
and since we are not able to reproduce it, can you please test this with latest pipeline release v0.38.*
?
@RafaeLeal this issue is reported for
v0.35.*
and since we are not able to reproduce it, can you please test this with latest pipeline releasev0.38.*
?
To be honest I don't have a way to reproduce it myself. At least not a way without causing an incident.
Ok, I've gotten a bit closer to reproducing this. I dove into the dump @RafaeLeal gave me, which is all PipelineRun
s with embedded pipeline and task specs, and found a number of PipelineRun
s with spec.params
containing things like
- name: artifact-helpers
value: $(tasks.setup.results.artifact-helpers)
...and then the setup
task has
params:
- name: artifact-helpers
value: $(params.artifact-helpers)
This PipelineRun
does not have a .status
field, which makes sense, since this should hit infinite recursion trying to resolve a task with a param referencing its own result.
I wrote up a test case matching that circular behavior, and it did in fact hit infinite recursion, either with alpha
or stable
. Ok, that's what one would expect to happen - in practice, we should maybe have a validation check to make sure that you're not doing a recursive param/results reference.
Things get really odd when I check the full dump for any other cases like this - I found a total of 12 PipelineRun
s out of 278 with no .status
, params
on the PipelineRun
with at least one of them having a value of $(tasks.[some task].results.[some result])
, and a task named [some task]
with a param with a value of $(params.[some PR param])
and [some PR param]
is one of the PR params with a results
-containing param value. There are plenty of almost-identical PRs without that self-referential param/results loop, and all of those have .status
es.
I'm currently waiting for @RafaeLeal to let me know if the tooling they're using to create their PipelineRun
s often adds those self-referential params/results, or if this was some aberration that only happened once they switched to alpha
.
Also worth mentioning that each of the "failed" PRs were created after they switched to alpha
, though not every "valid" PR was created before then. Valid PRs were created both before and after the alpha
switch, invalid only after.
A little more detail on that: the enable-api-fields
flag was set to alpha
at 19:21:45 UTC, and all the PipelineRun
s with self-referential params/results were creating over the next ~13 minutes, then all subsequent PipelineRun
s of the relevant jobs didn't have the self-referential params/results. So either something went weird in @RafaeLeal's tooling in that time period, or something went weird in the actual creation of PipelineRun
s during that period. It could be that ~19:35/36 is when @RafaeLeal changed the flag back to stable
? Feels like it could be a webhook thing?
Ah-ha! I've got it! With 0.35.1, this test fails with the stack overflow:
func TestInfiniteRecursion(t *testing.T) {
names.TestingSeed()
prs := []*v1beta1.PipelineRun{parse.MustParsePipelineRun(t, `
metadata:
name: test-pipeline-run-different-service-accs
namespace: foo
spec:
params:
- name: foo-value
value: $(tasks.a-task.results.aResult)
pipelineSpec:
tasks:
# a-task is executed and produces a result aResult with value aResultValue
- name: a-task
taskSpec:
results:
- description: a result
name: aResult
steps:
- name: mystep
image: myimage
# b-task is skipped because it has when expressions, with result reference to a-task, that evaluate to false
- name: b-task
taskRef:
name: b-task
when:
- input: $(tasks.a-task.results.aResult)
operator: in
values:
- notResultValue
# c-task is executed regardless of running after skipped b-task because when expressions are scoped to task
- name: c-task
runAfter:
- b-task
taskRef:
name: c-task
serviceAccountName: test-sa-0
`)}
ts := []*v1beta1.Task{
{ObjectMeta: baseObjectMeta("b-task", "foo")},
{ObjectMeta: baseObjectMeta("c-task", "foo")},
}
cms := []*corev1.ConfigMap{withEnabledAlphaAPIFields(newFeatureFlagsConfigMap())}
d := test.Data{
PipelineRuns: prs,
Tasks: ts,
ConfigMaps: cms,
}
prt := newPipelineRunTest(d, t)
defer prt.Cancel()
wantEvents := []string{
"Normal Started",
"Normal Running Tasks Completed: 1 \\(Failed: 0, Cancelled 0\\), Incomplete: 1, Skipped: 1",
}
_, _ = prt.reconcileRun("foo", "test-pipeline-run-different-service-accs", wantEvents, false)
}
It doesn't hit infinite recursion without alpha
set, and it doesn't hit infinite recursion on main
. I'm pretty sure this is because of https://github.com/tektoncd/pipeline/commit/e24df8379e7f85e9305f2710e77ba2190c722532#diff-977842cbf8989cf951bfe55adc45bf63ac2ea8579d1c2c0fb8dddb127858d9f1 being removed in 0.36.0 - the original implicit parameters behavior would add params
to the pipelineSpec
and taskSpec
s to match the pipelineRun.params
, and since the pipelineRun.params
contain $(tasks.a-task.results.aResult)
, we would end up with the self-referential hell that causes the infinite recursion!
So! I believe this means that Pipeline v0.36.0 and greater shouldn't hit this error. Though we probably should seriously consider failing validation if pipelineRun.params[0].value
is something like $(tasks.a-task.results.aResult)
.
yes, certainly 👍
pipelineRun.params
should be validated to restrict task/pipeline result references.
@RafaeLeal confirmed it - enable-api-fields
was set back to stable
at 19:36. We've definitely found what the problem was, and it doesn't exist in v0.36.0 and later. I'm not going to close this, though - I'm trying to figure out the right way to validate PipelineRunSpec.Params
to reject result references, and once I've nailed that, I'll open a PR.
thank you @abayer, I have updated the issue title and reprioritized it.
@abayer should we also validate that the parameters are not duplicated? This was also happening because of misconfiguration and no error was raised
@RafaeLeal Good question. I'm honestly not sure why we weren't already failing for that...