serving
serving copied to clipboard
Fix: Raise the reconciliation timeout from 10 to 30s.
This was actually something I had asked for a fairly long time ago, which I had wanted to move into genreconciler to encourage folks to move "long" work off of the main reconciliation thread. 10s was sort of pulled out of thin air, and seemed like a good idea at the time! However, this arbitrary choice has come back to bite me!
Investigating a rash of timeouts in the sigstore policy-controller webhook, I noticed that a great deal of requests were timing out at 10s despite the webhook having a 25s timeoutSeconds.
Breaking this down by resource kind, I found that the only resources timing out at 9-10s were ALL deployments, where replicasets and pods had some timeouts, but almost all were at the 25s mark.
So basically when Knative reconciles Deployments, it is setting a 10s deadline on its outbound API requests, including Deployment updates, which pass this deadline on to downstream requests such as webhooks.
My reasoning for advocating for 30s is not arbitrary, it is based on this being the maximum value that a webhook can have for its timeoutSeconds field: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#timeouts
This means that Knative wouldn't be artificially lowering the webhook timeout on any webhooks configured on the resources it is managing directly.
Increase the outbound context deadline in reconcilers to 30s (from 10s) to match the maximum K8s webhook timeout.
cc @dprotaso
cc @evankanderson @vaikas
Codecov Report
Base: 86.50% // Head: 86.45% // Decreases project coverage by -0.04% :warning:
Coverage data is based on head (
83dfafa) compared to base (2332731). Patch coverage: 73.33% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## main #13323 +/- ##
==========================================
- Coverage 86.50% 86.45% -0.05%
==========================================
Files 196 196
Lines 14544 14548 +4
==========================================
- Hits 12581 12578 -3
- Misses 1664 1670 +6
- Partials 299 300 +1
| Impacted Files | Coverage Δ | |
|---|---|---|
| cmd/controller/main.go | 0.00% <0.00%> (ø) |
|
| pkg/reconciler/autoscaling/hpa/hpa.go | 91.66% <100.00%> (ø) |
|
| pkg/reconciler/autoscaling/kpa/kpa.go | 95.26% <100.00%> (ø) |
|
| pkg/reconciler/configuration/configuration.go | 82.93% <100.00%> (-1.43%) |
:arrow_down: |
| pkg/reconciler/domainmapping/reconciler.go | 93.68% <100.00%> (ø) |
|
| pkg/reconciler/gc/reconciler.go | 100.00% <100.00%> (ø) |
|
| pkg/reconciler/labeler/labeler.go | 100.00% <100.00%> (ø) |
|
| pkg/reconciler/nscert/nscert.go | 72.03% <100.00%> (ø) |
|
| pkg/reconciler/revision/revision.go | 92.13% <100.00%> (ø) |
|
| pkg/reconciler/route/route.go | 80.06% <100.00%> (ø) |
|
| ... and 3 more |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
Changing one hard coded value for something else with no way to override does not seem ideal. Since it seems like some resources will maybe require specifying different values, I think my order of preference would be:
- make this configurable across the board as a config option
- make this configurable by resource type
But I could be missing something obvs here :)
@vaikas I think my preference would be to make this a common global variable which folks can then choose to expose as a flag via flag.DurationVar, but I don't have a preference. Where would you want it?
re: 30s
Unlike before, this wasn't arbitrary, it is the maximum webhook timeoutSeconds. That said, some reconcilers reconcile multiple resources, which is why I'm definitely receptive to something configurable, but I'm too tired from debugging this to have strong opinions on where it goes or what it's called 🤣
Yeah, I think a global flag is fine as long as it's configurable I'm fine with it. Would it be bad to put it in knative.dev/pkg/apis/contexts.go which has bunch of contexts already? or knative.dev/pkg/reconciler/ since it contains already the common reconcile shape? Only thinking that way other things using Knative could get bennies from it too?
I opened https://github.com/knative/pkg/pull/2597 for this. Since this is intended to bound reconciliation, I went with reconciler.DefaultTimeout. Once that goes in, I'll rebase this once the nightly dep update merges it into serving.
@evankanderson see https://github.com/knative/pkg/pull/2597
/lgtm
/hold
I don't want this going in before it uses the global
/unhold
Ok, this is rebased on the latest deps update and using the global now.
Do we want to expose a flag in the main controller so that this value is configurable, or wait until the new raised limit is insufficient?
Do we want to expose a flag in the main controller so that this value is configurable, or wait until the new raised limit is insufficient?
I'd lean towards exposing it now.
Ok, there is now a flag for this.
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mattmoor, psschwei, vaikas
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~cmd/OWNERS~~ [psschwei]
- ~~pkg/OWNERS~~ [psschwei]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment