serving icon indicating copy to clipboard operation
serving copied to clipboard

Fix: Raise the reconciliation timeout from 10 to 30s.

Open mattmoor opened this issue 3 years ago • 13 comments

This was actually something I had asked for a fairly long time ago, which I had wanted to move into genreconciler to encourage folks to move "long" work off of the main reconciliation thread. 10s was sort of pulled out of thin air, and seemed like a good idea at the time! However, this arbitrary choice has come back to bite me!

Investigating a rash of timeouts in the sigstore policy-controller webhook, I noticed that a great deal of requests were timing out at 10s despite the webhook having a 25s timeoutSeconds.

Breaking this down by resource kind, I found that the only resources timing out at 9-10s were ALL deployments, where replicasets and pods had some timeouts, but almost all were at the 25s mark.

So basically when Knative reconciles Deployments, it is setting a 10s deadline on its outbound API requests, including Deployment updates, which pass this deadline on to downstream requests such as webhooks.

My reasoning for advocating for 30s is not arbitrary, it is based on this being the maximum value that a webhook can have for its timeoutSeconds field: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#timeouts

This means that Knative wouldn't be artificially lowering the webhook timeout on any webhooks configured on the resources it is managing directly.

Increase the outbound context deadline in reconcilers to 30s (from 10s) to match the maximum K8s webhook timeout.

cc @dprotaso

mattmoor avatar Sep 20 '22 22:09 mattmoor

cc @evankanderson @vaikas

mattmoor avatar Sep 20 '22 22:09 mattmoor

Codecov Report

Base: 86.50% // Head: 86.45% // Decreases project coverage by -0.04% :warning:

Coverage data is based on head (83dfafa) compared to base (2332731). Patch coverage: 73.33% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13323      +/-   ##
==========================================
- Coverage   86.50%   86.45%   -0.05%     
==========================================
  Files         196      196              
  Lines       14544    14548       +4     
==========================================
- Hits        12581    12578       -3     
- Misses       1664     1670       +6     
- Partials      299      300       +1     
Impacted Files Coverage Δ
cmd/controller/main.go 0.00% <0.00%> (ø)
pkg/reconciler/autoscaling/hpa/hpa.go 91.66% <100.00%> (ø)
pkg/reconciler/autoscaling/kpa/kpa.go 95.26% <100.00%> (ø)
pkg/reconciler/configuration/configuration.go 82.93% <100.00%> (-1.43%) :arrow_down:
pkg/reconciler/domainmapping/reconciler.go 93.68% <100.00%> (ø)
pkg/reconciler/gc/reconciler.go 100.00% <100.00%> (ø)
pkg/reconciler/labeler/labeler.go 100.00% <100.00%> (ø)
pkg/reconciler/nscert/nscert.go 72.03% <100.00%> (ø)
pkg/reconciler/revision/revision.go 92.13% <100.00%> (ø)
pkg/reconciler/route/route.go 80.06% <100.00%> (ø)
... and 3 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Sep 20 '22 22:09 codecov[bot]

Changing one hard coded value for something else with no way to override does not seem ideal. Since it seems like some resources will maybe require specifying different values, I think my order of preference would be:

  • make this configurable across the board as a config option
  • make this configurable by resource type

But I could be missing something obvs here :)

vaikas avatar Sep 20 '22 22:09 vaikas

@vaikas I think my preference would be to make this a common global variable which folks can then choose to expose as a flag via flag.DurationVar, but I don't have a preference. Where would you want it?

re: 30s

Unlike before, this wasn't arbitrary, it is the maximum webhook timeoutSeconds. That said, some reconcilers reconcile multiple resources, which is why I'm definitely receptive to something configurable, but I'm too tired from debugging this to have strong opinions on where it goes or what it's called 🤣

mattmoor avatar Sep 20 '22 22:09 mattmoor

Yeah, I think a global flag is fine as long as it's configurable I'm fine with it. Would it be bad to put it in knative.dev/pkg/apis/contexts.go which has bunch of contexts already? or knative.dev/pkg/reconciler/ since it contains already the common reconcile shape? Only thinking that way other things using Knative could get bennies from it too?

vaikas avatar Sep 20 '22 23:09 vaikas

I opened https://github.com/knative/pkg/pull/2597 for this. Since this is intended to bound reconciliation, I went with reconciler.DefaultTimeout. Once that goes in, I'll rebase this once the nightly dep update merges it into serving.

mattmoor avatar Sep 21 '22 01:09 mattmoor

@evankanderson see https://github.com/knative/pkg/pull/2597

mattmoor avatar Sep 21 '22 01:09 mattmoor

/lgtm

evankanderson avatar Sep 21 '22 02:09 evankanderson

/hold

I don't want this going in before it uses the global

mattmoor avatar Sep 21 '22 02:09 mattmoor

/unhold

Ok, this is rebased on the latest deps update and using the global now.

Do we want to expose a flag in the main controller so that this value is configurable, or wait until the new raised limit is insufficient?

mattmoor avatar Sep 21 '22 15:09 mattmoor

Do we want to expose a flag in the main controller so that this value is configurable, or wait until the new raised limit is insufficient?

I'd lean towards exposing it now.

psschwei avatar Sep 21 '22 20:09 psschwei

Ok, there is now a flag for this.

mattmoor avatar Sep 24 '22 15:09 mattmoor

/lgtm /approve

vaikas avatar Sep 25 '22 19:09 vaikas

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, psschwei, vaikas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

knative-prow[bot] avatar Sep 25 '22 23:09 knative-prow[bot]