
Make maximum delay of prober in its backoff configurable

Open SaschaSchwarze0 opened this issue 1 year ago • 5 comments

Changes

We are running a larger installation with Knative Serving + Istio + NetIstio. What we observe is that the more services there are, the longer it takes to provision a new KService. This is fine, but at some point the times not only increase a lot, they also jump in discrete steps. E.g. we measure many values around 21 seconds and around 35 seconds, but rarely in between. The image is always available, so the time it takes to get the Configuration ready is quite stable; the differences show up in the Route.

We think this is caused by the prober, which uses an exponential backoff with a maximum delay of 30 seconds.

The following table shows how long the prober waits in total depending on how many tries it needs, ignoring the time the probes themselves take (i.e., just summing up the delays):

| Try | Backoff (s) | Sum (s) |
| ---:| -----------:| -------:|
|   0 |        0    |    0    |
|   1 |        0.05 |    0.05 |
|   2 |        0.1  |    0.15 |
|   3 |        0.2  |    0.35 |
|   4 |        0.4  |    0.75 |
|   5 |        0.8  |    1.55 |
|   6 |        1.6  |    3.15 |
|   7 |        3.2  |    6.35 |
|   8 |        6.4  |   12.75 |
|   9 |       12.8  |   25.55 |
|  10 |       25.6  |   51.15 |
|  11 |       30    |   81.15 |
|  12 |       30    |  111.15 |

Those numbers match what we observe very well. For example, the provisions that took roughly 21 seconds needed eight probes (12.75 s of delay), and when nine probes (25.55 s) were needed, the overall provisioning took around 35 seconds. And yes, we also see provisions that take a little over a minute, where presumably ten tries were needed before the probe succeeded.

--

While we are obviously spending time improving this with Istio tuning, we would also like to experiment with lower values for the maximum delay. We think that higher delays (5 seconds and more) are reached too early, namely after only 6.35 seconds of cumulative waiting, and compared to the overall load on our system, the number of probes is so small that additional failed probes would not cause any harm.

This PR therefore introduces a PROBE_MAX_RETRY_DELAY_SECONDS environment variable that, if set, is used as the maximum delay.

If you think that makes no sense as a general change, then just close it. If you think it should be exposed in another way, let me know.

/kind enhancement

Release Note

You can now set a PROBE_MAX_RETRY_DELAY_SECONDS environment variable on your networking component to define a custom maximum retry delay for the prober, which will be used instead of the default of 30 seconds.

SaschaSchwarze0 avatar Aug 20 '24 11:08 SaschaSchwarze0

Hi @SaschaSchwarze0. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow[bot] avatar Aug 20 '24 11:08 knative-prow[bot]

/ok-to-test

skonto avatar Sep 05 '24 10:09 skonto

@SaschaSchwarze0 hi, thank you for the PR.

If you think that makes no sense as a general change, then just close it. If you think it should be exposed in another way, let me know.

@dprotaso any objections on this one? From a tech pov if it solves a problem I am fine, at the end of the day the default remains the same.

We think that this has to do with the prober which is containing an exponential back-off with a maximum delay of 30 seconds.

It would be nice to know how different values affect the large deployment to have at least some evidence.

skonto avatar Sep 05 '24 10:09 skonto

It would be nice to know how different values affect the large deployment to have at least some evidence.

This is a screenshot of one of our larger clusters with several thousand KServices. It renders the duration between the creationTimestamp and the ready condition of a KService. It is always the same KService with the same image.

At 10:04, PROBE_MAX_RETRY_DELAY_SECONDS was set to 5 (before it was not set which means it was 30).

[screenshot: KService readiness durations over time, dropping after the change at 10:04]

The change obviously does not fix the spikes, because those occur when, for example, the Pod lands on a Node that does not have the image present. But we can see how it brings down the time in cases where things are actually ready but the prober previously waited longer before checking again.

SaschaSchwarze0 avatar Sep 05 '24 12:09 SaschaSchwarze0

@dprotaso @skonto just noticed this did not move forward. Anything else you need from me?

SaschaSchwarze0 avatar Oct 28 '24 13:10 SaschaSchwarze0

/lgtm
/approve

/hold for @ReToCode

skonto avatar Nov 15 '24 11:11 skonto

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SaschaSchwarze0, skonto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

knative-prow[bot] avatar Nov 15 '24 11:11 knative-prow[bot]

/unhold

skonto avatar Dec 13 '24 08:12 skonto

@SaschaSchwarze0 hi could you do the follow up PR at the net-istio side?

skonto avatar Dec 13 '24 08:12 skonto

@SaschaSchwarze0 hi could you do the follow up PR at the net-istio side?

Done @skonto. Here it is: https://github.com/knative-extensions/net-istio/pull/1395

SaschaSchwarze0 avatar Jan 14 '25 08:01 SaschaSchwarze0