internal/contour: HoldoffMaxDelay timer should increase if there are pending updates
Currently, in handler.go, the holdoff timer has a maximum delay, after which it will fire a forced update.
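As a rough illustration of that pattern (a simplified sketch, not the actual handler.go code; the package, channel, function names, and durations here are placeholders): each change restarts a short holdoff timer, while a second timer bounds how long publishing can be deferred.

```go
package holdoffsketch

import "time"

// Placeholder durations; the real values are configurable in Contour.
const (
	holdoffDelay    = 100 * time.Millisecond // quiet period after the last change
	holdoffMaxDelay = 500 * time.Millisecond // upper bound before a forced update
)

// run batches change notifications: each change restarts the short holdoff
// timer, while the max-delay timer forces an update even if changes never
// stop arriving.
func run(changes <-chan struct{}, publish func()) {
	var holdoff, maxWait *time.Timer
	for {
		select {
		case <-changes:
			stop(holdoff)
			holdoff = time.NewTimer(holdoffDelay)
			if maxWait == nil {
				maxWait = time.NewTimer(holdoffMaxDelay)
			}
		case <-timerC(holdoff):
			stop(maxWait)
			publish()
			holdoff, maxWait = nil, nil
		case <-timerC(maxWait):
			// Forced update: the holdoff timer kept being pushed forward.
			stop(holdoff)
			publish()
			holdoff, maxWait = nil, nil
		}
	}
}

// timerC returns the timer's channel, or a nil channel (which blocks forever
// in a select) when the timer is unset.
func timerC(t *time.Timer) <-chan time.Time {
	if t == nil {
		return nil
	}
	return t.C
}

func stop(t *time.Timer) {
	if t != nil {
		t.Stop()
	}
}
```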
As part of an investigation into a large amount of memory being used by Envoy at startup, we came up with the theory that, if it takes longer than HoldoffMaxDelay to process all the objects in a cluster, we could end up firing multiple times, possibly many times.
This would create a lot of separate Envoy configs very quickly, which would all also drain quickly (since there would be very few connections actually using each one).
A way to mitigate this may be to manage holdoffMaxDelay dynamically, increasing it for the next cycle if updates are still pending when a max-delay update fires. It should reset to its original value once pending updates finish draining.
This would produce a backoff-style behaviour in Envoy updates when there are a lot of updates pending (such as at startup), which should hopefully help avoid generating so many configs.
I'd like to take a stab at this. The goal of holdoffMaxDelay was to prevent the holdoffDelay timer being slowloris'd forward indefinitely. At the moment holdoffMaxDelay will fire twice per second, even in the face of continual updates. It's reasonable to assume that if we hit holdoffMaxDelay, there have been continual updates for that whole period, so they will likely continue to arrive indefinitely. Doubling holdoffMaxDelay whenever we hit it should be safe, because even if holdoffMaxDelay has been extended many times, the holdoff timer will still fire after holdoffDelay, a much shorter duration, once updates stop arriving.
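A minimal sketch of that backoff, assuming hypothetical names (holdoffState, initialMaxDelay, onMaxDelayFired, and onDrained are illustrative, not taken from the Contour codebase): double the max delay whenever the forced update fires with updates still pending, and reset it once a batch drains.

```go
package holdoffsketch

import "time"

// initialMaxDelay stands in for the configured holdoff max delay.
const initialMaxDelay = 500 * time.Millisecond

// holdoffState tracks how far to push out the next forced update.
type holdoffState struct {
	maxDelay       time.Duration // max delay to use for the next batch
	pendingUpdates int           // updates seen since the last publish
}

// onNotify is called for each incoming change.
func (h *holdoffState) onNotify() { h.pendingUpdates++ }

// onMaxDelayFired is called when the max-delay timer expires and a forced
// update is published.
func (h *holdoffState) onMaxDelayFired() {
	if h.pendingUpdates > 0 {
		// Updates kept arriving for the whole max-delay window; assume they
		// will keep coming and back off the next forced update.
		h.maxDelay *= 2
	}
	h.pendingUpdates = 0
}

// onDrained is called when the short holdoff delay elapses with no further
// updates, i.e. the batch has finished draining.
func (h *holdoffState) onDrained() {
	h.maxDelay = initialMaxDelay
	h.pendingUpdates = 0
}
```

In this shape the growth is self-limiting: as soon as updates stop arriving for one holdoffDelay, the batch drains and the next one starts again from initialMaxDelay.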
Hello,
I've noodled with this for a few days and landed some cleanup PRs for 1.3; however, I have decided not to proceed any further with this work, for several reasons:
- After investigating the interaction between Envoy and Contour during startup, the following scenarios are likely:
  a. Contour starts before Envoy: Contour will have processed the updates from the API server before Envoy connects, so Envoy will see one update.
  b. Envoy starts before Contour: this is really scenario (a) in a different guise. Because Contour does not open its xDS listening socket until its informers have synced, Envoy, unless it is extremely lucky, will fail to connect and will enter a backoff retry loop. When it does connect, the observed behaviour looks like scenario (a).
- I wanted to extract the holdoff logic into its own type, which would take a set of notify messages as input and return a delay value. This would free tests of the holdoff logic from having to assert that something did not run for at least a given duration. Unfortunately I haven't found a nice API for this logic yet.
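For what it's worth, one possible shape for such a type, sketched here with hypothetical names (none of this is from the issue or the repository): a clock-free value that records notification times and answers how long to wait before publishing.

```go
package holdoffsketch

import "time"

// holdoff is a clock-free record of pending notifications.
type holdoff struct {
	delay    time.Duration // quiet period required after the last notification
	maxDelay time.Duration // upper bound measured from the first notification
	first    time.Time     // when the current batch started
	last     time.Time     // most recent notification
}

// Notify records an incoming change observed at time now.
func (h *holdoff) Notify(now time.Time) {
	if h.first.IsZero() {
		h.first = now
	}
	h.last = now
}

// Wait reports how long to wait from now before publishing: the remainder of
// the quiet period, capped by the remainder of the max delay, never negative.
func (h *holdoff) Wait(now time.Time) time.Duration {
	quiet := h.last.Add(h.delay).Sub(now)
	limit := h.first.Add(h.maxDelay).Sub(now)
	if limit < quiet {
		quiet = limit
	}
	if quiet < 0 {
		quiet = 0
	}
	return quiet
}
```

Because Wait depends only on the recorded timestamps, the timing behaviour could be unit tested with synthetic times rather than by asserting that something did not run for a duration.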
On balance I think this is something that is worthwhile doing, but I can't justify it now with the current issue load.
The Contour project currently lacks enough contributors to adequately respond to all Issues.
This bot triages Issues according to the following rules:
- After 60d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, the Issue is closed
You can:
- Mark this Issue as fresh by commenting
- Close this Issue
- Offer to help out with triage
Please send feedback to the #contour channel in the Kubernetes Slack