Added support for overload manager
This change adds minimal support for Envoy's overload manager to avoid cases where the Envoy process is terminated by the out-of-memory killer, which results in traffic disturbances.
This PR proposes that the administrator can (optionally) configure the maximum amount of heap that Envoy is allowed to reserve. It does not allow the overload actions to be added or configured in any way. Instead, it configures default actions which are set according to the example in the "Configuring Envoy as an edge proxy" best practices doc: the `shrink_heap` action is executed when 95%, and the `stop_accepting_requests` action when 98%, of the configured maximum heap is reached.
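For illustration, the default actions described above correspond roughly to an Envoy bootstrap fragment like the following. This is a sketch following the Envoy edge-proxy best practices doc, not the exact output of `contour bootstrap`; the 2 GiB `max_heap_size_bytes` is an arbitrary example value.

```yaml
overload_manager:
  resource_monitors:
    - name: "envoy.resource_monitors.fixed_heap"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648   # example: 2 GiB
  actions:
    - name: "envoy.overload_actions.shrink_heap"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.95
    - name: "envoy.overload_actions.stop_accepting_requests"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.98
```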
The configuration of the maximum heap is (unfortunately) again a new command line flag. The same has been true for all other bootstrap parameters so far. The reasoning is that `contour bootstrap` is executed inside the Envoy init container, where we do not have the Contour config file or the capability to read the ContourConfiguration CR from the API server. Or at least we have not done that so far.
There is a major conflict between the overload manager and how we expose `/ready` and `/stats` by setting up a proxy to serve these endpoints! While the real admin API at `/admin/admin.sock` still works during overload, requests via the "proxied" versions of `/ready` and `/stats` served over a TCP socket will be rejected while `stop_accepting_requests` is active. As a result, Envoy will be removed from the endpoints of the service. Since the Envoy instance was not accepting new requests anyway, maybe this will not make the overload any worse for the other Envoys. However, another side effect is that the administrator cannot monitor the Envoy instance anymore, since the stats endpoint will not be served either. The memory-related metrics would be of particular interest here: they would show that the `stop_accepting_requests` action is active, along with the heap numbers explaining why. When an admin sets the max heap too low, they will not be able to find that out by checking metrics - since metrics are not served due to the heap being low :thinking:
The feature itself seems very useful as it can avoid the OOM killer, but I'd like to hear your opinion about the limitations.
Fixes #1794
Signed-off-by: Tero Saarni [email protected]
As a workaround, I found the following commands helpful for accessing the "real" admin API when the "proxied" admin API endpoints are rejecting requests due to overload:
sudo curl --silent --unix-socket /proc/$(pidof envoy)/root/admin/admin.sock http://localhost/stats | grep -E "^overload|^server.memory"
sudo curl --silent --unix-socket /proc/$(pidof envoy)/root/admin/admin.sock http://localhost/memory # tcmalloc metrics
These need to be executed on the worker node, or directly on the dev host when running Kind.
Envoy's `fixed_heap` monitor uses the tcmalloc metrics at `/memory` and the following formula to calculate the overload percentage: `(heap_size - pageheap_unmapped) / maximum_heap`
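That calculation can be sketched in Go as follows. This is illustrative code, not part of the PR or Envoy; the function name and example inputs are made up for the demonstration:

```go
package main

import "fmt"

// overloadPercent mirrors the fixed_heap monitor's formula:
// (heap_size - pageheap_unmapped) / maximum_heap.
// heapSize and pageheapUnmapped come from tcmalloc's /memory metrics;
// tcmalloc guarantees pageheap_unmapped <= heap_size.
func overloadPercent(heapSize, pageheapUnmapped, maxHeap uint64) float64 {
	if maxHeap == 0 || pageheapUnmapped > heapSize {
		return 0
	}
	return float64(heapSize-pageheapUnmapped) / float64(maxHeap)
}

func main() {
	// Example: ~1.9 GiB of in-use heap against a 2 GiB configured maximum
	// crosses the 0.95 shrink_heap threshold but not the 0.98
	// stop_accepting_requests threshold.
	p := overloadPercent(2040109466, 0, 2147483648)
	fmt.Printf("%.3f\n", p)
}
```

Note that unmapped pages are subtracted: heap that tcmalloc has released back to the OS does not count against the configured maximum.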
Codecov Report
Merging #4597 (f10ff6b) into main (d8553a8) will increase coverage by 0.14%. The diff coverage is 97.50%.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4597      +/-   ##
==========================================
+ Coverage   76.08%   76.23%   +0.14%
==========================================
  Files         140      140
  Lines       13073    13147      +74
==========================================
+ Hits         9947    10023      +76
+ Misses       2872     2871       -1
+ Partials      254      253       -1
Impacted Files | Coverage Δ | |
---|---|---|
cmd/contour/bootstrap.go | 0.00% <0.00%> (ø) | |
internal/envoy/bootstrap.go | 55.88% <ø> (ø) | |
internal/envoy/v3/bootstrap.go | 94.33% <100.00%> (+0.82%) | :arrow_up: |
internal/dag/dag.go | 95.53% <0.00%> (ø) | |
internal/envoy/v3/route.go | 73.95% <0.00%> (+0.21%) | :arrow_up: |
internal/sorter/sorter.go | 98.79% <0.00%> (+0.60%) | :arrow_up: |
internal/dag/httpproxy_processor.go | 92.80% <0.00%> (+0.65%) | :arrow_up: |
I think in cases where the heap is low, losing metrics is definitely less of a big deal than being OOM-killed. It seems like this is about as good a compromise as we're going to be able to get, sadly.
It's unfortunate that we have to make the heap size a bootstrap cmdline param, but I don't see any other way to do it.
I also think that this feature has to come with a bunch of warnings about being careful with your sizing, making sure that it matches up with any Pod requests and limits you've put on Envoy, and so on.
I'll give the PR a more detailed review soon, sorry about the delay @tsaarni.
Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.
I will come back with some documentation shortly.
Rebased, and the missing documentation `site/content/docs/main/config/overload-manager.md` has been added.
Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.
This PR is ready for review.
Sorry for the delay on this @tsaarni, planning to take a look soon!
Thanks @sunjayBhatia for the review!
this and the below might need an update to match the bootstrap config (looks like 90% and 98% for these two actions)
Thanks for spotting this! I went the other direction and changed the bootstrap config to match the documentation, since the values came from the Envoy best practices document. I think I had no real reason to use 90% instead of 95%.
If there are no further questions, I'll merge this tomorrow.
@skriss Thank you for the review!
Just a couple tiny things but this looks good to me. The issue with the admin endpoint is unfortunate, and maybe we can do some more thinking there on if there's something else we can do, but I don't think it needs to block getting the initial PR in, seems like a net improvement.
I agree. I could not figure out anything that could be done on the Contour side, except removing the proxy from the admin API, but that is there for a reason. The overload manager cannot be applied to selected named listeners only, nor the other way around: it cannot be configured to ignore certain listeners...