Contour support for Envoy's stats per route

Open izturn opened this issue 3 years ago • 18 comments

Please describe the problem you have

In https://github.com/envoyproxy/envoy/issues/3351, @stevesloka suggested that Envoy expose metrics per vhost. This feature has now been released with Envoy v1.23 as route-stat-prefix (the PR is https://github.com/envoyproxy/envoy/pull/21302). Should Contour support it too?

izturn avatar Jul 25 '22 10:07 izturn

Is there any plan to support this feature?

wilsonwu avatar Jul 27 '22 07:07 wilsonwu

Seems reasonable to support with a few considerations:

  • document which stats will be enabled (https://www.envoyproxy.io/docs/envoy/v1.23.0/configuration/http/http_filters/router_filter#config-http-filters-router-vcluster-stats)
  • we should document the resource impact this will have on each instance of Envoy (from the Envoy docs: "We do not recommend setting up a stat prefix for every application endpoint. This is both not easily maintainable and statistics use a non-trivial amount of memory (approximately 1KiB per route).")
  • do a bench/load test to see if it has a noticeable impact given the way Contour programs routes etc.
    • this will mean
  • consider if this should be an opt-in feature
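
For a rough sense of scale, the Envoy docs figure quoted above (approximately 1KiB per route) can be turned into a back-of-the-envelope estimate. This is only a sketch: the helper name and the route counts are illustrative assumptions, and actual memory usage would still need to be measured.

```python
# Back-of-the-envelope estimate using the Envoy docs' figure of
# approximately 1 KiB of stats memory per route that has a stat prefix.
BYTES_PER_ROUTE = 1024  # approximate figure quoted from the Envoy docs


def estimated_stats_memory_mib(route_count: int) -> float:
    """Estimated extra stats memory in MiB (hypothetical helper)."""
    return route_count * BYTES_PER_ROUTE / (1024 * 1024)


# Illustrative route counts, not taken from any real deployment.
for routes in (100, 1_000, 10_000):
    print(f"{routes} routes -> ~{estimated_stats_memory_mib(routes):.2f} MiB")
```

By this estimate the cost grows linearly with the number of routes, which is one argument for making the feature opt-in.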

We would definitely take community contributions to help speed this up, otherwise we've got this prioritized for 1.24.0 currently

sunjayBhatia avatar Aug 04 '22 21:08 sunjayBhatia

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions[bot] avatar Oct 04 '22 00:10 github-actions[bot]

We are planning to test this feature and will post an update later.

wilsonwu avatar Oct 08 '22 03:10 wilsonwu

Load test results:

We ran two rounds of load tests:

  1. 10k routes without vhost metrics
  2. 10k routes with vhost metrics

The test environment for both rounds was 1 instance of Envoy v1.23 on 4C4G.

Test results:

The 1st round (without vhost metrics):

  • At Envoy startup: CPU 2%, memory 6% (about 250m)
  • After sending requests to the 10k routes at random: CPU 400%, memory 8% (about 350m)

The 2nd round (with vhost metrics):

  • At Envoy startup: CPU 2% - 3%, memory 7.5% (about 330m)
  • After sending requests to the 10k routes at random: CPU 400%, memory 10% (about 450m)

I think the vhost metrics mainly increase Envoy's memory usage at startup; performance under load is fine.
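
The difference between the two rounds can be worked out directly from the figures reported above. This is a minimal sketch that assumes the "m" values are megabytes and treats the reported numbers as approximate; the variable and function names are illustrative.

```python
# Memory figures as reported for the two test rounds (approximate, in "m").
ROUTES = 10_000
without_metrics = {"startup": 250, "under_load": 350}
with_metrics = {"startup": 330, "under_load": 450}


def overhead(phase: str) -> int:
    """Extra memory with vhost metrics enabled for the given phase."""
    return with_metrics[phase] - without_metrics[phase]


for phase in ("startup", "under_load"):
    extra = overhead(phase)
    # Assumption: "m" means MB, and 1 MB is counted as 1000 KB.
    per_route_kb = extra * 1000 / ROUTES
    print(f"{phase}: +{extra}m total, ~{per_route_kb:.0f} KB per route")
```

On these numbers the extra memory is about 80m at startup and 100m under load, or roughly 8 to 10 KB per route. These are coarse readings, so the per-route figure is only indicative.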

I hope this test helps you make a decision.

wilsonwu avatar Oct 25 '22 11:10 wilsonwu

Merry Christmas, everyone. If any further discussion is needed, let's keep going.

wilsonwu avatar Dec 25 '22 10:12 wilsonwu

Sorry for the lack of responses on this one, @wilsonwu. I will try to look at this again soon!

sunjayBhatia avatar Feb 27 '23 21:02 sunjayBhatia

Thanks, Sunjay. If the test results are acceptable, we can move on to some design work.

wilsonwu avatar Feb 28 '23 05:02 wilsonwu

Hi all, let's keep this moving. @sunjayBhatia, is there any update on this?

wilsonwu avatar Apr 13 '23 03:04 wilsonwu

@wilsonwu I'm going to add this to the 1.26 milestone for now and will plan to look at it after 1.25 is released at the end of this month.

skriss avatar Apr 13 '23 13:04 skriss

Good to hear that. We will start contributing to it.

wilsonwu avatar Apr 13 '23 14:04 wilsonwu

Considering this feature has not been implemented yet, I wonder if there is an alternative way to monitor the aggregated traffic of an HTTPProxy/Ingress in Contour? Envoy metrics show the traffic of each backend pod, and I can't see an easy way to relate them to a specific HTTPProxy/Ingress object, especially if multiple HTTPProxy/Ingress objects point to the same service/pods.

alibo avatar Jun 26 '23 11:06 alibo

@wilsonwu sorry this is so late but when doing the experiment above, did you use a static stat prefix for all routes associated with a virtualhost or do something similar to what is described here: https://github.com/projectcontour/contour/pull/5535#issuecomment-1634646647 ? Naively I'm thinking a static stat prefix would have less resource impact and also not offer the granularity needed to actually differentiate the stats between different routes on a route/upstream
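
The trade-off raised in this question can be illustrated by counting how many distinct stat prefixes each approach would create. This is a minimal sketch with hypothetical virtual host names, route paths, and prefix naming; it is not how Contour or the linked PR generates prefixes.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical route table: virtual host -> list of route path prefixes.
ROUTES: Dict[str, List[str]] = {
    "foo.example.com": ["/", "/api", "/static"],
    "bar.example.com": ["/", "/login"],
}


def static_prefixes(routes: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """One shared stat prefix for all routes of a virtual host."""
    return {(vhost, "all") for vhost in routes}


def per_route_prefixes(routes: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """One distinct stat prefix for every route of every virtual host."""
    return {(vhost, path) for vhost, paths in routes.items() for path in paths}


print("static prefixes:", len(static_prefixes(ROUTES)))
print("per-route prefixes:", len(per_route_prefixes(ROUTES)))
```

With a static prefix the number of stat groups grows with the number of virtual hosts only, but all routes of a host share one set of counters. With per-route prefixes it grows with the number of routes, which is where the per-route memory cost comes from.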

sunjayBhatia avatar Jul 13 '23 17:07 sunjayBhatia

Sorry for the late reply; I have already commented on the PR.

wilsonwu avatar Jul 31 '23 03:07 wilsonwu

per the ongoing discussion on the related PR, looks like this will slip to 1.27.0

sunjayBhatia avatar Aug 09 '23 00:08 sunjayBhatia