Contour support for Envoy's stats per route

Please describe the problem you have

In https://github.com/envoyproxy/envoy/issues/3351, @stevesloka suggested that Envoy expose metrics per vhost. This feature has now been released in Envoy v1.23 as route-stat-prefix (the PR is https://github.com/envoyproxy/envoy/pull/21302). Should Contour support it too?

Jul 25 '22 10:07 izturn

Is there any plan to support this feature?

Jul 27 '22 07:07 wilsonwu

Seems reasonable to support with a few considerations:

document which stats will be enabled (https://www.envoyproxy.io/docs/envoy/v1.23.0/configuration/http/http_filters/router_filter#config-http-filters-router-vcluster-stats)
document the resource impact this will have on each instance of Envoy (from the Envoy docs: "We do not recommend setting up a stat prefix for every application endpoint. This is both not easily maintainable and statistics use a non-trivial amount of memory (approximately 1KiB per route).")
do a bench/load test to see if it has a noticeable impact on the way Contour programs routes etc.
- this will mean
consider if this should be an opt-in feature
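
To put the quoted Envoy figure in perspective, here is a back-of-the-envelope sketch. It assumes only the "approximately 1KiB per route" number from the Envoy docs; the function name and the route counts are illustrative, and real usage will depend on how many stats each route actually emits.

```python
KIB = 1024  # bytes


def estimate_route_stats_memory(num_routes: int, bytes_per_route: int = KIB) -> int:
    """Return the estimated extra bytes used by per-route stats.

    bytes_per_route defaults to the ~1 KiB figure quoted from the Envoy docs;
    treat it as a rough lower bound, not a measurement.
    """
    if num_routes < 0:
        raise ValueError("num_routes must be non-negative")
    return num_routes * bytes_per_route


if __name__ == "__main__":
    # Illustrative route counts, not taken from any real deployment.
    for n in (100, 1_000, 10_000):
        extra = estimate_route_stats_memory(n)
        print(f"{n:>6} routes -> ~{extra / (KIB * KIB):.2f} MiB")
```

An estimate like this only gives an order of magnitude, which is why an actual load test is still listed as a separate consideration.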

We would definitely take community contributions to help speed this up, otherwise we've got this prioritized for 1.24.0 currently

Aug 04 '22 21:08 sunjayBhatia

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

After 60d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

Mark this Issue as fresh by commenting
Close this Issue
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

Oct 04 '22 00:10 github-actions[bot]

We are planning to test this feature and will post an update later.

Oct 08 '22 03:10 wilsonwu

Load test results are in.

We ran two rounds of load tests:

Round 1: 10k routes without vhost metrics
Round 2: 10k routes with vhost metrics

Test environment for both rounds: 1 instance of Envoy v1.23 on 4C4G (4 CPU cores, 4 GB of memory).

Test results:

Round 1 (without vhost metrics):
At Envoy startup: CPU 2%, memory 6% (almost 250 MB)
After sending requests to the 10k routes at random: CPU 400%, memory 8% (almost 350 MB)

Round 2 (with vhost metrics):
At Envoy startup: CPU 2% - 3%, memory 7.5% (almost 330 MB)
After sending requests to the 10k routes at random: CPU 400%, memory 10% (almost 450 MB)

I think the vhost metrics only increase Envoy's memory usage at startup; performance under load is fine.

I hope this test helps you make a decision.
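
For readers who want the difference between the two rounds spelled out, the following sketch does the arithmetic on the approximate figures reported in this comment. It assumes the memory unit is megabytes and that 1 MB = 1000 KB; the variable and function names are illustrative, and the per-route numbers are rough because the inputs are rounded readings.

```python
ROUTES = 10_000

# Approximate memory readings from the two rounds, assumed to be in MB.
without_stats = {"idle": 250, "under_load": 350}
with_stats = {"idle": 330, "under_load": 450}


def memory_delta(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for each measured phase."""
    return {phase: candidate[phase] - baseline[phase] for phase in baseline}


# Extra memory attributable to enabling the vhost/route metrics, in MB.
delta_mb = memory_delta(without_stats, with_stats)

# Same difference expressed per route, in KB (assuming 1 MB = 1000 KB).
per_route_kb = {phase: mb * 1000 / ROUTES for phase, mb in delta_mb.items()}

if __name__ == "__main__":
    for phase in delta_mb:
        print(f"{phase}: +{delta_mb[phase]} MB total, ~{per_route_kb[phase]:.1f} KB per route")
```

On these numbers the overhead is on the order of several KB per route, which is higher than the roughly 1 KiB per route mentioned in the Envoy docs, so the exact stat prefix setup used in the test matters when interpreting the result.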

Oct 25 '22 11:10 wilsonwu

Merry Christmas, everyone. If any further discussion is needed, let's keep it going.

Dec 25 '22 10:12 wilsonwu

Sorry for the lack of responses on this one, @wilsonwu. We will try to look at this again soon!

Feb 27 '23 21:02 sunjayBhatia

Thanks, Sunjay. If the test results are acceptable, we can move on to some design work.

Feb 28 '23 05:02 wilsonwu

Hi all, let's keep this going. @sunjayBhatia, is there any update on this?

Apr 13 '23 03:04 wilsonwu

@wilsonwu I'm going to add this to the 1.26 milestone for now and will plan to look at it after 1.25 is released at the end of this month.

Apr 13 '23 13:04 skriss

Good to hear that. We will start contributing to it.

Apr 13 '23 14:04 wilsonwu

Considering this feature has not been implemented yet, I wonder if there's an alternative way to monitor the aggregated traffic of an HTTPProxy/Ingress in Contour? Envoy metrics show the traffic of each backend pod, and I can't see an easy way to relate them to a specific HTTPProxy/Ingress object, especially if multiple HTTPProxy/Ingress objects point to the same service/pods.

Jun 26 '23 11:06 alibo

@wilsonwu sorry this is so late but when doing the experiment above, did you use a static stat prefix for all routes associated with a virtualhost or do something similar to what is described here: https://github.com/projectcontour/contour/pull/5535#issuecomment-1634646647 ? Naively I'm thinking a static stat prefix would have less resource impact and also not offer the granularity needed to actually differentiate the stats between different routes on a route/upstream
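
To make the trade-off in this question concrete, here is a small sketch contrasting the two approaches. It builds plain dictionaries shaped like Envoy route entries with a `stat_prefix` field and counts how many distinct prefixes (and therefore distinct sets of route stats) each strategy would produce. The helper names, the cluster name, and the `<vhost>_route_<index>` naming scheme are all hypothetical; this is not necessarily the scheme described in the linked PR comment, and it is not Contour's implementation.

```python
def build_routes(vhost: str, paths: list[str], per_route: bool) -> list[dict]:
    """Build route entries using either a static or a per-route stat prefix.

    per_route=False: every route on the virtual host shares one prefix.
    per_route=True:  each route gets its own prefix (hypothetical scheme).
    """
    routes = []
    for index, path in enumerate(paths):
        prefix = f"{vhost}_route_{index}" if per_route else vhost
        routes.append(
            {
                "match": {"prefix": path},
                "route": {"cluster": "example_cluster"},  # placeholder cluster name
                "stat_prefix": prefix,
            }
        )
    return routes


def distinct_prefixes(routes: list[dict]) -> int:
    """Count the distinct stat prefixes across a list of route entries."""
    return len({route["stat_prefix"] for route in routes})


if __name__ == "__main__":
    paths = ["/", "/api", "/static"]
    static_routes = build_routes("example_vhost", paths, per_route=False)
    unique_routes = build_routes("example_vhost", paths, per_route=True)
    print("static prefix  :", distinct_prefixes(static_routes), "stat group(s)")
    print("per-route prefix:", distinct_prefixes(unique_routes), "stat group(s)")
```

With a static prefix the number of stat groups stays constant per virtual host, while a per-route prefix grows linearly with the number of routes, which is where both the extra granularity and the extra memory would come from.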

Jul 13 '23 17:07 sunjayBhatia

Sorry for the late reply. I have already commented in the PR.

Jul 31 '23 03:07 wilsonwu

Per the ongoing discussion on the related PR, it looks like this will slip to 1.27.0.

Aug 09 '23 00:08 sunjayBhatia