
Add envoy stat_prefix support for httpproxy

Open therealak12 opened this issue 1 year ago • 55 comments

This pull request addresses issue #4637. I'm not sure whether this is a perfect implementation, but I thought receiving feedback would help me implement it more properly.

Currently it adds a field in the envoy stanza of the Contour configuration which, if set, is also set in the Envoy route config.

Another possible implementation would be to add a field to the HTTPProxy spec, allowing users to set a different StatPrefix per route.

Also, as a third option, we could set resource.name._resource.kind_resource.namespace as the stat_prefix!

therealak12 avatar Jun 29 '23 19:06 therealak12

Hi @therealak12! Welcome to our community and thank you for opening your first Pull Request. Someone will review it soon. Thank you for committing to making Contour better. You can also join us on our mailing list and in our channel in the Kubernetes Slack Workspace

github-actions[bot] avatar Jun 29 '23 19:06 github-actions[bot]

[Screenshot from 2023-06-29 17-51-42]

Here is a demo of the current implementation. There were two HTTPProxies in the cluster at the time, with example.com and example.com.ir hosts.

therealak12 avatar Jul 01 '23 06:07 therealak12

I prefer this to https://github.com/projectcontour/contour/pull/5539

izturn avatar Jul 04 '23 03:07 izturn

> Do we think users will need to configure this per-route stat prefix? Is there any value in making this configurable? If not, we can probably get away with just adding a boolean flag and configuring the stat prefix using some details of the Envoy route (hashing the route match parameters, for example) to ensure uniqueness and the ability to differentiate between different routes.
>
> In general this seems to achieve what #4637 requires: an opt-in ability to signal Envoy to collect per-route upstream stats.

This PR does something like what you've explained.

therealak12 avatar Jul 13 '23 17:07 therealak12
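The "hashing the route match parameters" idea discussed above could be sketched with the Go standard library. Note this is a hedged illustration, not Contour's actual code: the function name and parameters are hypothetical, and Contour's match-condition types are richer than plain strings.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashedStatPrefix derives a short, stable identifier for a route by
// hashing its match parameters. Any change to the match parameters
// yields a different prefix, which gives uniqueness between routes.
func hashedStatPrefix(pathMatch string, headerMatches []string) string {
	h := sha256.New()
	h.Write([]byte(pathMatch))
	h.Write([]byte(strings.Join(headerMatches, "\x00")))
	// Eight hex characters are usually enough to tell routes apart.
	return hex.EncodeToString(h.Sum(nil))[:8]
}

func main() {
	fmt.Println(hashedStatPrefix("/foo", nil))
	fmt.Println(hashedStatPrefix("/bar", nil))
}
```

As noted later in this thread, such a prefix is unique but opaque: an operator cannot map it back to a route without a lookup table.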

> Do we think users will need to configure this per-route stat prefix? Is there any value in making this configurable? If not, we can probably get away with just adding a boolean flag and configuring the stat prefix using some details of the Envoy route (hashing the route match parameters, for example) to ensure uniqueness and the ability to differentiate between different routes. In general this seems to achieve what #4637 requires: an opt-in ability to signal Envoy to collect per-route upstream stats.
>
> This PR does something like what you've explained.

I think it would be useful to modify that PR to use a generated stat prefix that differentiates between Envoy routes (maybe in addition to, or instead of, the existing stat prefix that differentiates between originating resources) and to test it out with a few different HTTPProxies/Ingresses/HTTPRoutes with multiple route rules.

(I would guess that using different stat prefixes could also expand the memory footprint that was analysed in https://github.com/projectcontour/contour/issues/4637, but we would have to confirm to be sure.)

cc @izturn @wilsonwu from #4637: is the goal to see stats between different route matches (e.g. the upstream stats for requests to /foo vs. /bar)? If so, I don't think the existing PRs do that.

sunjayBhatia avatar Jul 13 '23 17:07 sunjayBhatia

@sunjayBhatia Does it look good to you?

[screenshot of the resulting Envoy stats]

Please don't worry about the prefix section for now; it's just a temporary placeholder I put in place to make the PoC work. I will improve the String() method if the approach is accepted.

These are the stats for an httpproxy with the following spec:

routes:
- conditions:
  - prefix: /etc
  permitInsecure: true
  services:
  - name: python-http-server
    port: 8000
- conditions:
  - prefix: /tmp
  permitInsecure: true
  services:
  - name: python-http-server
    port: 8000
virtualhost:
  fqdn: some-random-path.apps.private.okd4.ts-2.staging-snappcloud.io

It sets StatPrefix to a string built like this:

fmt.Sprintf("%s_%s_%s_%s", httpRoute.GetNamespace(), "httproute", httpRoute.GetName(), r.PathMatchCondition.String())
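For illustration, here is a self-contained version of that call. The exact rendering of PathMatchCondition.String() is internal to Contour, so a "prefix:/etc"-style value is assumed here; the namespace and route name are likewise hypothetical.

```go
package main

import "fmt"

// buildStatPrefix mirrors the fmt.Sprintf call quoted above for an
// HTTPRoute: namespace, fixed resource kind, name, then the rendered
// path match condition.
func buildStatPrefix(namespace, name, matchCondition string) string {
	return fmt.Sprintf("%s_%s_%s_%s", namespace, "httproute", name, matchCondition)
}

func main() {
	// prints default_httproute_my-route_prefix:/etc
	fmt.Println(buildStatPrefix("default", "my-route", "prefix:/etc"))
}
```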

therealak12 avatar Jul 14 '23 14:07 therealak12

This seems more usable and now does separate the stats for different route rules.

We'll have to figure out a stat prefix we can generate that expresses uniqueness but can also be used by an operator to figure out what route they're looking at (i.e. not just an opaque hash).

Would love to get @izturn's and @wilsonwu's thoughts since they originally requested this.

sunjayBhatia avatar Jul 14 '23 14:07 sunjayBhatia

> This seems more usable and now does separate the stats for different route rules.
>
> We'll have to figure out a stat prefix we can generate that can express uniqueness, but also can be used by an operator to figure out what route they're looking at (e.g. not just an opaque hash).
>
> Would love to get @izturn's and @wilsonwu's thoughts since they originally requested this.

We had a conversation about the hash, which I've replied to here: https://kubernetes.slack.com/archives/C8XRH2R4J/p1689159578352139?thread_ts=1688670204.661949&cid=C8XRH2R4J

It seems the 60-character limit can now be safely removed, and then we'll have the full hostname available in the stats.

therealak12 avatar Jul 14 '23 18:07 therealak12

One point to add here: it's not trivial to include the matchCondition in the stats.

There are various cases such as regex, exact, and prefix match. To distinguish between two routes, one with exactMatch set to /sample and the other with prefix set to /sample, we would have to include the type of the match condition in the stat too.

Besides, there are query and header match conditions too.

A possible format for the stat_prefix would be:

namespace_ingressResourceType_ingressResourceName_matchType_matchValue

where matchType can be one of regex, prefix, or exact.

Although this supports all three match types, it doesn't take header or query matches into account. Trying to put those in the prefix makes it too complex and arguably useless.

It would be nice if we agreed on a format before going further.

therealak12 avatar Jul 16 '23 17:07 therealak12
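The namespace_ingressResourceType_ingressResourceName_matchType_matchValue format proposed above could be sketched like this. This is a hedged illustration under stated assumptions: Contour's real match-condition types are richer, and the slash sanitization shown here is an invented detail, not something the PR specifies.

```go
package main

import (
	"fmt"
	"strings"
)

// MatchType distinguishes the three path-match flavors discussed above.
type MatchType string

const (
	MatchExact  MatchType = "exact"
	MatchPrefix MatchType = "prefix"
	MatchRegex  MatchType = "regex"
)

// statPrefix builds namespace_resourceType_resourceName_matchType_matchValue.
// Slashes in the match value are rewritten so the value stays a single
// stat-name segment; header and query matches are deliberately left out,
// as in the proposal.
func statPrefix(namespace, resourceType, resourceName string, mt MatchType, matchValue string) string {
	sanitized := strings.ReplaceAll(strings.TrimPrefix(matchValue, "/"), "/", ".")
	return fmt.Sprintf("%s_%s_%s_%s_%s", namespace, resourceType, resourceName, mt, sanitized)
}

func main() {
	// Two routes that differ only in match type remain distinguishable.
	fmt.Println(statPrefix("default", "httpproxy", "demo", MatchExact, "/sample"))
	fmt.Println(statPrefix("default", "httpproxy", "demo", MatchPrefix, "/sample"))
}
```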

> It's not that trivial to include the matchCondition in the stats.

Yep, agreed, especially as there can be multiple match conditions in a route, and these conditions can be complex to serialize into a string that isn't incredibly long.

Some notes for discussion:

  • We could generate a name (based on the originating k8s resource name/route index, or on a hash of the match conditions) for each Envoy Route (see the field here), which we do not do today, and also use it as the stat prefix; users could then look it up in their HTTPProxy/Ingress or in an Envoy config dump to see which stats line up with which route
  • This somewhat goes against Contour's goal of abstracting Envoy-isms from the user, but we could make the Route stat prefix configurable in HTTPProxy (which would mean Ingress/Gateway API HTTPRoute don't have this option)
  • It really depends on what users will actually find useful: do we need the granularity of per-Route-rule stats, or is it enough to know which HTTPProxy/Ingress/HTTPRoute stats may come from?

sunjayBhatia avatar Jul 17 '23 20:07 sunjayBhatia

@therealak12 , thx for your pr, i am ok with it

izturn avatar Jul 19 '23 02:07 izturn

> @therealak12 , thx for your pr, i am ok with it

Can you clarify a bit more, @izturn: how do you intend to use the new stats? Specifically around the last bullet here: https://github.com/projectcontour/contour/pull/5535#issuecomment-1638789265

sunjayBhatia avatar Jul 20 '23 17:07 sunjayBhatia

Sorry for the late reply. I think per-route stats are enough for our current situation; if there are use cases for per-virtualhost stats, having the metrics system calculate them is OK.

wilsonwu avatar Jul 31 '23 03:07 wilsonwu

@sunjayBhatia Sorry for the late reply. As @wilsonwu said, I think per-route stats are enough for our use case.

izturn avatar Aug 07 '23 02:08 izturn

As @sunjayBhatia pointed out above, per-route metrics sit at the highest level of the granularity pyramid:

> do we need the granularity of per-Route rule stats

So let's discuss whether per-virtualhost metrics are enough or whether we should agree on a format for presenting per-route metrics. In our use case, we are happy with including only the prefix from matchConditions. In general, there are several matchConditions that can differentiate routes, and including all of them results in excessively lengthy metric names.

Your comments will be crucial to reaching the best decision. Thanks in advance.

@wilsonwu @izturn

therealak12 avatar Aug 07 '23 06:08 therealak12

> As @sunjayBhatia pointed out above, per-route metrics sit at the highest level of the granularity pyramid:
>
> > do we need the granularity of per-Route rule stats
>
> So let's discuss whether per-virtualhost metrics are enough or whether we should agree on a format for presenting per-route metrics. In our use case, we are happy with including only the prefix from matchConditions. In general, there are several matchConditions that can differentiate routes, and including all of them results in excessively lengthy metric names.
>
> Your comments will be crucial to reaching the best decision. Thanks in advance.
>
> @wilsonwu @izturn

Thanks for this information. In our use case, we run Envoy as an API gateway, so users pay more attention to the API. As in my earlier reply, I think that if we have per-route metrics, the per-virtualhost and per-Envoy metrics can be calculated from them; that said, supporting per-virtualhost metrics directly is better.

wilsonwu avatar Aug 08 '23 02:08 wilsonwu

@sunjayBhatia Based on the discussion, it seems providing per-virtual-host metrics is enough. I'd like to hear your thoughts on this.

therealak12 avatar Aug 08 '23 12:08 therealak12

> @sunjayBhatia Based on the discussion, it seems providing per-virtual-host metrics is enough. I'd like to hear your thoughts on this.

to summarize:

  • it looks like the "per route" metrics that are important to @wilsonwu/@izturn are in fact "per vhost", so aggregating request metrics needs to be done with that in mind when making up the stat prefix
  • the new stats will be rooted at vhost.<virtual host name>.route.<stat_prefix>, so a static stat_prefix seems sufficient, since we don't need to add any further granularity
  • to be conservative and make this an additive change, the way users opt in to these new stats could be a boolean, or we could let users configure the route stat_prefix; we can discuss the merits of either approach, but I'm leaning towards a bool opt-in to require less validation etc.
    • furthermore, is it sufficient to have this as a global opt-in, or should it be per-HTTPProxy? I would argue for a global config, since this is an operator-level setting that could affect Envoy's performance
  • as an aside, if we did want to add some additional granularity we could (e.g. generate the stat prefix based on the resource that originated the Route), and users who don't care about that additional detail could aggregate the statistics themselves (though we would have to think about that UX in more detail)

sunjayBhatia avatar Aug 08 '23 20:08 sunjayBhatia
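To make the vhost.<virtual host name>.route.<stat_prefix> rooting in the summary above concrete, here is a sketch of how a resulting stat name would look. The upstream_rq_total suffix is a standard Envoy router stat; the vhost name and stat prefix values are illustrative assumptions.

```go
package main

import "fmt"

// routeStatName assembles the stat name Envoy emits for a route once a
// stat_prefix is set, per the vhost.<name>.route.<stat_prefix> rooting
// described above.
func routeStatName(vhost, statPrefix, stat string) string {
	return fmt.Sprintf("vhost.%s.route.%s.%s", vhost, statPrefix, stat)
}

func main() {
	// prints vhost.example.com.route.default_my-proxy.upstream_rq_total
	fmt.Println(routeStatName("example.com", "default_my-proxy", "upstream_rq_total"))
}
```

Note that virtual host names themselves contain dots, which is worth keeping in mind when parsing or aggregating these stat names downstream.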

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions[bot] avatar Aug 23 '23 00:08 github-actions[bot]

hey @therealak12, thx 4 ur work, any progress?

izturn avatar Aug 23 '23 03:08 izturn

> hey @therealak12, thx 4 ur work, any progress?

Hey. I'm busy these days. I'll update this thread soon.

We need to agree on some design decisions first.

  • I agree with having a boolean to enable/disable the stat_prefix. However, I think a global one is better than a per-resource field, since it fits our requirements better. This is the first thing to decide.
  • Also, I think including the name of the resource in the stat_prefix is better than not doing so. This is the second decision.

If these decisions are accepted, I'll update this PR or create a new one.

therealak12 avatar Aug 23 '23 06:08 therealak12

> > hey @therealak12, thx 4 ur work, any progress?
>
> Hey. I'm busy these days. I'll update this thread soon.
>
> We need to agree on some design decisions first.
>
> • I agree with having a boolean to enable/disable the stat_prefix. However, I think a global one is better than adding a per-resource field. The reason is that it fits our requirements better. This is the first thing to decide.
> • Also, I think including the name of the resource in the stat_prefix is fine and better than not doing that. This is the second decision.
>
> If these decisions are accepted, I'll update this MR or create a new one.

  • Agreed on the global enable/disable for stat_prefix.
  • As for including a resource name or not: I think the current approach is OK; there has been no further call for this requirement, so I'm fine with it.

wilsonwu avatar Aug 23 '23 08:08 wilsonwu


So I've just force-pushed a new commit. If the enableStatPrefix configuration is true, it sets stat_prefix to <resource_namespace>_<resource_name>.

This actually resembles the idea implemented in PR #5539, but as that PR was closed, I preferred to rewrite the changes in this one.

therealak12 avatar Sep 12 '23 14:09 therealak12
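A minimal sketch of the gating just described, assuming a boolean enableStatPrefix option and the <resource_namespace>_<resource_name> format. The function and parameter names are illustrative, not Contour's actual config fields or internals.

```go
package main

import "fmt"

// statPrefixFor returns the per-route stat_prefix when the feature is
// enabled, or "" when it is disabled; with an empty stat_prefix Envoy
// emits no per-route upstream stats, keeping the change purely additive.
func statPrefixFor(enableStatPrefix bool, namespace, name string) string {
	if !enableStatPrefix {
		return ""
	}
	return fmt.Sprintf("%s_%s", namespace, name)
}

func main() {
	// prints default_my-proxy
	fmt.Println(statPrefixFor(true, "default", "my-proxy"))
}
```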


@therealak12 any progress?

izturn avatar Oct 19 '23 02:10 izturn

> @therealak12 any progress?

This is actually ready for review. I'll resolve merge conflicts soon.

therealak12 avatar Oct 19 '23 07:10 therealak12

Codecov Report

Attention: Patch coverage is 20.00000%, with 20 lines in your changes missing coverage. Please review.

Project coverage is 78.62%. Comparing base (89d2856) to head (fdf3ab4). Report is 227 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5535      +/-   ##
==========================================
- Coverage   78.70%   78.62%   -0.08%     
==========================================
  Files         138      138              
  Lines       19717    19742      +25     
==========================================
+ Hits        15518    15523       +5     
- Misses       3895     3909      +14     
- Partials      304      310       +6     
| Files | Coverage Δ |
| --- | --- |
| cmd/contour/servecontext.go | 83.88% <100.00%> (+0.03%) ⬆️ |
| internal/contourconfig/contourconfiguration.go | 98.19% <100.00%> (+0.01%) ⬆️ |
| internal/dag/dag.go | 98.70% <ø> (ø) |
| pkg/config/parameters.go | 86.43% <100.00%> (+0.04%) ⬆️ |
| cmd/contour/serve.go | 19.62% <50.00%> (+0.12%) ⬆️ |
| internal/dag/ingress_processor.go | 96.44% <0.00%> (-1.16%) ⬇️ |
| internal/envoy/v3/route.go | 80.17% <0.00%> (-0.31%) ⬇️ |
| internal/dag/gatewayapi_processor.go | 93.38% <0.00%> (-0.36%) ⬇️ |
| internal/dag/httpproxy_processor.go | 91.54% <0.00%> (-0.38%) ⬇️ |

codecov[bot] avatar Oct 19 '23 08:10 codecov[bot]

It's ready now. @izturn

therealak12 avatar Oct 19 '23 08:10 therealak12