api: add total_{active,new,error}_connections to UpstreamLocalityStats
The new fields are marked with [#not-implemented-hide:] and work_in_progress per the guide.
Implementation will be added later in a separate PR.
Rationale: The current version of the LRS protocol uses requests as the unit of load. L4 load balancers don't have information on the number of requests, but they can approximate it by the number of new and active connections.
Envoy already collects and exposes some of the statistics that can be used for the implementation of the new fields:
total_new_connections ~ upstream_cx_total(now) - upstream_cx_total(now - load_report_interval)
total_active_connections ~ upstream_cx_active
total_error_connections ~ upstream_cx_connect_timeout + upstream_cx_idle_timeout + ...
Please note that the linked stats are aggregated per cluster, while the new fields will correspond to UpstreamLocality, so the actual implementation will be more nuanced.
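To make the approximation concrete, here is a minimal C++ sketch of deriving the proposed values from two snapshots of the cluster stats taken one load_report_interval apart; the ConnStatsSnapshot and computeLoadReportConnStats names are hypothetical, not part of Envoy's actual stats API:

```c++
#include <cstdint>

// Hypothetical snapshot of the relevant cluster stats at a single instant.
struct ConnStatsSnapshot {
  uint64_t upstream_cx_total;           // cumulative counter
  uint64_t upstream_cx_active;          // gauge
  uint64_t upstream_cx_connect_timeout; // cumulative counter
  uint64_t upstream_cx_idle_timeout;    // cumulative counter
};

// Values matching the proposed UpstreamLocalityStats fields for one report.
struct LoadReportConnStats {
  uint64_t total_new_connections;
  uint64_t total_active_connections;
  uint64_t total_error_connections;
};

// Counters are monotonic, so deltas between snapshots give per-interval
// values; the active-connections gauge is read directly from the newer
// snapshot.
LoadReportConnStats computeLoadReportConnStats(const ConnStatsSnapshot& prev,
                                               const ConnStatsSnapshot& now) {
  LoadReportConnStats out;
  out.total_new_connections = now.upstream_cx_total - prev.upstream_cx_total;
  out.total_active_connections = now.upstream_cx_active;
  out.total_error_connections =
      (now.upstream_cx_connect_timeout - prev.upstream_cx_connect_timeout) +
      (now.upstream_cx_idle_timeout - prev.upstream_cx_idle_timeout);
  return out;
}
```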
Hi @AwesomePatrol, welcome and thank you for your contribution.
We will try to review your Pull Request as quickly as possible.
In the meantime, please take a look at the contribution guidelines if you have not done so already.
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @abeyad
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).
This should also make sense for TCP proxy in Envoy, right? The motivation comes from non-Envoy xDS use cases, I believe, but this should be implementable and usable for our own L4 LB needs ideally.
Yes, I think it should work as expected; the stats in question are on the cluster, and I believe they're all set correctly in the tcp-proxy case.
Note that when implementing this, because the stats need to be per-locality, I think you'll be limited to "host stats", and won't be able to use the cluster-wide stats.
```c++
#define ALL_HOST_STATS(COUNTER, GAUGE) \
  COUNTER(cx_connect_fail)             \
  COUNTER(cx_total)                    \
  COUNTER(rq_error)                    \
  COUNTER(rq_success)                  \
  COUNTER(rq_timeout)                  \
  COUNTER(rq_total)                    \
  GAUGE(cx_active)                     \
  GAUGE(rq_active)
```
But I think these cover what you need, so it should be fine. If you don't think these stats will be sufficient, we should discuss up front, because the memory-use impact of adding host stats is very high.
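For illustration, a rough C++ sketch of that per-locality aggregation, assuming the host counters have already been reduced to deltas since the previous load report; HostConnStats, LocalityConnStats, and aggregateByLocality are hypothetical names, not Envoy's actual host-stats API, and cx_connect_fail stands in as the per-host error signal:

```c++
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-host view of the ALL_HOST_STATS counters/gauge, with the
// counters already reduced to deltas since the previous load report.
struct HostConnStats {
  std::string locality;     // e.g. "region/zone/sub_zone"
  uint64_t cx_total;        // new connections this interval
  uint64_t cx_connect_fail; // failed connection attempts this interval
  uint64_t cx_active;       // currently active connections (gauge)
};

// Per-locality totals matching the proposed UpstreamLocalityStats fields.
struct LocalityConnStats {
  uint64_t total_new_connections = 0;
  uint64_t total_active_connections = 0;
  uint64_t total_error_connections = 0;
};

// The LRS fields live on UpstreamLocalityStats rather than on the cluster,
// so host-level values are summed into one entry per locality.
std::map<std::string, LocalityConnStats>
aggregateByLocality(const std::vector<HostConnStats>& hosts) {
  std::map<std::string, LocalityConnStats> out;
  for (const auto& host : hosts) {
    LocalityConnStats& stats = out[host.locality];
    stats.total_new_connections += host.cx_total;
    stats.total_active_connections += host.cx_active;
    stats.total_error_connections += host.cx_connect_fail;
  }
  return out;
}
```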
/wait
@abeyad are the comments addressed, or should this be tagged as waiting?
@AwesomePatrol I think we decided on:
Rename this field to total_connect_fail and/or point out in the comment that this is similar to the cx_connect_fail metric.
But the PR hasn't been updated accordingly. Tagging as waiting until it's addressed. Thanks!
/wait
The field was renamed to total_fail_connections (for consistency with total_{active,new}_connections). Should I change it again?
The description of the PR (which I assume will become the commit description) was also updated to reflect that the field now corresponds to the cx_connect_fail metric.
No that's fine, consistency is better.
Ah ok, I missed the PR description change. Those comments on each field would actually be better as comments in the proto file; if you could kindly add them to the PR, thanks!
Also, I suggest merging main, as there are some failing CI jobs; not sure if merging main will resolve them.
/retest