api: add total_{active,new,error}_connections to UpstreamLocalityStats
The new fields are marked with [#not-implemented-hide:] and work_in_progress per the guide.
Implementation will be added later in a separate PR.
Rationale: The current version of the LRS protocol uses requests as the unit of load. L4 load balancers don't have information on the number of requests, but they can approximate it by the number of new and active connections.
Envoy already collects and exposes some of the statistics that can be used for the implementation of the new fields:
total_new_connections ~ upstream_cx_total(now) - upstream_cx_total(now - load_report_interval)
total_active_connections ~ upstream_cx_active
total_error_connections ~ upstream_cx_connect_timeout + upstream_cx_idle_timeout + ...
Please note that the linked stats are aggregated per cluster, while the new fields will correspond to UpstreamLocality, so the actual implementation will be more nuanced.
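To make the approximation concrete, here is a minimal C++ sketch of deriving the proposed values from two snapshots of the cluster stats taken one load_report_interval apart; the ConnStatsSnapshot and computeLoadReportConnStats names are hypothetical, not part of Envoy's actual stats API:

```c++
#include <cstdint>

// Hypothetical snapshot of the relevant cluster stats at a single instant.
struct ConnStatsSnapshot {
  uint64_t upstream_cx_total;           // cumulative counter
  uint64_t upstream_cx_active;          // gauge
  uint64_t upstream_cx_connect_timeout; // cumulative counter
  uint64_t upstream_cx_idle_timeout;    // cumulative counter
};

// Values matching the proposed UpstreamLocalityStats fields for one report.
struct LoadReportConnStats {
  uint64_t total_new_connections;
  uint64_t total_active_connections;
  uint64_t total_error_connections;
};

// Counters are monotonic, so deltas between snapshots give per-interval
// values; the active-connections gauge is read directly from the newer
// snapshot.
LoadReportConnStats computeLoadReportConnStats(const ConnStatsSnapshot& prev,
                                               const ConnStatsSnapshot& now) {
  LoadReportConnStats out;
  out.total_new_connections = now.upstream_cx_total - prev.upstream_cx_total;
  out.total_active_connections = now.upstream_cx_active;
  out.total_error_connections =
      (now.upstream_cx_connect_timeout - prev.upstream_cx_connect_timeout) +
      (now.upstream_cx_idle_timeout - prev.upstream_cx_idle_timeout);
  return out;
}
```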
Hi @AwesomePatrol, welcome and thank you for your contribution.
We will try to review your Pull Request as quickly as possible.
In the meantime, please take a look at the contribution guidelines if you have not done so already.
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @abeyad
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).
This should also make sense for TCP proxy in Envoy, right? The motivation comes from non-Envoy xDS use cases, I believe, but this should be implementable and usable for our own L4 LB needs ideally.
Yes, I think it should work as expected; the stats in question are on the cluster, and I believe they're all set correctly in the tcp-proxy case.
Note that when implementing this, because the stats need to be per-locality, I think you'll be limited to "host stats", and won't be able to use the cluster-wide stats.
```c++
#define ALL_HOST_STATS(COUNTER, GAUGE) \
  COUNTER(cx_connect_fail)             \
  COUNTER(cx_total)                    \
  COUNTER(rq_error)                    \
  COUNTER(rq_success)                  \
  COUNTER(rq_timeout)                  \
  COUNTER(rq_total)                    \
  GAUGE(cx_active)                     \
  GAUGE(rq_active)
```
But I think these cover what you need, so it should be fine. If you don't think these stats will be sufficient, we should discuss up front, because the memory-use impact of adding host stats is very high.
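For illustration, a rough C++ sketch of that per-locality aggregation, assuming the host counters have already been reduced to deltas since the previous load report; HostConnStats, LocalityConnStats, and aggregateByLocality are hypothetical names, not Envoy's actual host-stats API, and cx_connect_fail stands in as the per-host error signal:

```c++
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-host view of the ALL_HOST_STATS counters/gauge, with the
// counters already reduced to deltas since the previous load report.
struct HostConnStats {
  std::string locality;     // e.g. "region/zone/sub_zone"
  uint64_t cx_total;        // new connections this interval
  uint64_t cx_connect_fail; // failed connection attempts this interval
  uint64_t cx_active;       // currently active connections (gauge)
};

// Per-locality totals matching the proposed UpstreamLocalityStats fields.
struct LocalityConnStats {
  uint64_t total_new_connections = 0;
  uint64_t total_active_connections = 0;
  uint64_t total_error_connections = 0;
};

// The LRS fields live on UpstreamLocalityStats rather than on the cluster,
// so host-level values are summed into one entry per locality.
std::map<std::string, LocalityConnStats>
aggregateByLocality(const std::vector<HostConnStats>& hosts) {
  std::map<std::string, LocalityConnStats> out;
  for (const auto& host : hosts) {
    LocalityConnStats& stats = out[host.locality];
    stats.total_new_connections += host.cx_total;
    stats.total_active_connections += host.cx_active;
    stats.total_error_connections += host.cx_connect_fail;
  }
  return out;
}
```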
/wait
@abeyad are the comments addressed, or should this be tagged as waiting?
@AwesomePatrol I think we decided on:
Rename this field to total_connect_fail and/or point out in the comment that this is similar to the cx_connect_fail metric.
But the PR hasn't been updated accordingly. Tagging as waiting until it's addressed. Thanks!
/wait
The field was renamed to total_fail_connections (for consistency with total_{active,new}_connections). Should I change it again?
The description of the PR (which I assume will become the commit description) was also updated to reflect that the field now corresponds to the cx_connect_fail metric.
No that's fine, consistency is better.
Ah ok, I missed the PR description change. Those comments on each field would actually be better as comments in the proto file; if you could kindly add them to the PR, thanks!
Also, I suggest merging main, as there are some failing CI jobs; not sure if merging main will resolve them.
/retest