grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

stats/otel: Add subchannel metrics (A94)

Open mbissa opened this issue 4 months ago • 6 comments

Addresses : https://github.com/grpc/proposal/blob/master/A94-subchannel-otel-metrics.md

this PR adds sub channel metrics with applicable labels as per the RFC proposal.

RELEASE NOTES:

  • stats/otel: add subchannel metrics to eventually replace the pickfirst metrics

mbissa avatar Dec 02 '25 19:12 mbissa

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 83.37%. Comparing base (432bda3) to head (cc6c6e3). :warning: Report is 8 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8738      +/-   ##
==========================================
+ Coverage   83.22%   83.37%   +0.15%     
==========================================
  Files         419      418       -1     
  Lines       32454    32425      -29     
==========================================
+ Hits        27009    27034      +25     
+ Misses       4057     4010      -47     
+ Partials     1388     1381       -7     
Files with missing lines Coverage Δ
clientconn.go 90.44% <100.00%> (-0.06%) :arrow_down:
internal/internal.go 100.00% <ø> (ø)
internal/transport/http2_client.go 92.80% <100.00%> (+0.18%) :arrow_up:
internal/transport/transport.go 90.86% <ø> (+2.15%) :arrow_up:
internal/xds/xds.go 80.55% <100.00%> (+5.55%) :arrow_up:
stream.go 81.88% <100.00%> (+0.04%) :arrow_up:

... and 36 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Dec 02 '25 20:12 codecov[bot]

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

easwars avatar Dec 02 '25 22:12 easwars

@mbissa

A94 states the following:

Implementations that have already implemented the pick-first metrics should give 
enough time for users to transition to the new metrics. For example, implementations 
should report both the old pick-first metrics and the new subchannel metrics for
2 releases, and then remove the old pick-first metrics.

Can you please ensure that we have an issue filed to track the removal of the old metrics and that it captures the correct release where it needs to be removed.

easwars avatar Dec 02 '25 22:12 easwars

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

Here is how pickfirst handles this:

  1. When a subchannel connects, all remaining subchannels are removed from pickfirst's subchannel map. https://github.com/grpc/grpc-go/blob/647162c379ddeb85f45c6c43e9274662d2d1e2d6/balancer/pickfirst/pickfirst.go#L603-L605
  2. The updateSubConnState method ignores any updates from subchannels that are not in the subchannel map. https://github.com/grpc/grpc-go/blob/647162c379ddeb85f45c6c43e9274662d2d1e2d6/balancer/pickfirst/pickfirst.go#L585-L591

arjan-bal avatar Dec 04 '25 09:12 arjan-bal

@mbissa

A94 states the following:

Implementations that have already implemented the pick-first metrics should give 
enough time for users to transition to the new metrics. For example, implementations 
should report both the old pick-first metrics and the new subchannel metrics for
2 releases, and then remove the old pick-first metrics.

Can you please ensure that we have an issue filed to track the removal of the old metrics and that it captures the correct release where it needs to be removed.

Issue filed - https://github.com/grpc/grpc-go/issues/8752

mbissa avatar Dec 08 '25 12:12 mbissa

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

Handled as per PR

mbissa avatar Dec 08 '25 12:12 mbissa

Looks good mostly. Just some minor nits.

Can you also please update the PR description to note that the disconnection_reason will be plumbed in a follow-up PR. Thanks.

done.

mbissa avatar Dec 10 '25 20:12 mbissa