ringhash: implement gRFC A76
Fixes https://github.com/grpc/grpc-go/issues/8147.
- Add the ability to set the request hash key, so the hash is extracted from a header. This allows using ring hash without xDS (see the sketch below).
- Add the ability to specify the location of endpoints on the ring, and a default implementation based on EDS metadata that matches Envoy behavior.
See https://github.com/grpc/proposal/blob/master/A76-ring-hash-improvements.md for details.
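For illustration, here is a minimal, hypothetical sketch of how a non-xDS client could use the new request hash header via service config. The target, header name, and values are made up; the `requestHashHeader` field name and the `ring_hash_experimental` policy name are the ones used elsewhere in this PR, and the feature may additionally be gated behind an experimental env var (the PR touches `internal/envconfig`), which the sketch ignores.

```go
// Sketch of a non-xDS client using ring hash keyed on a request header.
// Target, header name, and session value are illustrative only.
package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/metadata"
)

const serviceConfig = `{
  "loadBalancingConfig": [{
    "ring_hash_experimental": {
      "requestHashHeader": "x-session-id"
    }
  }]
}`

func main() {
	cc, err := grpc.NewClient("dns:///my-service.example.com:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
	if err != nil {
		panic(err)
	}
	defer cc.Close()

	// RPCs carrying the same x-session-id value hash to the same endpoint.
	ctx := metadata.AppendToOutgoingContext(context.Background(), "x-session-id", "user-42")
	_ = ctx // pass ctx when invoking RPCs
}
```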
RELEASE NOTES:
- ringhash: implement gRFC A76.
Codecov Report
Attention: Patch coverage is 97.45223% with 4 lines in your changes missing coverage. Please review.
Project coverage is 81.97%. Comparing base (a51009d) to head (8816ae5). Report is 13 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/testutils/xds/e2e/clientresources.go | 75.00% | 1 Missing and 1 partial :warning: |
| xds/internal/balancer/ringhash/picker.go | 96.42% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #8159 +/- ##
==========================================
+ Coverage 81.91% 81.97% +0.06%
==========================================
Files 410 412 +2
Lines 40216 40498 +282
==========================================
+ Hits 32942 33198 +256
- Misses 5894 5912 +18
- Partials 1380 1388 +8
| Files with missing lines | Coverage Δ | |
|---|---|---|
| internal/envconfig/envconfig.go | 100.00% <ø> (ø) | |
| internal/metadata/metadata.go | 82.08% <100.00%> (+1.44%) | :arrow_up: |
| internal/testutils/envconfig.go | 100.00% <100.00%> (ø) | |
| resolver/ringhash/attr.go | 100.00% <100.00%> (ø) | |
| ...internal/balancer/clusterresolver/configbuilder.go | 91.22% <100.00%> (+0.05%) | :arrow_up: |
| xds/internal/balancer/ringhash/config.go | 92.10% <100.00%> (+3.21%) | :arrow_up: |
| xds/internal/balancer/ringhash/ring.go | 100.00% <100.00%> (ø) | |
| xds/internal/balancer/ringhash/ringhash.go | 94.22% <100.00%> (+0.35%) | :arrow_up: |
| xds/internal/balancer/ringhash/util.go | 100.00% <100.00%> (ø) | |
| xds/internal/resolver/serviceconfig.go | 85.71% <100.00%> (ø) | |
| ... and 3 more | | |
Thanks for the review @danielzhaotongliu. I took your feedback into consideration.
There is a failing test which I think may be flaky (it's Test/End2End in the advancedtls module, which seems unrelated). Also, the code coverage report didn't run because of a missing token; I'm not sure if that's expected.
I also fixed the configuration to use camelCase instead of snake_case: `requestHashHeader` instead of `request_hash_header`, as this is the convention for service config (it's somewhat hidden in the spec, since it relies on protojson conventions).
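To make the naming convention concrete, here is an illustrative sketch of the LB config struct with the camelCase JSON tags; field and struct names are examples following the existing ring hash config fields, not necessarily the PR's exact definitions.

```go
// Illustrative ring_hash service config struct; JSON tags use the camelCase
// (protojson) convention discussed above.
package ringhash

import "google.golang.org/grpc/serviceconfig"

type LBConfig struct {
	serviceconfig.LoadBalancingConfig `json:"-"`

	MinRingSize uint64 `json:"minRingSize,omitempty"`
	MaxRingSize uint64 `json:"maxRingSize,omitempty"`

	// RequestHashHeader names the request header whose value is hashed to
	// place the RPC on the ring (gRFC A76); serialized as "requestHashHeader".
	RequestHashHeader string `json:"requestHashHeader,omitempty"`
}
```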
The changes look good to me except I want to understand a little more here: https://github.com/grpc/grpc-go/pull/8159#discussion_r2013121441
The new Test/RingHash_RequestHashKeyConnecting is flaky, with two distinct failure modes:
- https://github.com/grpc/grpc-go/actions/runs/14198973868/job/39780980028?pr=8159 happens presumably because the balancer did not have time to attempt a connection before the test checked. This is easy to reproduce on an overloaded system:
tlogger.go:116: INFO balancer_wrapper.go:122 [core] [Channel #443]Channel switches to new LB policy "ring_hash_experimental" (t=+54.462972ms)
tlogger.go:116: INFO ringhash.go:65 [xds] [ring-hash-lb 0x40006ba7d0] Created (t=+54.659732ms)
tlogger.go:116: INFO clientconn.go:333 [core] [Channel #443]Channel exiting idle mode (t=+59.793742ms)
ringhash_balancer_test.go:2851: Got 0 new connection attempts for 1 RPC, want 1
- https://github.com/grpc/grpc-go/actions/runs/14198973868/job/39780980474?pr=8159 is more concerning: we triggered 3 connections with a single RPC. I need to dig into this one a bit more, but I'm having trouble reproducing it.
tlogger.go:116: INFO balancer_wrapper.go:122 [core] [Channel #993]Channel switches to new LB policy "ring_hash_experimental" (t=+844.313µs)
tlogger.go:116: INFO ringhash.go:65 [xds] [ring-hash-lb 0xc0007a6050] Created (t=+861.886µs)
tlogger.go:116: INFO clientconn.go:333 [core] [Channel #993]Channel exiting idle mode (t=+1.066039ms)
tlogger.go:116: INFO clientconn.go:563 [core] [Channel #993]Channel Connectivity change to CONNECTING (t=+1.156518ms)
tlogger.go:116: INFO clientconn.go:857 [core] [Channel #993 SubChannel #995]Subchannel created (t=+1.175083ms)
tlogger.go:116: INFO clientconn.go:1224 [core] [Channel #993 SubChannel #995]Subchannel Connectivity change to CONNECTING (t=+1.196734ms)
tlogger.go:116: INFO clientconn.go:1343 [core] [Channel #993 SubChannel #995]Subchannel picks a new address "127.0.0.1:38697" to connect (t=+1.212413ms)
tlogger.go:116: INFO blocking_context_dialer.go:62 [testutils] Hold 0xc00078b290: Intercepted connection attempt to addr "127.0.0.1:38697" (t=+1.230607ms)
tlogger.go:116: INFO blocking_context_dialer.go:118 [testutils] Hold 0xc00078b290: Resuming connection attempt to addr "127.0.0.1:38697" (t=+21.111214ms)
tlogger.go:116: INFO clientconn.go:1224 [core] [Channel #993 SubChannel #995]Subchannel Connectivity change to READY (t=+21.888224ms)
tlogger.go:116: INFO clientconn.go:563 [core] [Channel #993]Channel Connectivity change to READY (t=+21.954167ms)
tlogger.go:116: INFO clientconn.go:857 [core] [Channel #993 SubChannel #998]Subchannel created (t=+23.80122ms)
tlogger.go:116: INFO clientconn.go:1224 [core] [Channel #993 SubChannel #998]Subchannel Connectivity change to CONNECTING (t=+23.846886ms)
tlogger.go:116: INFO clientconn.go:1343 [core] [Channel #993 SubChannel #998]Subchannel picks a new address "127.0.0.1:39911" to connect (t=+23.871232ms)
tlogger.go:116: INFO blocking_context_dialer.go:62 [testutils] Hold 0xc00078b2c0: Intercepted connection attempt to addr "127.0.0.1:39911" (t=+23.963525ms)
tlogger.go:116: INFO clientconn.go:857 [core] [Channel #993 SubChannel #999]Subchannel created (t=+23.985777ms)
tlogger.go:116: INFO clientconn.go:857 [core] [Channel #993 SubChannel #1000]Subchannel created (t=+24.025371ms)
tlogger.go:116: INFO clientconn.go:1224 [core] [Channel #993 SubChannel #1000]Subchannel Connectivity change to CONNECTING (t=+24.065296ms)
tlogger.go:116: INFO clientconn.go:1343 [core] [Channel #993 SubChannel #1000]Subchannel picks a new address "127.0.0.1:37063" to connect (t=+24.136706ms)
tlogger.go:116: INFO blocking_context_dialer.go:62 [testutils] Hold 0xc00078b320: Intercepted connection attempt to addr "127.0.0.1:37063" (t=+24.166442ms)
tlogger.go:116: INFO clientconn.go:1224 [core] [Channel #993 SubChannel #999]Subchannel Connectivity change to CONNECTING (t=+24.196068ms)
tlogger.go:116: INFO clientconn.go:1343 [core] [Channel #993 SubChannel #999]Subchannel picks a new address "127.0.0.1:44925" to connect (t=+24.219642ms)
tlogger.go:116: INFO blocking_context_dialer.go:62 [testutils] Hold 0xc00078b2f0: Intercepted connection attempt to addr "127.0.0.1:44925" (t=+24.251512ms)
ringhash_balancer_test.go:2903: Got 3 connection attempts, want at most 1
I could fix case 1 by deterministically waiting for a connection attempt, but then I would miss cases where more connections are still to come...
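For context, a generic way to observe connection attempts in a test is a custom dialer that counts dials (the real tests use the internal blocking dialer visible in the logs above). This sketch only illustrates the counting; the hard part, as noted, is deciding how long to wait before asserting that no further attempts will arrive.

```go
// Hypothetical sketch: count connection attempts with a custom dialer.
package main

import (
	"context"
	"net"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	var attempts atomic.Int32
	dialer := func(ctx context.Context, addr string) (net.Conn, error) {
		attempts.Add(1) // record every connection attempt made by the channel
		var d net.Dialer
		return d.DialContext(ctx, "tcp", addr)
	}
	cc, err := grpc.NewClient("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(dialer),
	)
	if err != nil {
		panic(err)
	}
	defer cc.Close()
	// ... issue an RPC, then assert on attempts.Load(); choosing how long to
	// wait before asserting is exactly the flaky part discussed above.
	_ = attempts.Load()
}
```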
After our discussion yesterday:
- I'll update the picker to call `ExitIdle` on the underlying balancer directly instead of relying on `Pick` to wake up the subchannel (see the sketch after this list).
- I'll change the tests to avoid the flakiness mentioned above.
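To illustrate the first point, here is a simplified, hypothetical picker showing the pattern: when the chosen endpoint is IDLE, the picker asks the LB policy to exit idle (via a callback) and queues the pick, rather than connecting the SubConn from inside `Pick`. Names and wiring here are illustrative, not the PR's actual API.

```go
// Sketch of a picker that defers connecting to the balancer's ExitIdle path.
package ringhash

import (
	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/connectivity"
)

type endpointState struct {
	state   connectivity.State
	subConn balancer.SubConn
}

type picker struct {
	// endpoints on the ring, simplified to a slice for this sketch.
	endpoints []*endpointState
	// requestConnection asks the ring_hash balancer to call ExitIdle on the
	// child for this endpoint; hypothetical hook.
	requestConnection func(*endpointState)
}

func (p *picker) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
	// Real code hashes the request and walks the ring; the sketch just takes
	// the first endpoint.
	e := p.endpoints[0]
	switch e.state {
	case connectivity.Ready:
		return balancer.PickResult{SubConn: e.subConn}, nil
	case connectivity.Idle:
		// Ask the balancer (not the SubConn) to start connecting, then queue
		// the RPC until a new picker is produced.
		p.requestConnection(e)
		return balancer.PickResult{}, balancer.ErrNoSubConnAvailable
	default:
		return balancer.PickResult{}, balancer.ErrNoSubConnAvailable
	}
}
```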
There is a race in a ringhash test caught by CI: https://github.com/grpc/grpc-go/actions/runs/14220675758/job/39847655816?pr=8159
> There is a race in a ringhash test caught by CI: https://github.com/grpc/grpc-go/actions/runs/14220675758/job/39847655816?pr=8159
Yep, that was just the err variable local to the test. Fixed in the latest commit.
I also opened https://github.com/grpc/proposal/pull/489 to clarify some of the behaviors I discovered during the implementation. It also answers the concern from @arjan-bal regarding duplicate hash keys.