grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

ringhash: implement gRFC A76

Open atollena opened this issue 10 months ago • 3 comments

Fixes https://github.com/grpc/grpc-go/issues/8147.

  • Add the ability to set the request hash key, to extract the hash from a header. This allows using ring hash without xDS.

  • Add the ability to specify the location of endpoints on the ring, and a default implementation based on EDS metadata that matches Envoy behavior.

See https://github.com/grpc/proposal/blob/master/A76-ring-hash-improvements.md for details.

RELEASE NOTES:

  • ringhash: implement gRFC A76.

atollena avatar Mar 11 '25 09:03 atollena

Codecov Report

Attention: Patch coverage is 97.45223% with 4 lines in your changes missing coverage. Please review.

Project coverage is 81.97%. Comparing base (a51009d) to head (8816ae5). Report is 13 commits behind head on master.

Files with missing lines Patch % Lines
internal/testutils/xds/e2e/clientresources.go 75.00% 1 Missing and 1 partial :warning:
xds/internal/balancer/ringhash/picker.go 96.42% 2 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8159      +/-   ##
==========================================
+ Coverage   81.91%   81.97%   +0.06%     
==========================================
  Files         410      412       +2     
  Lines       40216    40498     +282     
==========================================
+ Hits        32942    33198     +256     
- Misses       5894     5912      +18     
- Partials     1380     1388       +8     
Files with missing lines Coverage Δ
internal/envconfig/envconfig.go 100.00% <ø> (ø)
internal/metadata/metadata.go 82.08% <100.00%> (+1.44%) :arrow_up:
internal/testutils/envconfig.go 100.00% <100.00%> (ø)
resolver/ringhash/attr.go 100.00% <100.00%> (ø)
...internal/balancer/clusterresolver/configbuilder.go 91.22% <100.00%> (+0.05%) :arrow_up:
xds/internal/balancer/ringhash/config.go 92.10% <100.00%> (+3.21%) :arrow_up:
xds/internal/balancer/ringhash/ring.go 100.00% <100.00%> (ø)
xds/internal/balancer/ringhash/ringhash.go 94.22% <100.00%> (+0.35%) :arrow_up:
xds/internal/balancer/ringhash/util.go 100.00% <100.00%> (ø)
xds/internal/resolver/serviceconfig.go 85.71% <100.00%> (ø)
... and 3 more

... and 37 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Mar 11 '25 09:03 codecov[bot]

Thanks for the review @danielzhaotongliu . I took your feedback into consideration.

There is a failing test which I think may be flaky? (it's Test/End2End in the advancedtls module which seems unrelated). Also the code coverage report didn't run because of a missing token, not sure if that's expected.

atollena avatar Mar 21 '25 13:03 atollena

I also fixed the configuration to use camelCase instead of snake_case: requestHashHeader instead of request_hash_header, since this is the convention for service config (it's somewhat hidden in the spec since it relies on protojson conventions).

atollena avatar Mar 27 '25 17:03 atollena

The changes look good to me except I want to understand a little more here: https://github.com/grpc/grpc-go/pull/8159#discussion_r2013121441

easwars avatar Mar 31 '25 21:03 easwars

The new Test/RingHash_RequestHashKeyConnecting is flaky with 2 wrong results:

  1. https://github.com/grpc/grpc-go/actions/runs/14198973868/job/39780980028?pr=8159 happens presumably because the balancer did not have time to attempt connection by the time the test failed. This is easy to reproduce on an overloaded system:
        tlogger.go:116: INFO balancer_wrapper.go:122 [core] [Channel #443]Channel switches to new LB policy "ring_hash_experimental"  (t=+54.462972ms)
        tlogger.go:116: INFO ringhash.go:65 [xds] [ring-hash-lb 0x40006ba7d0] Created  (t=+54.659732ms)
        tlogger.go:116: INFO clientconn.go:333 [core] [Channel #443]Channel exiting idle mode  (t=+59.793742ms)
        ringhash_balancer_test.go:2851: Got 0 new connection attempts for 1 RPC, want 1
  1. https://github.com/grpc/grpc-go/actions/runs/14198973868/job/39780980474?pr=8159 is more concerning: we triggered 3 connections with a single RPC. I need to dig into this a bit more into this one, but I'm having trouble reproducing it.
tlogger.go:116: INFO balancer_wrapper.go:122 [core] [Channel #993]Channel switches to new LB policy "ring_hash_experimental"  (t=+844.313µs)
        tlogger.go:116: INFO ringhash.go:65 [xds] [ring-hash-lb 0xc0007a6050] Created  (t=+861.886µs)
        tlogger.go:116: INFO clientconn.go:333 [core] [Channel #993]Channel exiting idle mode  (t=+1.066039ms)
        tlogger.go:116: INFO clientconn.go:563 [core] [Channel #993]Channel Connectivity change to CONNECTING  (t=+1.156518ms)
        tlogger.go:116: INFO clientconn.go:857 [core] [Channel #993 SubChannel #995]Subchannel created  (t=+1.175083ms)
        tlogger.go:116: INFO clientconn.go:1224 [core] [Channel #993 SubChannel #995]Subchannel Connectivity change to CONNECTING  (t=+1.196734ms)
        tlogger.go:116: INFO clientconn.go:1343 [core] [Channel #993 SubChannel #995]Subchannel picks a new address "127.0.0.1:38697" to connect  (t=+1.212413ms)
        tlogger.go:116: INFO blocking_context_dialer.go:62 [testutils] Hold 0xc00078b290: Intercepted connection attempt to addr "127.0.0.1:38697"  (t=+1.230607ms)
        tlogger.go:116: INFO blocking_context_dialer.go:118 [testutils] Hold 0xc00078b290: Resuming connection attempt to addr "127.0.0.1:38697"  (t=+21.111214ms)
        tlogger.go:116: INFO clientconn.go:1224 [core] [Channel #993 SubChannel #995]Subchannel Connectivity change to READY  (t=+21.888224ms)
        tlogger.go:116: INFO clientconn.go:563 [core] [Channel #993]Channel Connectivity change to READY  (t=+21.954167ms)
        tlogger.go:116: INFO clientconn.go:857 [core] [Channel #993 SubChannel #998]Subchannel created  (t=+23.80122ms)
        tlogger.go:116: INFO clientconn.go:1224 [core] [Channel #993 SubChannel #998]Subchannel Connectivity change to CONNECTING  (t=+23.846886ms)
        tlogger.go:116: INFO clientconn.go:1343 [core] [Channel #993 SubChannel #998]Subchannel picks a new address "127.0.0.1:39911" to connect  (t=+23.871232ms)
        tlogger.go:116: INFO blocking_context_dialer.go:62 [testutils] Hold 0xc00078b2c0: Intercepted connection attempt to addr "127.0.0.1:39911"  (t=+23.963525ms)
        tlogger.go:116: INFO clientconn.go:857 [core] [Channel #993 SubChannel #999]Subchannel created  (t=+23.985777ms)
        tlogger.go:116: INFO clientconn.go:857 [core] [Channel #993 SubChannel #1000]Subchannel created  (t=+24.025371ms)
        tlogger.go:116: INFO clientconn.go:1224 [core] [Channel #993 SubChannel #1000]Subchannel Connectivity change to CONNECTING  (t=+24.065296ms)
        tlogger.go:116: INFO clientconn.go:1343 [core] [Channel #993 SubChannel #1000]Subchannel picks a new address "127.0.0.1:37063" to connect  (t=+24.136706ms)
        tlogger.go:116: INFO blocking_context_dialer.go:62 [testutils] Hold 0xc00078b[320](https://github.com/grpc/grpc-go/actions/runs/14198973868/job/39780980474?pr=8159#step:8:321): Intercepted connection attempt to addr "127.0.0.1:37063"  (t=+24.166442ms)
        tlogger.go:116: INFO clientconn.go:1224 [core] [Channel #993 SubChannel #999]Subchannel Connectivity change to CONNECTING  (t=+24.196068ms)
        tlogger.go:116: INFO clientconn.go:1343 [core] [Channel #993 SubChannel #999]Subchannel picks a new address "127.0.0.1:44925" to connect  (t=+24.219642ms)
        tlogger.go:116: INFO blocking_context_dialer.go:62 [testutils] Hold 0xc00078b2f0: Intercepted connection attempt to addr "127.0.0.1:44925"  (t=+24.251512ms)
        ringhash_balancer_test.go:2903: Got 3 connection attempts, want at most 1

I can't fix 1 by deterministically waiting for a connection, but then I would miss cases where more connections are to come...

atollena avatar Apr 01 '25 15:04 atollena

After our discussion yesterday:

  1. I'll update the picker to call ExitIdle on the underlying balancer directly instead of relying on Pick to wake up the subchannel.
  2. I'll change the tests to avoid the flakiness mentioned above.

atollena avatar Apr 02 '25 09:04 atollena

There is a race in a ringhash test caught by CI: https://github.com/grpc/grpc-go/actions/runs/14220675758/job/39847655816?pr=8159

arjan-bal avatar Apr 03 '25 08:04 arjan-bal

There is a race in a ringhash test caught by CI: https://github.com/grpc/grpc-go/actions/runs/14220675758/job/39847655816?pr=8159

Yep, that was just the err variable local to the test. Fixed in the latest commit.

atollena avatar Apr 03 '25 08:04 atollena

I also opened https://github.com/grpc/proposal/pull/489 to clarify some of the behaviors I discovered during the implementation. It also answers the concern from @arjan-bal regarding duplicate hash keys.

atollena avatar Apr 03 '25 08:04 atollena