
XDS Ring Hash inconsistent after client and server restarts

Open k-raval opened this issue 1 year ago • 30 comments

GRPC Version: 1.60.1
Go Version: 1.19.1
OS: Linux 5.4.0-81-generic #91-Ubuntu SMP aarch64 GNU/Linux

Setup:

  • Kubernetes cluster with Istio (1.20.2) installed
  • 3 GRPC servers (S1, S2, S3) run as one kubernetes deployment with 3 replicas
  • 3 GRPC clients (C1, C2, C3) run as one kubernetes deployment with 3 replicas
  • Istio virtual service and destination rule pointing to server-service, with the load balancer policy set to consistent hash on a GRPC header
  • Servers and Clients using GRPC XDS feature
  • Clients send a Hello message to the server over GRPC with a metadata header field, "XYZ"
  • The GRPC XDS ring hash is supposed to hash on this header and route the request to one of the servers
  • All clients repeat this in a loop with the same field value, so all messages reach just one server (the one chosen based on the hash of the value "XYZ"); a minimal client sketch follows this list
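
For reference, a minimal sketch of what such a client loop could look like (not our actual code; the proto, target name, port, and the header key/value are placeholders):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	pb "google.golang.org/grpc/examples/helloworld/helloworld"
	"google.golang.org/grpc/metadata"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver; assumes an xDS bootstrap (GRPC_XDS_BOOTSTRAP) is provided, as Istio does in proxyless mode
)

func main() {
	// Target name and port are placeholders; insecure credentials for brevity
	// (a real Istio setup would typically use xDS credentials).
	conn, err := grpc.Dial("xds:///server-service.default.svc.cluster.local:16000",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	client := pb.NewGreeterClient(conn)

	for {
		// The metadata key/value that the ring hash policy hashes on (placeholder).
		ctx := metadata.AppendToOutgoingContext(context.Background(), "XYZ", "1-AAAA")
		ctx, cancel := context.WithTimeout(ctx, time.Second)
		if _, err := client.SayHello(ctx, &pb.HelloRequest{Name: "1-AAAA"}); err != nil {
			log.Printf("RPC failed: %v", err)
		}
		cancel()
		time.Sleep(time.Second)
	}
}
```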

Problem: When one server and one client are killed at the same time (or crash simultaneously) and then restart, the hashing pattern changes. Some clients now send the message to one server (say S1), while other clients send it to S2. The header field value is still the same, yet different clients now hash it to different destinations.

Expectation: No matter in what order clients or servers go down and come back up, the hashing logic should map a given field value to exactly the same destination across all clients.

Other Information: If, instead of GRPC XDS hashing, we use an Envoy sidecar that does its own hashing, we always see the expected behaviour. However, we want to use the GRPC XDS feature because it avoids the extra hop through the Envoy proxy and gives us better performance.

@easwars - we discussed something similar in a question/answer format some time back in #5933. Appreciate your help here. Thanks in advance.

k-raval avatar Jan 18 '24 11:01 k-raval

A point to add - since this setup is on Kubernetes, when the clients or servers crash and come back up, they come up with different IP addresses/names. Just for info, in case that makes any difference.

k-raval avatar Jan 18 '24 11:01 k-raval

Hi @ginayeh, any update? Let me know if more info is needed. Thanks.

k-raval avatar Jan 30 '24 12:01 k-raval

Hi @zasweq, @ginayeh - any update? Awaiting your inputs. Thanks.

k-raval avatar Feb 06 '24 07:02 k-raval

Hi @k-raval, thanks for the patience. Are you only seeing this inconsistent behavior when both client AND server are killed? Would you mind sharing your example code to reproduce the ring hash inconsistent behavior?

ginayeh avatar Feb 06 '24 23:02 ginayeh

To answer your question: when I kill either one server or one client independently, this behaviour is not observed, but when I kill a server and a client simultaneously, the problem appears.

Attaching the codebase here. Directory contains:

  1. Client code <go build -o client cmd/client/main.go>
  2. Server code <go build -o server cmd/server/main.go>
  3. server.yaml file - replace the path of the local directory from which the server binary is mounted (search for the path /home/kraval/test/grpctest)
  4. client.yaml file - replace the path of the local directory from which the client binary is mounted (search for the path /home/kraval/test/grpctest)

Environment: K3s-based Kubernetes cluster; Istio installed using the Helm chart (just Istio-base and Istiod, Ingress Gateway not required).

Steps:

  1. Start server - kubectl apply -f server.yaml
  2. Wait for the servers to start; the deployment starts 3 replicas
  3. Start client - kubectl apply -f client.yaml
  4. Again 3 replicas of client will start.
  5. Client has 4 different unique strings, 1-AAAA...., 2-BBBB...., 3-CCCC...., 4-DDDD....
  6. Each of the above unique strings is used as a GRPC metadata pair, and the message is sent to the server
  7. Since we have set up Istio rules to hash on this string, which map to the GRPC XDS ring hash, the outgoing messages from each client are hashed on it and reach the appropriate server.
  8. Note the logs of the 3 servers (a sketch of such a logging handler follows this list). Each unique string reaches exactly 1 server, e.g. 1-AAAA reaches one particular server, say server-1, irrespective of which client (1/2/3) sent it. Hence you see 3 messages of 1-AAAA reaching that server (one from each client).
  9. kubectl delete pod (pick any one server and one client to kill)
  10. The client and server come back up. Again observe all 3 server logs.
  11. It is observed that one particular unique key gets hashed to different destinations by different clients. You start seeing 2-BBBB messages reaching server-1 as well as server-2. So, different clients are doing different hashing now.
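
For context, a rough sketch of the server-side handler that produces the per-key log lines mentioned in step 8 (not our actual code; the helloworld proto and the metadata key "ULID" are stand-ins):

```go
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
	pb "google.golang.org/grpc/examples/helloworld/helloworld"
	"google.golang.org/grpc/metadata"
)

type server struct {
	pb.UnimplementedGreeterServer
}

// SayHello logs which hashing key reached this replica, so the per-key
// distribution can be compared across the three server replicas.
func (s *server) SayHello(ctx context.Context, req *pb.HelloRequest) (*pb.HelloReply, error) {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		log.Printf("received key %v in request %q", md.Get("ULID"), req.GetName())
	}
	return &pb.HelloReply{Message: "Hello " + req.GetName()}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":16000")
	if err != nil {
		log.Fatalf("listen failed: %v", err)
	}
	s := grpc.NewServer() // a proxyless xDS setup would use xds.NewGRPCServer instead
	pb.RegisterGreeterServer(s, &server{})
	log.Fatal(s.Serve(lis))
}
```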

Thanks. Let me know if you want to try something. hashtest.tgz

k-raval avatar Feb 07 '24 10:02 k-raval

Hi @ginayeh, let me know if you need any help in reproducing the issue. Some more info: I used Calico as the CNI on the K3s cluster, and my Linux was Ubuntu 22.04 with uname: Linux kartik-ubuntu 5.15.0-92-generic #102-Ubuntu SMP Wed Jan 10 09:37:39 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

k-raval avatar Feb 12 '24 06:02 k-raval

Hi @ginayeh, any update so far? Let me know. Thanks.

k-raval avatar Feb 19 '24 06:02 k-raval

Hi @arvindbr8, let me know if you need any help in simulating this. Awaiting your reply. Thanks.

k-raval avatar Mar 04 '24 04:03 k-raval

@k-raval -- We appreciate your patience here. Just taking a bit to come up to speed on xDS ring hash consistency. Will get back to you ASAP.

arvindbr8 avatar Mar 04 '24 18:03 arvindbr8

Hi @k-raval -- apologies. We are a bit thin on the ground with multiple teammates on vacation, hence the delayed turnaround time.

I would lean on you to help understand the behavior of istiod here. Looking at your setup, it seems like you are expecting header-based ring hash load balancing. As per the spec:

header: gRPC will support this type of hash policy. This allows hashing on a request header. The header matching logic will behave the same way as specified for route matching, as per gRFC A28 (i.e., it will ignore headers with a -bin suffix and may need special handling for content-type).

And this is the code that actually generates the hash for each RPC (https://github.com/grpc/grpc-go/blob/7c377708dc070bc6f4dfeedfb3c3f38c3410912b/xds/internal/resolver/serviceconfig.go#L224).

Based on the reading above, the header to use for hashing needs to come through the xDS response as a RouteAction with the hash_policy field set. I'm not sure what the xDS response in an Istio environment looks like, and I would like your help here. Could you paste the RouteAction part of the xDS response so that we can debug this better?
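
To illustrate, here is a simplified sketch of what the header hash policy boils down to: xxHash64 over the header value, per gRFC A42. The real code linked above handles multiple policies, terminal policies, and the no-match fallback, so treat this only as an approximation:

```go
package main

import (
	"fmt"

	"github.com/cespare/xxhash/v2" // the hash function used by gRPC ring hash, per gRFC A42
	"google.golang.org/grpc/metadata"
)

// requestHash is a simplified stand-in for the header hash policy: take the
// configured header's value from the request metadata and hash it with
// xxHash64. If the header is absent, no hash is produced and the real code
// falls back to other policies or a random hash.
func requestHash(md metadata.MD, headerName string) (uint64, bool) {
	values := md.Get(headerName)
	if len(values) == 0 {
		return 0, false
	}
	return xxhash.Sum64String(values[0]), true
}

func main() {
	md := metadata.Pairs("ULID", "1-AAAA")
	h, ok := requestHash(md, "ULID")
	fmt.Println(h, ok) // the same header value always yields the same hash
}
```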

PS: The particular scenario where the client and server restart is one of those cases where the definition of persistence differs from person to person, especially because in gRPC the ring changes whenever backends are scaled down/up (or even when there is churn of backends). It is also important to note that a new gRPC channel could potentially create a different ring for the same backends, based on the ordering of the endpoints in the response from the xdsResolver.

arvindbr8 avatar Mar 12 '24 21:03 arvindbr8

Thanks @arvindbr8. I started 3 clients and 3 servers and collected the GRPC logs of one particular client to capture the XDS response. Attaching the log file here.

For your comment on ring formation, my point is this: irrespective of the order in which the XDS resolver sends the list of server endpoints, the client could order them (for example, alphabetically) to construct the ring. Then all clients would follow the same behaviour, and hashing a given parameter would choose the same destination server across clients. Let me know if this makes sense?

xds.txt

k-raval avatar Mar 15 '24 13:03 k-raval

Thanks for the logs. The RouteAction looks correct to me with the hash_policy providing the header to use.

"hash_policy":{"header":{"header_name":"ULID"}}

Also, I should rephrase my previous comment about ring formation.

the ring changes

What I meant here is that it is hard to guarantee the ring will look alike across all clients. The current hashing algorithm tries to ensure that, on an address update, only a small portion of the keys in the ring are updated within the same binary. But when the client restarts, the shape of the ring might look totally different.

For your comment on ring formation, my point is this: irrespective of the order in which the XDS resolver sends the list of server endpoints, the client could order them (for example, alphabetically) to construct the ring. Then all clients would follow the same behaviour, and hashing a given parameter would choose the same destination server across clients. Let me know if this makes sense?

The ring hash LB policy is implemented as per the xDS spec. Keeping consistency across different runs of the client doesn't seem to be in the scope of this LB policy. It seems like the ring hash LB policy might be too barebones for your use case here.

A55-xds-stateful-session-affinity.md might be something you are interested in, though with caveats around the client/server shutdown cases. However, this feature is pending implementation in Go and has not been prioritized yet.

arvindbr8 avatar Mar 18 '24 21:03 arvindbr8

Hi @arvindbr8, thanks for the explanation.

My requirement is simple. For a given parameter (e.g. the ULID string above), I want all the different clients, independently started and stateless, to hash it to the SAME destination server endpoint. If this costs me unequal load distribution, I am fine with that. If the server endpoints change (add/delete/crash-restart), the same ULID may get hashed to some other server endpoint, which is also fine, but I need all clients to follow the same behaviour independently, without exchanging any state.

And I see this working when an Envoy sidecar does the load balancing with the same configuration. However, to avoid the cost of the proxy, I wanted to use gRPC proxyless mode with XDS, and hence expected the same behaviour here as well.

Let me know if you see any solution, since you are much more familiar with the GRPC codebase.

Thanks again and awaiting your reply.

k-raval avatar Mar 20 '24 03:03 k-raval

Let me follow up with the team to get you better guidance for this use case. Meanwhile, could you confirm that in the Envoy proxy case you are also using the same ring hash LB policy?

arvindbr8 avatar Mar 20 '24 16:03 arvindbr8

I was not able to find language that promises a deterministic ring across all client implementations, either in A42: xDS Ring Hash LB Policy or in Envoy's ring hash doc. Also, there are small differences in the way the ring hash LB policy behaves between Envoy and gRPC -- some of which are mentioned here.

I think the purpose of ring hash is only best-effort consistent hashing from a single client -- which does not seem to be what you are looking for.

arvindbr8 avatar Mar 20 '24 18:03 arvindbr8

Hi @arvindbr8, what I am looking for is very standard for a 'stateless' client. If the XDS ring hash algorithm itself is stateless, then it will behave the same across multiple clients as well as across client restarts.

I dug a little into the GRPC ring hash code and also collected some logs. Here are my observations:

  1. The order of server endpoints in the XDS response does not matter, as the algorithm hashes the server identity while placing entries on the ring.
  2. The weight of each address is used to determine its number of entries in the ring. I see that this makes all the difference when I compare two clients.
  3. Since I have not given any locality preference, locality and priority should not matter.
  4. The XDS response gives every endpoint exactly the same weight: 1.
  5. So initially, every client adds all 3 server endpoints with weight 1, and all clients have exactly the same ring.
  6. Hashing a particular input yields the same result across all clients.
  7. Now I kill one server and one client.
  8. The client starts first and gets an XDS response with 2 server endpoints, again with equal weights of 1. It forms the ring:
2024/03/21 07:28:43 INFO: [xds] [ring-hash-lb 0xc0003f8d80] newRing: number of subConns is 2, minRingSize is 1024, maxRingSize is 4096
2024/03/21 07:28:43 INFO: [xds] [ring-hash-lb 0xc0003f8d80] newRing: normalized subConn weights is [{0xc000482f60 0.5} {0xc000482de0 0.5}]
2024/03/21 07:28:43 INFO: [xds] [ring-hash-lb 0xc0003f8d80] newRing: creating new ring of size 1024

  9. Now the restarted server endpoint also becomes available (although with a different IP address). So, for this client, one endpoint got deleted and another got added. The ring in the 'restarted' client changes again like this:
2024/03/21 07:28:44 INFO: [xds] [ring-hash-lb 0xc0003f8d80] newRing: number of subConns is 3, minRingSize is 1024, maxRingSize is 4096
2024/03/21 07:28:44 INFO: [xds] [ring-hash-lb 0xc0003f8d80] newRing: normalized subConn weights is [{0xc000482f60 0.2857142857142857} {0xc000482de0 0.2857142857142857} {0xc00012c9c0 0.42857142857142855}]
2024/03/21 07:28:44 INFO: [xds] [ring-hash-lb 0xc0003f8d80] newRing: creating new ring of size 1026
  10. The other client, which did not restart, has a ring like this:
2024/03/21 07:28:44 INFO: [xds] [ring-hash-lb 0xc0005adc00] newRing: number of subConns is 3, minRingSize is 1024, maxRingSize is 4096
2024/03/21 07:28:44 INFO: [xds] [ring-hash-lb 0xc0005adc00] newRing: normalized subConn weights is [{0xc00042a720 0.3333333333333333} {0xc00042a660 0.3333333333333333} {0xc0002e2720 0.3333333333333333}]
2024/03/21 07:28:44 INFO: [xds] [ring-hash-lb 0xc0005adc00] newRing: creating new ring of size 1026

I see that the XDS response always sends a load-balancing weight of 1 for each endpoint. I am not sure where the restarted client picked up different weights. I assume that since this is ring hash, the endpoint weights should not be dynamically recalculated anywhere; they should be taken from the XDS response as-is. Just to test, I changed the code to always return a weight of 1 (which is what the XDS response returns), and then the ring is always the same across clients and things work as expected. However, I need your eyes on this to confirm whether my interpretation is right or wrong.
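
To make the effect of the weights concrete, here is a simplified, self-contained sketch of ring construction and lookup (not the actual grpc-go implementation; its ring-size scaling and per-entry keys differ). The point it illustrates: two clients that disagree on the endpoint weights build different rings and can route the same key to different backends.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"

	"github.com/cespare/xxhash/v2"
)

type ringEntry struct {
	hash uint64
	addr string
}

// buildRing gives each endpoint a number of ring entries proportional to its
// normalized weight and places each entry at the hash of "address_index".
func buildRing(weights map[string]float64, ringSize int) []ringEntry {
	var ring []ringEntry
	for addr, w := range weights {
		n := int(w * float64(ringSize))
		for i := 0; i < n; i++ {
			ring = append(ring, ringEntry{hash: xxhash.Sum64String(addr + "_" + strconv.Itoa(i)), addr: addr})
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].hash < ring[j].hash })
	return ring
}

// pick returns the first ring entry clockwise from the request hash.
func pick(ring []ringEntry, requestHash uint64) string {
	i := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= requestHash })
	if i == len(ring) {
		i = 0
	}
	return ring[i].addr
}

func main() {
	// Weights as seen in the two clients' logs: equal thirds vs. 3:2:2.
	equal := buildRing(map[string]float64{"s1:16000": 1.0 / 3, "s2:16000": 1.0 / 3, "s3:16000": 1.0 / 3}, 1026)
	skewed := buildRing(map[string]float64{"s1:16000": 3.0 / 7, "s2:16000": 2.0 / 7, "s3:16000": 2.0 / 7}, 1026)
	h := xxhash.Sum64String("1-AAAA")
	fmt.Println(pick(equal, h), pick(skewed, h)) // can pick different backends for the same key
}
```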

Thanks.

k-raval avatar Mar 21 '24 12:03 k-raval

We have one more input. With additional logs, we found that it is the locality weight that affects the overall weight calculation.

Client that restarted along with server:

2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc00047b580] newRing: number of subConns is 3, minRingSize is 1024, maxRingSize is 4096
2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc00047b580] NJ subConn ({Addr: "10.196.248.4:16000", ServerName: "", BalancerAttributes: {"<%!p(internal.localityKeyType=grpc.xds.internal.address.locality)>": "<%!p(internal.LocalityID={  })>" , "<%!p(wrrlocality.attributeKey={})>": "Locality Weight: 2" , "<%!p(hierarchy.pathKeyType=grpc.internal.address.hierarchical_path)>": "<0xc00007ac30>" , "<%!p(weightedroundrobin.attributeKey={})>": "Weight: 2" }}), Weight (2)
2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc00047b580] NJ subConn ({Addr: "10.196.248.20:16000", ServerName: "", BalancerAttributes: {"<%!p(hierarchy.pathKeyType=grpc.internal.address.hierarchical_path)>": "<0xc00040c850>" , "<%!p(internal.localityKeyType=grpc.xds.internal.address.locality)>": "<%!p(internal.LocalityID={  })>" , "<%!p(wrrlocality.attributeKey={})>": "Locality Weight: 3" , "<%!p(weightedroundrobin.attributeKey={})>": "Weight: 3" }}), Weight (3)
2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc00047b580] NJ subConn ({Addr: "10.196.248.45:16000", ServerName: "", BalancerAttributes: {"<%!p(internal.localityKeyType=grpc.xds.internal.address.locality)>": "<%!p(internal.LocalityID={  })>" , "<%!p(weightedroundrobin.attributeKey={})>": "Weight: 2" , "<%!p(wrrlocality.attributeKey={})>": "Locality Weight: 2" , "<%!p(hierarchy.pathKeyType=grpc.internal.address.hierarchical_path)>": "<0xc00007ac10>" }}), Weight (2)
2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc00047b580] newRing: normalized subConn weights is [{0xc000582960 0.42857142857142855} {0xc00049df20 0.2857142857142857} {0xc00052c060 0.2857142857142857}]
2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc00047b580] newRing: creating new ring of size 1026

Client that did not restart during above:

2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc000646400] newRing: number of subConns is 3, minRingSize is 1024, maxRingSize is 4096
2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc000646400] NJ subConn ({Addr: "10.196.248.45:16000", ServerName: "", BalancerAttributes: {"<%!p(hierarchy.pathKeyType=grpc.internal.address.hierarchical_path)>": "<0xc0004102b0>" , "<%!p(internal.localityKeyType=grpc.xds.internal.address.locality)>": "<%!p(internal.LocalityID={  })>" , "<%!p(wrrlocality.attributeKey={})>": "Locality Weight: 3" , "<%!p(weightedroundrobin.attributeKey={})>": "Weight: 3" }}), Weight (3)
2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc000646400] NJ subConn ({Addr: "10.196.248.4:16000", ServerName: "", BalancerAttributes: {"<%!p(wrrlocality.attributeKey={})>": "Locality Weight: 3" , "<%!p(weightedroundrobin.attributeKey={})>": "Weight: 3" , "<%!p(hierarchy.pathKeyType=grpc.internal.address.hierarchical_path)>": "<0xc0004102d0>" , "<%!p(internal.localityKeyType=grpc.xds.internal.address.locality)>": "<%!p(internal.LocalityID={  })>" }}), Weight (3)
2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc000646400] NJ subConn ({Addr: "10.196.248.20:16000", ServerName: "", BalancerAttributes: {"<%!p(hierarchy.pathKeyType=grpc.internal.address.hierarchical_path)>": "<0xc0004111b0>" , "<%!p(internal.localityKeyType=grpc.xds.internal.address.locality)>": "<%!p(internal.LocalityID={  })>" , "<%!p(wrrlocality.attributeKey={})>": "Locality Weight: 3" , "<%!p(weightedroundrobin.attributeKey={})>": "Weight: 3" }}), Weight (3)
2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc000646400] newRing: normalized subConn weights is [{0xc0006c26c0 0.3333333333333333} {0xc000488a20 0.3333333333333333} {0xc000488ae0 0.3333333333333333}]
2024/03/23 14:31:11 INFO: [xds] [ring-hash-lb 0xc000646400] newRing: creating new ring of size 1026
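
For what it's worth, the normalized weights printed above follow directly from those per-endpoint weights; a tiny sketch of the normalization is below (treating the per-endpoint "Weight" attribute from the logs as the effective weight is our reading of the logs, not a claim about the exact grpc-go code path):

```go
package main

import "fmt"

// normalize turns per-endpoint weights into the fractions that the ring hash
// balancer logs as "normalized subConn weights".
func normalize(weights []uint32) []float64 {
	var sum uint32
	for _, w := range weights {
		sum += w
	}
	out := make([]float64, len(weights))
	for i, w := range weights {
		out[i] = float64(w) / float64(sum)
	}
	return out
}

func main() {
	fmt.Println(normalize([]uint32{2, 3, 2})) // restarted client: [0.2857... 0.4285... 0.2857...]
	fmt.Println(normalize([]uint32{3, 3, 3})) // stable client:    [0.3333... 0.3333... 0.3333...]
}
```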

We did not provide any locality weight in the Istio configuration. Let us know if this can be fixed through any configuration option?

Thanks.

k-raval avatar Mar 26 '24 07:03 k-raval

Thanks for the additional information. I will take a look at it shortly

arvindbr8 avatar Mar 26 '24 19:03 arvindbr8

@k-raval - Could you clarify whether the ADS response dump that you provided us earlier was for the client that restarts or the one that doesn't?

arvindbr8 avatar Mar 26 '24 21:03 arvindbr8

Attaching the complete logs of both clients: client-1, which did not restart and was running fine, and client-2, which restarted along with the server (logs after restart attached).

client-1.txt client-2-restarted.txt

k-raval avatar Mar 27 '24 06:03 k-raval

Based on the ClusterLoadAssignment ADS response, I'm not able to reproduce the same ring on my side. Alternatively, I'm trying to see if I can reproduce the issue by setting up a cluster with istiod using the Helm chart.

I'm seeing that the CDS response is missing the locality field. Was that stripped from the dump? If so, I guess there is no concern, but otherwise it could potentially cause endpoints to be ignored from the response. The calculation for generating the ring should be rather simple in your case, with just one priority and equal weights. I'm trying to figure out what configuration would produce the same set of normalized weights that we see in your case.

Since I have not been able to reproduce this on my side, one way forward I can think of is to add more debug statements around when UpdateClientConnState is called to generate the new ring. I can add that code to a branch. Would you mind updating your client code to use that branch so that I can get more information?

arvindbr8 avatar Apr 03 '24 16:04 arvindbr8

Thanks. The logs are as-is; nothing was stripped from them. I can test from your branch with the added logs. Let me know the branch.

k-raval avatar Apr 03 '24 16:04 k-raval

Apologies for the delay. Could you try with this branch?

arvindbr8 avatar Apr 09 '24 20:04 arvindbr8

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Apr 15 '24 20:04 github-actions[bot]

Removing the stale label to give more time for the reproduction, since there might be a potential bug here.

arvindbr8 avatar Apr 15 '24 21:04 arvindbr8

Hi @arvindbr8, I was able to reproduce the issue with your branch with logs enabled.

Sequence of events:

  • 3 clients, 3 servers running
  • Client-2 and Server-2 killed at the same time.
  • Somehow, no issue was observed
  • Client-1 and Server-1 killed at the same time
  • Issue now observed.

Attaching logs:

  • client2.txt - client2 logs before restart
  • client2-restarted.txt - client2 logs after restart
  • client1.txt - client1 logs before restart
  • client1-restarted.txt - client1 logs after restart (the issue is seen in these logs)
  • client3.txt - client3, never restarted; its logs cover all events, including both the client1 and client2 restarts

testlogs.tgz

k-raval avatar Apr 16 '24 09:04 k-raval

Thanks.

Somehow, no issue was observed

This shows evidence of non-determinism. I'm now somewhat inclined to believe there is an interaction with istiod's ADS responses that makes gRPC's behavior non-deterministic. Anyway, I'll take a look.

arvindbr8 avatar Apr 17 '24 21:04 arvindbr8

I think I have found what the issue could be. I have pushed some more commits to https://github.com/arvindbr8/grpc-go/tree/dbug_6927. This contains a potential fix and additional debug messages (which would be useful in case the fix does not completely resolve the issue).

Do you mind running the test again with the HEAD @ https://github.com/arvindbr8/grpc-go/tree/dbug_6927

arvindbr8 avatar Apr 17 '24 22:04 arvindbr8

Hi @arvindbr8, happy to report that with your new commit the issue seems to be fixed. Attaching the logs again for your review. Thanks a lot for putting in so much effort.

I tried the same procedure twice, and each time I did two sets of restarts (each set consisting of one client and one server). Attaching logs from both runs.

logsnew2.tgz logsnew.tgz

k-raval avatar Apr 18 '24 11:04 k-raval

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Apr 24 '24 12:04 github-actions[bot]