
lease: Fix incorrect gRPC Unavailable on client cancel during LeaseKeepAlive forwarding

Open zhijun42 opened this issue 1 week ago • 2 comments

Helps investigate the long-standing issue https://github.com/etcd-io/etcd/issues/13632 (a GitHub Action mistakenly closed it due to inactivity).

Problem

Some users reported that grpc_server_handled_total{grpc_code="Unavailable"} was unexpectedly inflated for LeaseKeepAlive requests even when the cluster was healthy.

Fix

The function LeaseServer.LeaseKeepAlive always turns context.Canceled (which maps to gRPC codes.Canceled) into rpctypes.ErrGRPCNoLeader (which maps to gRPC codes.Unavailable), even when it is the client that initiates the cancellation. As a result, the gRPC metrics count these requests under the wrong status code.

The old comment is wrong: // the only server-side cancellation is noleader for now.

In fact, there is no server-side cancellation anywhere in the EtcdServer.LeaseRenew path. The only scenario in which this function returns errors.ErrCanceled is when the client cancels the request and the Done signal propagates into this function.

The fix is straightforward. To validate it, I added two test cases in which I send a LeaseKeepAlive request to one follower in the cluster and, while the follower is forwarding the request to the leader, block the leader's ServeHTTP path via a Go failpoint.

While the request is blocked, one test case cancels it, and the other waits until the forwarding request times out. Each case should receive the expected error.

zhijun42, Dec 24 '25 13:12