grpc-proxy: subsequent client watchers hang after auth token expires
### Bug report criteria
- [x] This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
- [x] This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
- [x] You have read the etcd bug reporting guidelines.
- [x] Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
### What happened?
When clients connect through the grpc-proxy to an etcd cluster that has authentication enabled, clients that connect after the auth token TTL has expired do not receive any watch events.
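From the client's point of view, the failure looks like a watch that opens but never fires. A minimal sketch (assuming the proxy listens on 127.0.0.1:42379 and a root/rootPassword user exists; both are illustrative, not taken from the report):

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect through the grpc-proxy with username/password auth.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{"127.0.0.1:42379"},
		Username:  "root",
		Password:  "rootPassword",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// If an earlier client's auth token has already expired on the proxy,
	// this watch is established but never delivers any events.
	for resp := range cli.Watch(context.Background(), "/test", clientv3.WithPrefix()) {
		if err := resp.Err(); err != nil {
			log.Printf("watch response error: %v", err)
			continue
		}
		for _, ev := range resp.Events {
			log.Printf("event: %s %q", ev.Type, ev.Kv.Key)
		}
	}
}
```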
### What did you expect to happen?

Clients connecting to the proxy after the first client's token has expired should be able to establish their watch successfully and receive events.
### How can we reproduce it (as minimally and precisely as possible)?
This test reproduces the issue:

```diff
diff --git a/tests/e2e/etcd_grpcproxy_test.go b/tests/e2e/etcd_grpcproxy_test.go
index 02174e89f..0e109779d 100644
--- a/tests/e2e/etcd_grpcproxy_test.go
+++ b/tests/e2e/etcd_grpcproxy_test.go
@@ -17,6 +17,8 @@ package e2e
 import (
 	"context"
 	"strings"
+	"sync"
+	"sync/atomic"
 	"testing"
 	"time"
 
@@ -142,3 +144,101 @@ func waitForEndpointInLog(ctx context.Context, proxyProc *expect.ExpectProcess,
 	return err
 }
+
+func TestGRPCProxyWatchersAfterTokenExpiry(t *testing.T) {
+	ctx, cancel := context.WithCancel(context.Background())
+	defer cancel()
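+	// Single-node cluster with simple auth tokens that expire after one second.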
+	cluster, err := e2e.NewEtcdProcessCluster(ctx, t,
+		e2e.WithClusterSize(1),
+		e2e.WithAuthTokenOpts("simple"),
+		e2e.WithAuthTokenTTL(1),
+	)
+	require.NoError(t, err)
+	t.Cleanup(func() { require.NoError(t, cluster.Stop()) })
+
+	cli := cluster.Etcdctl()
+
+	createUsers(ctx, t, cli)
+
+	require.NoError(t, cli.AuthEnable(ctx))
+
+	var (
+		node1ClientURL = cluster.Procs[0].Config().ClientURL
+		proxyClientURL = "127.0.0.1:42379"
+	)
+
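+	// Start a grpc-proxy in front of the node; the watch clients below go through it.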
+	proxyProc, err := e2e.SpawnCmd([]string{
+		e2e.BinPath.Etcd, "grpc-proxy", "start",
+		"--advertise-client-url", proxyClientURL,
+		"--listen-addr", proxyClientURL,
+		"--endpoints", node1ClientURL,
+	}, nil)
+	require.NoError(t, err)
+	t.Cleanup(func() { require.NoError(t, proxyProc.Stop()) })
+
+	var totalEventsCount int64
+
+	handler := func(events clientv3.WatchChan) {
+		for {
+			select {
+			case ev, open := <-events:
+				if !open {
+					return
+				}
+				if ev.Err() != nil {
+					t.Logf("watch response error: %s", ev.Err())
+					continue
+				}
+				atomic.AddInt64(&totalEventsCount, 1)
+			case <-ctx.Done():
+				return
+			}
+		}
+	}
+
+	withAuth := e2e.WithAuth("root", "rootPassword")
+	withEndpoint := e2e.WithEndpoints([]string{proxyClientURL})
+
+	events := cluster.Etcdctl(withAuth, withEndpoint).Watch(ctx, "/test", config.WatchOptions{Prefix: true, Revision: 1})
+
+	wg := sync.WaitGroup{}
+
+	wg.Add(1)
+	go func() {
+		defer wg.Done()
+		handler(events)
+	}()
+
+	clusterCli := cluster.Etcdctl(withAuth)
+	require.NoError(t, clusterCli.Put(ctx, "/test/1", "test", config.PutOptions{}))
+
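+	// Sleep past the one-second token TTL so the first client's token expires.
+	// The watchers created below should still replay the put from revision 1.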
+	time.Sleep(time.Second * 2)
+
+	events2 := cluster.Etcdctl(withAuth, withEndpoint).Watch(ctx, "/test", config.WatchOptions{Prefix: true, Revision: 1})
+
+	wg.Add(1)
+	go func() {
+		defer wg.Done()
+		handler(events2)
+	}()
+
+	events3 := cluster.Etcdctl(withAuth, withEndpoint).Watch(ctx, "/test", config.WatchOptions{Prefix: true, Revision: 1})
+
+	wg.Add(1)
+	go func() {
+		defer wg.Done()
+		handler(events3)
+	}()
+
+	time.Sleep(time.Second)
+
+	cancel()
+	wg.Wait()
+
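+	// One put was made; each of the three watchers should observe exactly one event.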
+	assert.Equal(t, int64(3), atomic.LoadInt64(&totalEventsCount))
+}
diff --git a/tests/framework/e2e/cluster.go b/tests/framework/e2e/cluster.go
index 3a2f83888..ef7e257f0 100644
--- a/tests/framework/e2e/cluster.go
+++ b/tests/framework/e2e/cluster.go
@@ -296,6 +296,10 @@ func WithRollingStart(rolling bool) EPClusterOption {
 	return func(c *EtcdProcessClusterConfig) { c.RollingStart = rolling }
 }
 
+func WithAuthTokenTTL(ttl uint) EPClusterOption {
+	return func(c *EtcdProcessClusterConfig) { c.ServerConfig.AuthTokenTTL = ttl }
+}
+
 func WithDiscovery(discovery string) EPClusterOption {
 	return func(c *EtcdProcessClusterConfig) { c.Discovery = discovery }
 }
```
### Anything else we need to know?

I proposed a potential fix in PR: https://github.com/etcd-io/etcd/pull/19033
### Etcd version (please run commands below)

```console
$ etcd --version
etcd Version: 3.5.17
Git SHA: 762e93874
Go Version: go1.22.10
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.17
API version: 3.5
```
### Etcd configuration (command line flags or environment variables)

```console
paste your configuration here
```
### Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console
$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
```
### Relevant log output

---

We covered this in our previous triage meeting, and again today. Judging by the description, trying to use an expired token to establish a new connection doesn't sound like a bug. Or at least, that's what we understand from the description of your bug. Can you please clarify further?
I can see now that the description might be a bit vague. I will try to clarify the issue with more details.
The problem can be reproduced like this:
- Client A connects to proxy using user/pass authentication
- Client A starts watching a key
- Client A's token expires
- Client B connects to proxy using user/pass authentication
- Client B starts watching same key as client A
- Both client A's and client B's watchers now hang

The e2e test included above reproduces this; the sketch below shows the same sequence with plain clientv3 clients.
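For readers without the e2e framework, roughly the same flow as a standalone program (a sketch, assuming a grpc-proxy on 127.0.0.1:42379 in front of a cluster started with `--auth-token simple` and a short `--auth-token-ttl`; the endpoint and credentials are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// newClient dials the grpc-proxy with username/password auth.
func newClient() (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints: []string{"127.0.0.1:42379"},
		Username:  "root",
		Password:  "rootPassword",
	})
}

func main() {
	ctx := context.Background()

	// Client A connects and starts watching.
	a, err := newClient()
	if err != nil {
		panic(err)
	}
	defer a.Close()
	chA := a.Watch(ctx, "/test", clientv3.WithPrefix())

	// Wait past the token TTL so client A's token expires.
	time.Sleep(2 * time.Second)

	// Client B connects and watches the same key range.
	b, err := newClient()
	if err != nil {
		panic(err)
	}
	defer b.Close()
	chB := b.Watch(ctx, "/test", clientv3.WithPrefix())

	// Trigger an event; with the bug present, neither watcher sees it.
	if _, err := b.Put(ctx, "/test/1", "v"); err != nil {
		panic(err)
	}

	select {
	case resp := <-chA:
		fmt.Println("A got", len(resp.Events), "events")
	case resp := <-chB:
		fmt.Println("B got", len(resp.Events), "events")
	case <-time.After(5 * time.Second):
		fmt.Println("both watchers hung: no events within 5s")
	}
}
```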
Thanks for clarifying, @krijohs. It sounds like a valid issue.
cc. @ahrtr
Hi, just checking if anyone has had time to take a look at this? @ivanvc
Hi, @krijohs, unfortunately, the grpcproxy has few contributors. I'll bring this topic/issue to the next triage meeting.
@ivanvc I can have a look at this if you want. I was able to reproduce it locally.
@nwnt I've added a PR with a possible solution in https://github.com/etcd-io/etcd/pull/19033 which addresses this issue; it would be great to get your feedback on the approach.
@krijohs thanks for letting me know. Let me find some time to look at the PR. You should hear back from me in a couple of days.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.