Potential goroutine leak in serverWatchStream
Bug report criteria
- [X] This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
- [X] This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
- [X] You have read the etcd bug reporting guidelines.
- [X] Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
What happened?
We run several moderately loaded clusters with the following characteristics:
- Tens of thousands of clients (steady count)
- Millions of watchers
- Hundreds of thousands of keys
We had a few performance issues at this scale, so the client has logic to:
- Spread watchers across multiple HTTP/2 streams.
- Periodically move watchers from one stream to another to even out the watcher load across etcd server nodes over time. This is implemented via context metadata (a fuller sketch follows the snippet below):
watchCtx = metadata.AppendToOutgoingContext(watchCtx,
    "watch_group", strconv.Itoa(w.streamId),
    "generation", strconv.FormatInt(w.generation, 10),
)
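For reference, a self-contained version of that fragment, assuming a hypothetical watcher struct carrying streamId and generation (names taken from the snippet, everything else is an assumption). The clientv3 watcher keys its underlying gRPC watch streams by outgoing context metadata, which is why distinct metadata values land on separate streams:

    // Illustrative sketch only; the watcher struct and helper are assumptions
    // reconstructed from the fragment above, not our production code.
    package watchgroups

    import (
        "context"
        "strconv"

        clientv3 "go.etcd.io/etcd/client/v3"
        "google.golang.org/grpc/metadata"
    )

    type watcher struct {
        streamId   int   // which client-side stream group this watcher belongs to
        generation int64 // bumped when the watcher is moved to a new stream
    }

    // openWatch attaches the group/generation metadata so the watch is placed
    // on the gRPC stream keyed by that metadata; canceling the returned
    // CancelFunc stops the watch.
    func openWatch(cli *clientv3.Client, w *watcher, key string) (clientv3.WatchChan, context.CancelFunc) {
        watchCtx, cancel := context.WithCancel(context.Background())
        watchCtx = metadata.AppendToOutgoingContext(watchCtx,
            "watch_group", strconv.Itoa(w.streamId),
            "generation", strconv.FormatInt(w.generation, 10),
        )
        return cli.Watch(watchCtx, key), cancel
    }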
We observe what appears to be a slow goroutine leak on our clusters.
root@pdx3a-rp11-24a:~# curl --cert /srv/grpc-certs/etcd/latest/cert.pem --key /srv/grpc-certs/etcd/latest/key.pem --http1.1 -k -s https://localhost:5000/debug/pprof/goroutine?debug=1 | head -n 10
goroutine profile: total 177241
43832 @ 0x43fe6e 0x40a6a5 0x40a2f7 0xd9bb8d 0xd9af66 0x473181
# 0xd9bb8c go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*serverWatchStream).recvLoop+0x70c external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:348
# 0xd9af65 go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*watchServer).Watch.func2+0x45 external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:191
21683 @ 0x43fe6e 0x438617 0x46d665 0x4a77a7 0x4a8a9a 0x4a8a88 0x627365 0x637d45 0xdb842f 0xdb995e 0x68f39b 0x50f178 0x68f57e 0x68cb50 0x692e18 0x692e21 0x4a29f0 0x8b4c85 0x8b4c54 0x8b53c5 0x8c0227 0x473181
# 0x46d664 internal/poll.runtime_pollWait+0x84 GOROOT/src/runtime/netpoll.go:343
# 0x4a77a6 internal/poll.(*pollDesc).wait+0x26 GOROOT/src/internal/poll/fd_poll_runtime.go:84
# 0x4a8a99 internal/poll.(*pollDesc).waitRead+0x279 GOROOT/src/internal/poll/fd_poll_runtime.go:89
# 0x4a8a87 internal/poll.(*FD).Read+0x267 GOROOT/src/internal/poll/fd_unix.go:164
----------------------
# after a few seconds:
----------------------
root@pdx3a-rp11-24a:~# curl --cert /srv/grpc-certs/etcd/latest/cert.pem --key /srv/grpc-certs/etcd/latest/key.pem --http1.1 -k -s https://localhost:5000/debug/pprof/goroutine?debug=1 | head -n 10
goroutine profile: total 177042
43836 @ 0x43fe6e 0x40a6a5 0x40a2f7 0xd9bb8d 0xd9af66 0x473181
# 0xd9bb8c go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*serverWatchStream).recvLoop+0x70c external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:348
# 0xd9af65 go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*watchServer).Watch.func2+0x45 external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:191
21684 @ 0x43fe6e 0x438617 0x46d665 0x4a77a7 0x4a8a9a 0x4a8a88 0x627365 0x637d45 0xdb842f 0xdb995e 0x68f39b 0x50f178 0x68f57e 0x68cb50 0x692e18 0x692e21 0x4a29f0 0x8b4c85 0x8b4c54 0x8b53c5 0x8c0227 0x473181
# 0x46d664 internal/poll.runtime_pollWait+0x84 GOROOT/src/runtime/netpoll.go:343
# 0x4a77a6 internal/poll.(*pollDesc).wait+0x26 GOROOT/src/internal/poll/fd_poll_runtime.go:84
# 0x4a8a99 internal/poll.(*pollDesc).waitRead+0x279 GOROOT/src/internal/poll/fd_poll_runtime.go:89
# 0x4a8a87 internal/poll.(*FD).Read+0x267 GOROOT/src/internal/poll/fd_unix.go:164
So this one in particular seems to be the culprit:
43832 @ 0x43fe6e 0x40a6a5 0x40a2f7 0xd9bb8d 0xd9af66 0x473181
# 0xd9bb8c go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*serverWatchStream).recvLoop+0x70c external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:348
# 0xd9af65 go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*watchServer).Watch.func2+0x45 external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:191
I suspect it's due to the server watch stream not being properly closed after all watchers have migrated to a new HTTP/2 stream.
What did you expect to happen?
Steady number of goroutines over time.
How can we reproduce it (as minimally and precisely as possible)?
I'd assume creating and stopping a set of watchers, each with its own context metadata, would trigger the issue. A rough sketch of such a repro follows.
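Untested sketch along those lines; the endpoint address, key names, counts, and sleep durations are placeholders, not from our setup:

    // Create N watchers, each with unique context metadata so each lands on
    // its own server watch stream, cancel them all, then compare the server's
    // goroutine count via /debug/pprof/goroutine as shown above.
    package main

    import (
        "context"
        "fmt"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
        "google.golang.org/grpc/metadata"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints: []string{"127.0.0.1:2379"}, // placeholder endpoint
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        cancels := make([]context.CancelFunc, 0, 1000)
        for i := 0; i < 1000; i++ {
            ctx, cancel := context.WithCancel(context.Background())
            // Unique metadata per watcher forces a dedicated gRPC watch stream.
            ctx = metadata.AppendToOutgoingContext(ctx, "watch_group", fmt.Sprint(i))
            cli.Watch(ctx, fmt.Sprintf("/repro/key-%d", i))
            cancels = append(cancels, cancel)
        }

        // Give the server time to establish all streams, then cancel everything.
        time.Sleep(5 * time.Second)
        for _, cancel := range cancels {
            cancel()
        }

        // The server's goroutine count should return to its baseline here; if
        // recvLoop goroutines keep accumulating, the leak reproduces.
        time.Sleep(30 * time.Second)
    }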
Anything else we need to know?
No response
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.5.14
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.21.8 X:nocoverageredesign
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.5.14
API version: 3.5
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
+------------------+---------+----------------+----------------------------+----------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+----------------+----------------------------+----------------------------+------------+
| 20f075a9d63e8059 | started | pdx5a-ra32-5b | https://10.168.15.59:5100 | https://10.168.15.59:5000 | false |
| 4ae19b2029c5b406 | started | pdx5a-ra34-6a | https://10.168.25.61:5100 | https://10.168.25.61:5000 | false |
| 58bc11a0454b8612 | started | pdx3a-rq24-21a | https://10.90.71.28:5100 | https://10.90.71.28:5000 | false |
| 68000ca8f34d6c48 | started | pdx3a-rw6-9b | https://10.90.197.51:5100 | https://10.90.197.51:5000 | false |
| afd4dcb6f2b8a05f | started | pdx4a-ri13-9a | https://10.136.212.58:5100 | https://10.136.212.58:5000 | false |
| d008c4eba2073739 | started | pdx4a-rd10-21a | https://10.136.25.28:5100 | https://10.136.25.28:5000 | false |
| e58692a57c5202f4 | started | pdx3a-rp11-24a | https://10.90.34.43:5100 | https://10.90.34.43:5000 | false |
+------------------+---------+----------------+----------------------------+----------------------------+------------+
$ etcdctl --endpoints=<member list> endpoint status -w table
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.136.212.58:5000 | afd4dcb6f2b8a05f | 3.5.14 | 2.6 GB | false | false | 369 | 1163505937 | 1163505937 | |
| 10.136.25.28:5000 | d008c4eba2073739 | 3.5.14 | 2.6 GB | false | false | 369 | 1163505938 | 1163505938 | |
| 10.168.15.59:5000 | 20f075a9d63e8059 | 3.5.14 | 2.6 GB | false | false | 369 | 1163505938 | 1163505938 | |
| 10.168.25.61:5000 | 4ae19b2029c5b406 | 3.5.14 | 2.6 GB | false | false | 369 | 1163505939 | 1163505939 | |
| 10.90.197.51:5000 | 68000ca8f34d6c48 | 3.5.14 | 2.6 GB | false | false | 369 | 1163505939 | 1163505937 | |
| 10.90.34.43:5000 | e58692a57c5202f4 | 3.5.14 | 2.6 GB | true | false | 369 | 1163505939 | 1163505939 | |
| 10.90.71.28:5000 | 58bc11a0454b8612 | 3.5.14 | 2.6 GB | false | false | 369 | 1163505939 | 1163505939 | |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Relevant log output
No response