ServerWatchStream can't be closed when the client closes the watcher
Bug report criteria
- [x] This bug report is not security related, security issues should be disclosed privately via [email protected].
- [x] This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
- [x] You have read the etcd bug reporting guidelines.
- [x] Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
What happened?
ServerWatchStream can't be closed when the client closes the watcher.
What did you expect to happen?
When the client closes the watcher, the ServerWatchStream should be closed.
How can we reproduce it (as minimally and precisely as possible)?
Using jetcd, write the following code and run it:

```java
while (true) {
    Watch.Watcher watcher = etcdClient.getWatchClient().watch(....);
    TimeUnit.SECONDS.sleep(5);
    watcher.close();
}
```
Anything else we need to know?
In `func (ws *watchServer) Watch(stream pb.Watch_WatchServer) (err error)`, `sws.recvLoop()` returns nil when it receives EOF, so the line `errc <- rerr` is never executed and the select below it never fires. This leads to a serverWatchStream leak.
Etcd version (please run commands below)
$ etcd --version
# 3.5.17
$ etcdctl version
# 3.5.17
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
Yeah, although this PR seems to have fixed some issues, I don't see any difference. Adding a `<-sws.closec` branch for when a cancel watch request is received does not help:
```go
select {
case sws.ctrlStream <- wr:
case <-sws.closec:
	return nil
}
```
Any updates on this? This is a memory leak and may cause catastrophic problems. We are hoping this bug can be fixed.
@serathius @apullo777 can you help and take a look at this issue?
@ahrtr would you mind taking a look?
Hi,
I'm new to contributing to etcd and have been studying the codebase. I believe I found a solution to the memory leak issue.
The memory leak occurs due to improper resource cleanup in serverWatchStream.Watch():
- Race condition: when the client cancels the context quickly, the select statement chooses `<-stream.Context().Done()` before `errc` receives the error from `recvLoop`, causing incomplete cleanup.
- EOF case: when `recvLoop()` returns nil (EOF), no signal is sent to `errc`, causing the select to hang indefinitely.
The fix involves three key changes:
- Added a synchronization channel to ensure the `recvLoop` goroutine always completes
- Fixed the EOF case by sending `io.EOF` to `errc` when `recvLoop()` returns nil
- Unified cleanup by always closing `ctrlStream` regardless of which select case triggers
```go
func (ws *watchServer) Watch(stream pb.Watch_WatchServer) (err error) {
	// ... existing code ...
	errc := make(chan error, 1)
	done := make(chan struct{}) // NEW: synchronization channel

	go func() {
		defer close(done) // NEW: signal completion
		if rerr := sws.recvLoop(); rerr != nil {
			// ... existing error handling ...
			errc <- rerr
		} else {
			// NEW: handle EOF case to prevent hanging select
			errc <- io.EOF
		}
	}()

	select {
	case err = <-errc:
		if errors.Is(err, context.Canceled) {
			err = rpctypes.ErrGRPCWatchCanceled
		}
	case <-stream.Context().Done():
		err = stream.Context().Err()
		if errors.Is(err, context.Canceled) {
			err = rpctypes.ErrGRPCWatchCanceled
		}
		// NEW: wait for recvLoop to complete to prevent resource leaks
		<-done
	}

	// NEW: always close ctrlStream for proper cleanup
	close(sws.ctrlStream)
	sws.close()
	return err
}
```
Would appreciate feedback on this approach.
@Siomachkin thank you for your reply. We have considered your "Fixed the EOF case by sending `io.EOF` to `errc` when `recvLoop()` returns nil". We think it would be normal for the client to close only the write side of the stream (a half-close): if a client decides it has nothing more to send to the etcd server, it can close the write side but may still want to be notified when the keys it is watching change. So when the client half-closes, the etcd server may get an EOF, but the server should not close the stream.
One possible solution would be for the server to check whether the stream has no more watchers; if so, it can close the stream. But there is one thing to consider: the watch API's gRPC method is defined as a bidirectional streaming call. I was wondering whether the client is allowed to send more than one watch request to the etcd server on one stream; if it is, then the server should do nothing, and the client should be responsible for closing all the streams.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.