
ServerWatchStream can't be closed when client closed watcher

Open chaofengliu-okg opened this issue 5 months ago • 7 comments

Bug report criteria

What happened?

The ServerWatchStream is not closed after the client closes its watcher.

What did you expect to happen?

When the client closes its watcher, the ServerWatchStream should be closed.

How can we reproduce it (as minimally and precisely as possible)?

Using jetcd, write the following code and run it:

```java
while (true) {
    Watch.Watcher watcher = etcdClient.getWatchClient().watch(....);
    TimeUnit.SECONDS.sleep(5);
    watcher.close();
}
```

Anything else we need to know?

In `func (ws *watchServer) Watch(stream pb.Watch_WatchServer) (err error)`, `sws.recvLoop()` returns nil when it receives EOF, so the line `errc <- rerr` is never executed. As a result, the select below it never fires, and the serverWatchStream leaks.

Etcd version (please run commands below)

$ etcd --version
# 3.5.17

$ etcdctl version
# 3.5.17

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output


chaofengliu-okg avatar Aug 14 '25 02:08 chaofengliu-okg

Yeah, although this PR seems to fix some issues, I don't see any difference.

Adding a `<-sws.closec:` branch for when a cancel watch request is received does not help:

```go
select {
case sws.ctrlStream <- wr:
case <-sws.closec:
	return nil
}
```

exceptionplayer avatar Aug 14 '25 02:08 exceptionplayer

Any updates on this? This is a memory leak and may cause catastrophic problems. We are hoping this bug can be fixed.

adamsau-okg avatar Sep 02 '25 06:09 adamsau-okg

@serathius @apullo777 can you help and take a look at this issue?

adamsau-okg avatar Sep 02 '25 12:09 adamsau-okg

@ahrtr would you mind taking a look?

adamsau-okg avatar Sep 03 '25 03:09 adamsau-okg

Hi,

I'm new to contributing to etcd and have been studying the codebase. I believe I found a solution to the memory leak issue.

The memory leak occurs due to improper resource cleanup in serverWatchStream.Watch():

  1. Race condition: when the client cancels the context quickly, the select statement chooses `<-stream.Context().Done()` before `errc` receives the error from `recvLoop`, causing incomplete cleanup.

  2. EOF case: when `recvLoop()` returns nil (EOF), no signal is sent to `errc`, so the select hangs indefinitely.

The fix involves three key changes:

  1. Added synchronization channel to ensure recvLoop goroutine always completes
  2. Fixed EOF case by sending io.EOF to errc when recvLoop() returns nil
  3. Unified cleanup by always closing ctrlStream regardless of which select case triggers
```go
func (ws *watchServer) Watch(stream pb.Watch_WatchServer) (err error) {
	// ... existing code ...

	errc := make(chan error, 1)
	done := make(chan struct{}) // NEW: synchronization channel

	go func() {
		defer close(done) // NEW: signal completion
		if rerr := sws.recvLoop(); rerr != nil {
			// ... existing error handling ...
			errc <- rerr
		} else {
			// NEW: handle the EOF case to prevent a hanging select
			errc <- io.EOF
		}
	}()

	select {
	case err = <-errc:
		if errors.Is(err, context.Canceled) {
			err = rpctypes.ErrGRPCWatchCanceled
		}
	case <-stream.Context().Done():
		err = stream.Context().Err()
		if errors.Is(err, context.Canceled) {
			err = rpctypes.ErrGRPCWatchCanceled
		}
		// NEW: wait for recvLoop to complete to prevent resource leaks
		<-done
	}

	// NEW: always close ctrlStream for proper cleanup
	close(sws.ctrlStream)
	sws.close()

	return err
}
```

Would appreciate feedback on this approach.

Siomachkin avatar Sep 07 '25 21:09 Siomachkin

@Siomachkin thank you for your reply. We have thought about your "Fixed EOF case by sending io.EOF to errc when recvLoop() returns nil". We think it would be normal for a client to half-close the write side of the stream: if a client decides it no longer needs to send anything to the etcd server, it can close the write side but may still want to be notified when the keys it is watching change. So when the client half-closes, the etcd server gets an EOF, but the server should not close the stream.

One possible solution would be for the server to close the stream once it has no more active watchers. But one thing needs to be considered: the watch gRPC method is defined as a bidirectional streaming call, so I was wondering whether the client is allowed to send more than one watch request to the etcd server on a single stream. If it is, then the server should do nothing, and the client should be responsible for closing all the streams.

exceptionplayer avatar Sep 09 '25 05:09 exceptionplayer

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 09 '25 00:11 github-actions[bot]