goworker icon indicating copy to clipboard operation
goworker copied to clipboard

Goworker does not recover from loss of connection to redis

Open joamaki opened this issue 10 years ago • 3 comments

If the connection to redis goes down (due to redis restart, netsplit, ...) the library does not recover, but only logs the error:

2014-03-10 19:07:23,682 DEBG 'my-worker' stdout output:
1394474843680784649 [Error] Error on my-worker-1:14479-poller:my_queue getting job from [my_queue]: use of closed network connection

Goworker should recognize the redis client instance is dead and try reconnecting.

joamaki avatar Mar 10 '14 18:03 joamaki

I modified GetConn to check connection status and perform redial. Here's the changes:

 func GetConn() (*RedisConn, error) {
    resource, err := pool.Get()

    if err != nil {
        return nil, err
    }
+
+   // Performs simple connection check. Redial if needed
+   _, err = resource.(*RedisConn).Do("PING")
+   if err != nil {
+       resource.(*RedisConn).Close()
+       resource, err = redisConnFromUri(uri)
+       if err != nil {
+           return nil, err
+       }
+   }
+
    return resource.(*RedisConn), nil
 }

Not sure if this is recommended way when working with vitess' pools.

I can make a PR if this changes are OK :)

subosito avatar Sep 15 '14 05:09 subosito

I know this is an old thread but maybe this will help someone like me who found it when trying to handle disconnects. This solution does not work because the connection does not get put back in the pool. When the pool tries to close it hangs on draining the chan that contains the connections in the pool. I opted to handle the error outside of goworker. It required a patch to get it to actually throw an error though. See my PR here #60

jpatters avatar Nov 04 '17 02:11 jpatters

The main issue is that on the poller.poll it would only return an error if it happens before the goroutine, once the goroutine is lunched Work will always return nil and just finish.

The only way I can think of having an error be returned as everything is running in goroutines and the main Work is blocked with ah sync.WaitGroup would be to have a make(chan error, 0) be part of the internal struct on the poller.newPoller so we can "just" <- poller.err and have the WaitGroup be also in the select (with a goroutine and a channel that i'll be closed once it's done).

  	waitCh := make(chan struct{})
	go func() {
		monitor.Wait()
		close(waitCh)
	}()
	select {
	case <-waitCh:
	case err := <-poller.error:
		// quite to gracefully shut down
		quit <- true
		// Return the poller error
		return err
	}

All the possible errors from the workers are also related to Redis and and they'll be closed using the quite (well they'll fail as there will be no connection to Redis but that's expected hehe).

Would this be ok to open a PR for?

xescugc avatar Feb 03 '21 12:02 xescugc