Honor deadman timer, fixes #1093
NOTE: The original description of this PR is outdated, but left here for reference. See the discussion for how the PR was updated.
- [X] Run `make pr-prep` from the root of the repository to run formatting, linting and tests.
This patch modifies the main run loop for the socket mode connection. Based on reading the comments, it appears the intention was for `run()` to return `context.Cancelled` in the event of a cancelled context, or `nil` in the event of a different error. This is based on the comment:

```go
// wg.Wait() finishes only after any of the above go routines finishes.
```

... but based on the use of the wait group, it appears that ALL of the goroutines would have to finish, causing the deadman timer to be ignored. This patch removes the wait group (and `wg.Wait()`) and instead selects on either 1) context cancelled, or 2) timer expired.
I believe this preserves the original intent of the code - but I could be wrong! In any case, this patch fixes #1093 for me, and my robot still runs correctly AFAICS. Pinging @xNok (the original PR author) to have a look.
I didn't write that part, but the wait group is supposed to be the right approach.
Yes, ALL of the goroutines would have to finish, but this is normal because you want a graceful shutdown. When you want to terminate the run, you need to cancel the context, and then each goroutine should stop gracefully.
Ideally, when the deadman triggers, you should cancel the context and wait for all goroutines to terminate, then reconnect.
As to why it doesn't work properly, I don't know yet :disappointed:
> Yes, ALL of the goroutines would have to finish, but this is normal because you want a graceful shutdown. When you want to terminate the run, you need to cancel the context, and then each goroutine should stop gracefully.
>
> Ideally, when the deadman triggers, you should cancel the context and wait for all goroutines to terminate, then reconnect.
Ah, ok, thanks for the - hah! - context. Looking at the code, I see what you mean - so, the root of the problem is that one of the items being waited on is `ReadJSON`, which doesn't take a context. I'm going to mark the PR as draft and work on adding back the waitGroup and trying a different approach.
Ok @xNok and @mumoshu - this much smaller PR fixes things for me, and may be "good enough". In short, the `runMessageReceiver` thread eventually blocks in `ReadJSON` on the websocket, which doesn't take a context and so doesn't exit gracefully on cancel. Now, all my PR does is remove that thread from the waitGroup and update some comments. How does this look?
@parsley42 Hey! Thanks for the fix. I was rereading the code I've written but missed one thing - it would be great if you could clarify it if possible. So, in L145-147 we have:

```go
case <-ctx.Done():
	// Detect when the connection is dead and try close connection.
	if err = conn.Close(); err != nil {
```

which is running in another goroutine. I think it is supposed to chime in to close the connection on a context cancellation, which in turn fails the `ReadJSON` attempt on the reader (=`websocket.Conn`) asap.
Is it that the connection close doesn't result in immediately failing `ReadJSON`, hence your fix is needed?
> Is it that the connection close doesn't result in immediately failing `ReadJSON`, hence your fix is needed?
@mumoshu Thanks for having a look! Yes, that's my assessment of the issue. I'll walk you through my analysis.
The core robot code (github.com/lnxjedi/gopherbot) has been running very stably with RTM for many years, but a couple of weeks after migrating to socket mode, one of our robots stopped responding to commands, and in the slack app I got a `dispatch_failed` error when I used the robot's slash command - this led me to suspect the connection had gone bad. Since people were trying to get work done, I didn't take further time to troubleshoot (a netstat would have been good at this point); I just restarted the robot service. When I restarted the robot process, all was well again.

This led me to look at the code for a ping timeout, since I knew slack was sending a ping at regular intervals. To artificially suppress the ping arriving from slack, I used the local firewall to block inbound traffic on the port, then watched the slack logs. It took several minutes for the connection to reset. I repeated the experiment, then sent my `SIGUSR1` (forces an abort/stack dump) during the long hang, and found a thread blocking in `ReadJSON`. Now seeing your comments, I recognize that a full block like this would prevent a graceful tcp connection shutdown. I'm trying to think of a way to better test. Of course, if the vm/container/whatever on the Slack end just goes away, the result would be the same. It would probably be good to reconnect quickly on ping timeout, even in the event that the original connection can't be closed gracefully.
@parsley42 Thanks for your confirmation!
I read your experience and analysis and went back to the gorilla/websocket code. I now think that `conn.Close` is no more than just closing the TCP connection, without any coordination with the other side of the connection.
https://github.com/gorilla/websocket/blob/af47554f343b4675b30172ac301638d350db34a5/conn.go#L342-L346
`ReadJSON` may fail immediately only if the other side of the connection (=the websocket server) sends us a WebSocket "Close" message, which doesn't happen in this scenario.
This is a gorilla/websocket issue that describes a related, though not identical, problem: https://github.com/gorilla/websocket/issues/474
If we were to gracefully stop the bot while Slack's websocket server vm/container/whatever and the TCP connection in between are still alive, we might be able to start a normal close procedure by, instead of just `Close`-ing it, writing a control message like `c.WriteControl(websocket.CloseMessage, websocket.FormatCloseMessage(websocket.CloseNormalClosure, ""))`, so that the server will hopefully send us a close message, causing the pending read (and its consumer `ReadJSON`) to fail immediately.
https://github.com/gorilla/websocket/blob/af47554f343b4675b30172ac301638d350db34a5/conn.go#L411-L413
But -

> if the vm/container/whatever on the Slack end just goes away, the result would be the same

This is really a good point. If the Slack end disappeared without notifying us, this normal close procedure will end up with a read/connection timeout. We need to handle the timeout in a graceful stop scenario anyway.
That said, I believe your fix is perfectly valid and also the best way to handle the timeout issue!
@kanata2 Can you have a look at this and let me know if you have any questions?
Sorry for the slow response. I'll confirm this weekend.