java-slack-sdk icon indicating copy to clipboard operation
java-slack-sdk copied to clipboard

Socket-mode application become a zombie when unhandled exception occurs in single-threaded thread pool

Open ddovgal opened this issue 4 years ago • 11 comments

In our application, we use Bolt Socket-mode SDK for the bot. And we've found some strange behavior. Sometimes (looks like it doesn't depend on anything) app becomes a "zombie", meaning that it continues to work but the bot isn't in a working state. For example, let's say for the message with text ping it will respond with a pong message, and in that "non-working" state bot won't react to the ping message. But at the same time app isn't "dead" or crashed, we can successfully "stop" SocketModeApp later. According to our observations, this always occurs after javax.websocket.DeploymentException

(we use SocketModeClientTyrusImpl)

I suspect (but still can be wrong 😅) this is due to the facts that:

  • we have a single-threaded ThreadPool-s for com.slack.api.socket_mode.SocketModeClient#initializeSessionMonitorExecutor and com.slack.api.socket_mode.SocketModeClient#initializeMessageProcessorExecutor. Both create TP-s by Executors.newSingleThreadScheduledExecutor
  • here are two exceptions logs that show unhandled exception occurrences:
    • [2021-08-27T16:43:47,600Z](socket-mode-session-monitor-worker-36)([]) trace_id= ERROR - SocketModeClient - Failed to reconnect to Socket Mode server: javax.websocket.DeploymentException: Handshake error.
      ! org.glassfish.tyrus.core.HandshakeException: Response code was not 101: 408.
      ! at org.glassfish.tyrus.client.TyrusClientEngine.processResponse(TyrusClientEngine.java:299)
      ! at org.glassfish.tyrus.container.grizzly.client.GrizzlyClientFilter.handleHandshake(GrizzlyClientFilter.java:322)
      ! at org.glassfish.tyrus.container.grizzly.client.GrizzlyClientFilter.handleRead(GrizzlyClientFilter.java:291)
      ! at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
      ! at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:283)
      ! at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:200)
      ! at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:132)
      ! at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:111)
      ! at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77)
      ! at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:536)
      ! at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112)
      ! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:117)
      ! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:56)
      ! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:137)
      ! at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:591)
      ! at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:571)
      ! ... 1 common frames omitted
      ! Causing: javax.websocket.DeploymentException: Handshake error.
      ! at org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:658)
      ! at org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:696)
      ! at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      ! at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      ! at org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:849)
      ! at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
      ! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:493)
      ! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:355)
      ! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:170)
      ! ... 9 common frames omitted
      ! Causing: java.io.IOException: javax.websocket.DeploymentException: Handshake error.
      ! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:173)
      ! at com.slack.api.socket_mode.SocketModeClient.connectToNewEndpoint(SocketModeClient.java:67)
      ! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.maintainCurrentSession(SocketModeClientTyrusImpl.java:137)
      ! at com.slack.api.socket_mode.SocketModeClient.lambda$initializeSessionMonitorExecutor$1(SocketModeClient.java:180)
      ! at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      ! at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
      ! at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
      ! at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      ! at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      ! at java.base/java.lang.Thread.run(Thread.java:834)
      
    • [2021-08-27T17:06:35,818Z](socket-mode-message-processor-worker-37)([]) trace_id= ERROR - SocketModeClient - Failed to reconnect to Socket Mode server: javax.websocket.DeploymentException: Handshake response not received.
      ! javax.websocket.DeploymentException: Handshake response not received.
      ! at org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:678)
      ! at org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:696)
      ! at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      ! at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      ! at org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:849)
      ! at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
      ! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:493)
      ! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:355)
      ! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:170)
      ! ... 9 common frames omitted
      ! Causing: java.io.IOException: javax.websocket.DeploymentException: Handshake response not received.
      ! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:173)
      ! at com.slack.api.socket_mode.SocketModeClient.connectToNewEndpoint(SocketModeClient.java:67)
      ! at com.slack.api.socket_mode.SocketModeClient.processMessage(SocketModeClient.java:350)
      ! at com.slack.api.socket_mode.SocketModeClient.lambda$initializeMessageProcessorExecutor$0(SocketModeClient.java:157)
      ! at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      ! at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
      ! at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
      ! at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      ! at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      ! at java.base/java.lang.Thread.run(Thread.java:834)
      
  • after an unhandled exception occurred it kills one thread in TP and we lose one important functionality
    • recreate a new session for the first presented log case
    • consume\process messages for the second presented log case

The Slack SDK version

[INFO] +- com.slack.api:bolt:jar:1.8.1:compile
[INFO] |  +- com.slack.api:slack-api-model:jar:1.8.1:compile
[INFO] |  +- com.slack.api:slack-api-client:jar:1.8.1:compile
[INFO] |  \- com.slack.api:slack-app-backend:jar:1.8.1:compile
[INFO] +- com.slack.api:bolt-socket-mode:jar:1.8.1:compile
[INFO] |  +- com.slack.api:bolt:jar:1.8.1:runtime
[INFO] |  |  +- com.slack.api:slack-api-model:jar:1.8.1:runtime
[INFO] |  |  +- com.slack.api:slack-api-client:jar:1.8.1:runtime
[INFO] |  |  \- com.slack.api:slack-app-backend:jar:1.8.1:runtime
[INFO] |  +- com.slack.api:bolt-socket-mode:jar:1.8.1:runtime
[INFO]    +- com.slack.api:bolt:jar:1.8.1:compile
[INFO]    |  +- com.slack.api:slack-api-model:jar:1.8.1:compile
[INFO]    |  +- com.slack.api:slack-api-client:jar:1.8.1:compile
[INFO]    |  \- com.slack.api:slack-app-backend:jar:1.8.1:compile
[INFO]    +- com.slack.api:bolt-socket-mode:jar:1.8.1:compile
[INFO] |  |  +- com.slack.api:bolt:jar:1.8.1:runtime
[INFO] |  |  |  +- com.slack.api:slack-api-model:jar:1.8.1:runtime
[INFO] |  |  |  +- com.slack.api:slack-api-client:jar:1.8.1:runtime
[INFO] |  |  |  \- com.slack.api:slack-app-backend:jar:1.8.1:runtime
[INFO] |  |  +- com.slack.api:bolt-socket-mode:jar:1.8.1:runtime

OS info

Inside the openjdk11:jdk-11.0.7_10-alpine-slim

Steps to reproduce:

  • can't say for sure. It just happens after javax.websocket.DeploymentExceptions. Please see my assumptions and presented logs.

Expected result:

The app is "usable", able to react to events in a common way after occurred javax.websocket.DeploymentExceptions

Actual result:

The app is "unusable". Like a zombie, unable to react to events in a common way after occurred javax.websocket.DeploymentExceptions, but isn't totally crashed.

ddovgal avatar Sep 01 '21 11:09 ddovgal

In addition to the second presented log (message-processor-worker thread):

  • before presented exception log, ~20 seconds before that there was another exception, that I think is related to message-processor-worker thread failure:
[2021-08-27T17:06:13,929Z](socket-mode-session-cleaner-worker-800)([]) trace_id= ERROR - SocketModeClient - Failed to reconnect to Socket Mode server: javax.websocket.DeploymentException: Handshake error.
! org.glassfish.tyrus.core.HandshakeException: Response code was not 101: 408.
! at org.glassfish.tyrus.client.TyrusClientEngine.processResponse(TyrusClientEngine.java:299)
! at org.glassfish.tyrus.container.grizzly.client.GrizzlyClientFilter.handleHandshake(GrizzlyClientFilter.java:322)
! at org.glassfish.tyrus.container.grizzly.client.GrizzlyClientFilter.handleRead(GrizzlyClientFilter.java:291)
! at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
! at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:283)
! at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:200)
! at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:132)
! at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:111)
! at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77)
! at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:536)
! at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112)
! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:117)
! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:56)
! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:137)
! at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:591)
! at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:571)
! ... 1 common frames omitted
! Causing: javax.websocket.DeploymentException: Handshake error.
! at org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:658)
! at org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:696)
! at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
! at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
! at org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:849)
! at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:493)
! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:355)
! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:170)
! ... 22 common frames omitted
! Causing: java.io.IOException: javax.websocket.DeploymentException: Handshake error.
! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:173)
! at com.slack.api.socket_mode.SocketModeClient.connectToNewEndpoint(SocketModeClient.java:67)
! at com.slack.api.socket_mode.SocketModeClient.runCloseListenersAndAutoReconnectAsNecessary(SocketModeClient.java:257)
! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.onClose(SocketModeClientTyrusImpl.java:242)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:566)
! at org.glassfish.tyrus.core.AnnotatedEndpoint.callMethod(AnnotatedEndpoint.java:552)
! at org.glassfish.tyrus.core.AnnotatedEndpoint.onClose(AnnotatedEndpoint.java:567)
! at org.glassfish.tyrus.core.AnnotatedEndpoint.onClose(AnnotatedEndpoint.java:576)
! at org.glassfish.tyrus.core.TyrusEndpointWrapper.onClose(TyrusEndpointWrapper.java:1235)
! at org.glassfish.tyrus.core.TyrusWebSocket.onClose(TyrusWebSocket.java:110)
! at org.glassfish.tyrus.core.ProtocolHandler.close(ProtocolHandler.java:481)
! at org.glassfish.tyrus.core.TyrusWebSocket.close(TyrusWebSocket.java:244)
! at org.glassfish.tyrus.core.TyrusWebSocket.close(TyrusWebSocket.java:254)
! at org.glassfish.tyrus.core.TyrusRemoteEndpoint.close(TyrusRemoteEndpoint.java:480)
! at org.glassfish.tyrus.core.TyrusSession.close(TyrusSession.java:210)
! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.closeSession(SocketModeClientTyrusImpl.java:296)
! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.lambda$setCurrentSession$1(SocketModeClientTyrusImpl.java:278)
! at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
! at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
! at java.base/java.lang.Thread.run(Thread.java:834)

ddovgal avatar Sep 01 '21 12:09 ddovgal

Hello @ddovgal, thanks for writing this in. This is an interesting scenario. I will summarize my observations and perhaps that can help in identifying a) what is happening, b) what, if anything, we can do to work around this issue in the short term and c) how to mitigate the problem in the long term. Identifying what is happening is key to address the issue properly.

In all three scenarios / stack traces you posted, we think the exception arises from this java-slack-sdk code area.

Quick clarifying question: is your third log case scenario presented in your last comment in combination with the second log case scenario presented in your original post? I want to make sure I understand the timeline of the stack traces properly, and it seems if the third log case is what precedes / causes the second log case, because the stack traces for the third and first are identical, it may be the same single underlying cause for the issue.

As you mentioned in the first log case, based on the end of the stack trace, it seems a reconnection is in process (based on the execution of maintainCurrentSession). BTW, do you see any log outputs matching the logging from maintainCurrentSession? During the reconnection process in this first log case, it seems like the underlying Tyrus implementation expects an HTTP 101 response code as part of the WebSocket protocol upgrade negotiation, but in this scenario, the Slack API (I believe in this scenario the java-slack-sdk is attempting to connect to the response returned by the Slack apps.connections.open API) returns an HTTP 408 Request Timed Out. I wonder why this may be happening?

In general, I am curious as to your reasoning on why you think this has to do with threading? May you elaborate more on this point?

As for a workaround or mitigation, it seems like Tyrus allows for the definition of retry behaviour, and allows for the use of our own reconnect handler. Perhaps that is one avenue we can explore for this scenario?

filmaj avatar Sep 01 '21 16:09 filmaj

@ddovgal Thanks for sharing the details (really helpful!)

but in this scenario, the Slack API (I believe in this scenario the java-slack-sdk is attempting to connect to the response returned by the Slack apps.connections.open API) returns an HTTP 408 Request Timed Out. I wonder why this may be happening?

I'm also curious about this part. As far as I know, we may sometimes see "HTTP 408 Request Timed Out" when the app is behind a proxy/corporate firewall.

Even if an app is not behind a proxy, some connectivity issue can arise in between the Slack server-side and your app hosting infra. According to the information you've shared in the description, the app is running in a Docker container. Thus, another possibility is that some connectivity issue between the app container and outside might be happening.

As @filmaj suggested, Tyrus's retry option may help. With that being said, I'm wondering why the reconnection stops working in this scenario. Once we figure out the room for improvement on this SDK side, we are happy to quickly release the fix for it. Your continued help for the investigation would be greatly appreciated! 🙇

seratch avatar Sep 02 '21 03:09 seratch

Sorry for such a delay in response. I'm currently in a vacation, so will be able to give more details on this case only on next week.

Hope It will occurs again in this week, so I'll be able to gather more information from logs.

ddovgal avatar Sep 08 '21 09:09 ddovgal

Hi! I have similar issue with Socket Mode app written in kotlin on top of Spring Framwork. It occurs rather often right from very beginning of socket mode introduction. Collected logs were not helpful, so now my app is running in trace level for "org.glassfish.tyrus." and in debug for "com.slack.api.socket_mode." Hopefully it will reveal some new details.

Is there any way to intercept event of such connection drop? Such event could be used to mark app as unhealthy.

Havelock-Vetinari avatar Nov 25 '21 16:11 Havelock-Vetinari

Hi @Havelock-Vetinari,

ERROR - SocketModeClient - Failed to reconnect to Socket Mode server: javax.websocket.DeploymentException: Handshake error. ! org.glassfish.tyrus.core.HandshakeException: Response code was not 101: 408.

Are you seeing this error log too? If so, the handshake error should be happening inside the Tyrus library. As I mentioned in my last comment, this can be a kind of connectivity issue between your app and the Slack server-side.

Is there any way to intercept event of such connection drop? Such event could be used to mark app as unhealthy.

If the close listeners do not work for you, there is nothing else as far as I know.

I've never reproduced the situation described here on my end. My understanding of the issue described here is still that, due to the inactive WebSocket connection, the SocketModeApp is no longer able to receive any data from Slack. If we can make the app reconnect to the Slack server-side in some way, the app should work again even in this scenario (because other parts like thread pool for message handling are still working -- awaiting for incoming messages).

As @filmaj suggested, Tyrus's retry option may help.

If Tyrus's retry option is useful for you, we may consider enabling the retry option by default. Can anyone try this option?

seratch avatar Nov 25 '21 21:11 seratch

Hi @seratch!

Are you seeing this error log too? If so, the handshake error should be happening inside the Tyrus library. As I mentioned in my last comment, this can be a kind of connectivity issue between your app and the Slack server-side.

My application was running on log level=DEBUG for com.slack.api.socket_mode.SocketModeClient and there was simply silence in logs when issue occurred. Even entries about session maintenance were gone.

Now I have enabled more detailed logs for Tyrus so I'm waiting for next occurrence of this issue.

I will also check close listener and retry option in Tyrus.

Havelock-Vetinari avatar Nov 26 '21 08:11 Havelock-Vetinari

👋 It looks like this issue has been open for 30 days with no activity. We'll mark this as stale for now, and wait 10 days for an update or for further comment before closing this issue out. If you think this issue needs to be prioritized, please comment to get the thread going again! Maintainers also review issues marked as stale on a regular basis and comment or adjust status if the issue needs to be reprioritized.

github-actions[bot] avatar Jan 10 '22 00:01 github-actions[bot]

Hi! Issue with loosing socket mode connection still exists. Enabling TRACE on Tyrus and SDK packages didn't help. Last message in logs was single "ping" from my application without "pong" received and without any new attempt maintain websocket session.

There is similar issue in python Slack SDK: https://github.com/slackapi/python-slack-sdk/issues/1110

Havelock-Vetinari avatar Jan 11 '22 08:01 Havelock-Vetinari

Hey @Havelock-Vetinari,

Thanks for the response! I've updated this issue to be ignored by our auto-triage bot.

Aside from enabling more logging, did you have an opportunity to try:

  1. Using the close listener?
  2. Tyrus retry option?

Since we’ve been unable to replicate this on our end, it's challenging to resolve the problem. This certainly looks like a legitimate issue and I believe @seratch is eager to mitigate the issue - so he may follow-up with more questions. In the meantime, any additional information that you have provide would be very helpful!

mwbrooks avatar Jan 11 '22 23:01 mwbrooks

I faced a similar issue and upgrading tyrus and bolt did not help. When i inspect the thread dump, i realized a bug on JDK SSLEngine causes this behaviour. Upgrading openjdk 11.02 to 17 .02 immediately fixed the problem.

see https://bugs.openjdk.org/browse/JDK-8241054

icelik avatar Aug 17 '22 10:08 icelik