tyrus
Unexplainable WebSocket error
The server works perfectly fine, until at a random moment, it starts throwing hundreds of these per second:
Aug 09, 2016 10:54:08 AM org.glassfish.tyrus.container.grizzly.client.GrizzlyWriter$WriterCondition$1 onError
WARNING: Connection closed
java.io.IOException: Connection closed
    at org.glassfish.grizzly.asyncqueue.TaskQueue.onClose(TaskQueue.java:331)
    at org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter.onClose(AbstractNIOAsyncQueueWriter.java:501)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.closeConnection(TCPNIOTransport.java:402)
    at org.glassfish.grizzly.nio.NIOConnection.doClose(NIOConnection.java:647)
    at org.glassfish.grizzly.nio.NIOConnection$6.run(NIOConnection.java:613)
    at org.glassfish.grizzly.nio.DefaultSelectorHandler$RunnableTask.run(DefaultSelectorHandler.java:495)
    at org.glassfish.grizzly.nio.DefaultSelectorHandler.processPendingTaskQueue(DefaultSelectorHandler.java:301)
    at org.glassfish.grizzly.nio.DefaultSelectorHandler.processPendingTasks(DefaultSelectorHandler.java:290)
    at org.glassfish.grizzly.nio.DefaultSelectorHandler.preSelect(DefaultSelectorHandler.java:101)
    at org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:335)
    at org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:279)
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:591)
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:571)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.read(TCPNIOTransport.java:597)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransportFilter.handleRead(TCPNIOTransportFilter.java:75)
    at org.glassfish.grizzly.filterchain.TransportFilter.handleRead(TransportFilter.java:173)
    at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:284)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:201)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:133)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:112)
    at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:526)
    at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:117)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:56)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:137)
    ... 3 more
The previous seconds were just normal:
Environment
RAM 8GB Debian 7.5
Affected Versions
[1.12, 1.13]
- Issue Imported From: https://github.com/tyrus-project/tyrus/issues/606
- Original Issue Raised By: @glassfishrobot
- Original Issue Assigned To: @PetrJanouch
@glassfishrobot Commented Reported by soylomass
@glassfishrobot Commented soylomass said: I pressed submit by error, continue here:
The previous seconds were just normal:
ADD MASS 0.5
ADD XP 1.0
10:54:07 SERVER ID #0 COUNTS BOX 3000 TANKS 15 PERSONS 1 SOCKETS 21 OPEN SOCKETS 21 ENDCOUNTS
ADD MASS 0.5
10:54:07 SERVER ID #1 COUNTS BOX 3000 TANKS 10 PERSONS 0 SOCKETS 12 OPEN SOCKETS 12 ENDCOUNTS
ADD XP 1.0
ADD XP 1.0
REMOVE HEALTH
ADD XP 1.0
Close connection for client:
{9de6dad0-36dc-4c4a-8a35-9aa7ac2cda7e}
ON CLOSE
ADD MASS 0.5
ADD XP 1.0
Aug 09, 2016 10:54:08 AM org.glassfish.tyrus.container.grizzly.client.GrizzlyWriter$WriterCondition$1 onError
WARNING: Connection closed
java.io.IOException: Connection closed
    at org.glassfish.grizzly.asyncqueue.TaskQueue.onClose(TaskQueue.java:331)
    at org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter.onClose(AbstractNIOAsyncQueueWriter.java:501)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.closeConnection(TCPNIOTransport.java:402)
    at org.glassfish.grizzly.nio.NIOConnection.doClose(NIOConnection.java:647)
    at org.glassfish.grizzly.nio.NIOConnection$6.run(NIOConnection.java:613)
    at org.glassfish.grizzly.nio.DefaultSelectorHandler$RunnableTask.run(DefaultSelectorHandler.java:495)
And then, 896 errors were thrown in just one second, and it continued like this for minutes.
This usually happens without consequences, as the connected users can still send and receive data, but sometimes it causes the server to stop communicating.
I am sending the data using getAsyncRemote(), and I usually send it from separate threads too. (I have a pool of CPU cores * 2 [8] threads that I use to handle incoming messages and to prepare the outgoing game data.)
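The setup described above (a fixed pool of CPU cores * 2 threads that prepares game data and hands it to getAsyncRemote()) can be sketched with plain JDK concurrency primitives. This is an illustrative reconstruction, not the poster's actual code; the Session and payload handling are placeholders:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GameSendPool {
    // Pool sized at CPU cores * 2, as described in the report.
    static final int POOL_SIZE = Runtime.getRuntime().availableProcessors() * 2;
    static final ExecutorService POOL = Executors.newFixedThreadPool(POOL_SIZE);

    // In the real application this task would end with
    // session.getAsyncRemote().sendText(payload), which only queues the
    // message in the container's write queue and returns immediately.
    static void sendAsync(String payload) {
        POOL.submit(() -> {
            // prepare the outgoing game state, then hand it to Tyrus here
        });
    }

    public static void main(String[] args) {
        System.out.println("pool size = " + POOL_SIZE);
        sendAsync("demo payload");
        POOL.shutdown();
    }
}
```

Because the async remote only enqueues the write, these worker threads never block on the network; the actual socket I/O happens on Grizzly's own threads, which is relevant to the discussion below.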
@glassfishrobot Commented @pavelbucek said: Can you share a reproducer?
@glassfishrobot Commented soylomass said: I wish I could. It happens randomly.
I've changed the async sends to basic (getBasicRemote()) sends to see if it stops happening. I'll report back.
@glassfishrobot Commented soylomass said: Basic, of course, performed even worse, as it isn't efficient for handling a lot of writes.
Pavel, do you have any idea what could be causing hundreds of these exceptions per second at random times? Is there anything I could try? My project is stuck until I fix this, and I've tried everything.
@glassfishrobot Commented @pavelbucek said: You could test with different settings related to selector and worker thread pools, see https://tyrus.java.net/apidocs/1.13/org/glassfish/tyrus/container/grizzly/server/GrizzlyServerContainer.html
But I'm not sure whether that will help. The problem seems to be dropped (closed) TCP connections, and the exceptions shown seem to be only a consequence, not the cause. Are you monitoring the number of open files/sockets on your server? Isn't it close to the limits? (See ulimit -a on *nix-based systems, plus sysctl -a, etc.)
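Besides checking ulimit -a from a shell, the descriptor usage can be watched from inside the JVM. This sketch uses the JDK-specific (non-portable) com.sun.management.UnixOperatingSystemMXBean, which is an assumption of a HotSpot-style JVM on a Unix system and is not part of Tyrus; on other platforms it simply reports the counters as unavailable:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdMonitor {
    // Returns {openDescriptors, maxDescriptors}, or null when the
    // JDK-specific Unix MXBean is not available on this JVM.
    static long[] fdUsage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] { unix.getOpenFileDescriptorCount(),
                                unix.getMaxFileDescriptorCount() };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] fd = fdUsage();
        if (fd != null) {
            System.out.println("open fds: " + fd[0] + " / limit: " + fd[1]);
        } else {
            System.out.println("fd counters unavailable on this JVM");
        }
    }
}
```

Logging this periodically would show whether the server is creeping toward the descriptor limit before the error storms start.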
@glassfishrobot Commented soylomass said: Is there any formula for the recommended number of selector/worker threads? The software is an MMO game, sending to and receiving from every client more than 10 times per second.
By the way, I've set a timeout in the threads that prepare the data for clients and then send it with getAsyncRemote().send, and while the exceptions keep occurring, the timeout stops them and saves the server from crashing. It's strange, as the exception should be occurring in the Grizzly thread, not in mine (which I'm terminating).
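The timeout guard described above can be sketched with only the JDK. Here a Future from an ExecutorService stands in for the pending write (in the real code the Future would come from getAsyncRemote().sendText(...)); get(timeout) bounds how long the preparing thread waits, and on timeout the write is abandoned instead of blocking the thread indefinitely:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedSend {
    // Returns true if the (simulated) write completed within timeoutMs,
    // false if it timed out or failed.
    static boolean sendWithTimeout(ExecutorService pool, Runnable send,
                                   long timeoutMs) {
        Future<?> f = pool.submit(send);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;                 // write completed in time
        } catch (TimeoutException e) {
            f.cancel(true);              // abandon this write, keep the thread alive
            return false;
        } catch (InterruptedException | ExecutionException e) {
            return false;                // write failed, e.g. connection closed
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        System.out.println("completed: " + sendWithTimeout(pool, () -> { }, 500));
        pool.shutdown();
    }
}
```

Note the trade-off: waiting on the Future at all makes the send partially synchronous, which is exactly the concern raised later in the thread.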
@glassfishrobot Commented @PetrJanouch said: Hi,
my humble opinion is that this has nothing to do with threads. Grizzly has a write queue whose content it feeds to a socket when the socket is ready for writing. The logged error means that Grizzly was not able to send the queued data because the peer (the client, a load balancer, or a proxy if there is one) had closed the TCP connection (look at the java.io.EOFException in the cause).
So I think you are looking at the wrong side of the connection, and the key to solving this is finding out why the connection was closed by the other side. By the way, by "closing the connection" I mean that it was closed at the TCP level, not gracefully via the WebSocket closing handshake.
The only thing that puzzles me is that you say the server becomes unresponsive after this happens. The only reason Grizzly can become unresponsive is that all its worker threads are blocked somewhere (the exception you posted above wouldn't cause a Grizzly thread to block). When the server becomes unresponsive, take a thread dump and see what the Grizzly worker threads are doing, or post it here if you cannot figure anything out.
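A thread dump can be taken with jstack <pid> from the command line, or programmatically via the standard ThreadMXBean as in this sketch; either way, the thing to look for is Grizzly worker threads stuck in BLOCKED or WAITING states:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpThreads {
    // Captures a snapshot of all live threads, including which monitors
    // and synchronizers each one holds or waits on.
    static ThreadInfo[] dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        return bean.dumpAllThreads(true, true);
    }

    public static void main(String[] args) {
        for (ThreadInfo t : dump()) {
            // In the scenario above, filter for names containing "Grizzly"
            // and check whether they are all BLOCKED or WAITING.
            System.out.println(t.getThreadName() + " -> " + t.getThreadState());
        }
    }
}
```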
@glassfishrobot Commented soylomass said: What puzzles me is why the exception, once it happens, repeats itself hundreds of times in a second, if I don't make that many writes to a single client.
The exception doesn't make the server crash anymore now that I've implemented a timeout in the threads where the getAsyncRemote().send calls occur, but the exceptions keep happening randomly.
@glassfishrobot Commented @PetrJanouch said: That is strange. The exception is logged once per closed connection, and it will also be returned by the Future/javax.websocket.SendResult of each uncompleted asynchronously written message.
Are you sure the hundreds of logged messages belong to one connection and many connections are not closed at the same time for whatever reason?
@glassfishrobot Commented soylomass said: I would like to check that. I should use Future.get() to find out, right? In that case, wouldn't the operation perform similarly to a synchronous write, since it would wait for the write operation to complete?
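For reference, blocking on Future.get() would indeed make the send effectively synchronous. The javax.websocket API also offers a non-blocking alternative: RemoteEndpoint.Async.sendText(String, SendHandler), where the SendHandler receives a SendResult reporting success or the failure cause without any waiting. The sketch below mimics that callback style using only the JDK, with a CompletableFuture standing in for the pending write:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

public class CallbackSend {
    // Attaches a completion callback to the pending write and reports its
    // outcome, mirroring what SendHandler.onResult(SendResult) receives.
    static String awaitResult(CompletableFuture<Void> pendingWrite)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        StringBuilder outcome = new StringBuilder();
        pendingWrite.whenComplete((ignored, err) -> {
            // err is non-null on failure, e.g. a closed connection
            outcome.append(err == null ? "ok" : "failed");
            done.countDown();
        });
        done.await(); // only so the demo can print; real code would not block
        return outcome.toString();
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> write = CompletableFuture.runAsync(() -> { });
        System.out.println("write " + awaitResult(write));
    }
}
```

With the callback style the sending thread is never parked, so per-connection failures can be counted without degrading throughput.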
@glassfishrobot Commented soylomass said: EDIT: I've checked the logs once more, and the exceptions always start after a user disconnects (as caught by the onClose callback):
Close connection for client:
{80c02da2-e649-4edf-9a1e-6fa70591aef0}
ON CLOSE
Aug 14, 2016 3:51:19 PM org.glassfish.tyrus.container.grizzly.client.GrizzlyWriter$WriterCondition$1 onError
WARNING: Connection closed
So I don't think it has to do with many connections being closed at the same time. But it's still strange that the exception happens more times than the number of writes made to that user.
@glassfishrobot Commented @PetrJanouch said: I have just figured out what happened. The message can indeed be logged an enormous number of times per closed connection. I will fix it tomorrow morning CET, and a snapshot with the fix should also be available tomorrow.
Thanks for reporting this. I will keep you updated about the fix.
@glassfishrobot Commented soylomass said: Nice! I'll look forward to it. Thanks a lot for the quick responses.
@glassfishrobot Commented @PetrJanouch said: OK. Fixed. You can try 1.14-SNAPSHOT or 2.0-SNAPSHOT from java.net maven repo https://maven.java.net/content/repositories/snapshots/org/glassfish/tyrus/ Please, let us know how it worked for you.
@glassfishrobot Commented soylomass said: I've updated to the 1.14 snapshot and will report the results. What's the difference in 2.0?
@glassfishrobot Commented @pavelbucek said: 2.0 is Java 8, 1.14 is Java 7.
But in practice, they are equal - same features, same code. Please use 1.14, since we are not yet releasing 2.x and it's not clear when that will happen.
@glassfishrobot Commented soylomass said: Ok.
Can confirm the 1.14 snapshot fixes the problem (the exception now appears only once). Thanks to both of you for the quick help and support.
@glassfishrobot Commented This issue was imported from java.net JIRA TYRUS-429