tyrus
Unexplainable WebSocket error
The server works perfectly fine, until at a random moment, it starts throwing hundreds of these per second:
Aug 09, 2016 10:54:08 AM org.glassfish.tyrus.container.grizzly.client.GrizzlyWriter$WriterCondition$1 onError
WARNING: Connection closed
java.io.IOException: Connection closed
    at org.glassfish.grizzly.asyncqueue.TaskQueue.onClose(TaskQueue.java:331)
    at org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter.onClose(AbstractNIOAsyncQueueWriter.java:501)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.closeConnection(TCPNIOTransport.java:402)
    at org.glassfish.grizzly.nio.NIOConnection.doClose(NIOConnection.java:647)
    at org.glassfish.grizzly.nio.NIOConnection$6.run(NIOConnection.java:613)
    at org.glassfish.grizzly.nio.DefaultSelectorHandler$RunnableTask.run(DefaultSelectorHandler.java:495)
    at org.glassfish.grizzly.nio.DefaultSelectorHandler.processPendingTaskQueue(DefaultSelectorHandler.java:301)
    at org.glassfish.grizzly.nio.DefaultSelectorHandler.processPendingTasks(DefaultSelectorHandler.java:290)
    at org.glassfish.grizzly.nio.DefaultSelectorHandler.preSelect(DefaultSelectorHandler.java:101)
    at org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:335)
    at org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:279)
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:591)
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:571)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.read(TCPNIOTransport.java:597)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransportFilter.handleRead(TCPNIOTransportFilter.java:75)
    at org.glassfish.grizzly.filterchain.TransportFilter.handleRead(TransportFilter.java:173)
    at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:284)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:201)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:133)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:112)
    at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:526)
    at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:117)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:56)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:137)
    ... 3 more
The previous seconds were just normal:
Environment
RAM 8GB Debian 7.5
Affected Versions
[1.12, 1.13]
- Issue Imported From: https://github.com/tyrus-project/tyrus/issues/606
- Original Issue Raised By: @glassfishrobot
- Original Issue Assigned To: @PetrJanouch
@glassfishrobot Commented Reported by soylomass
@glassfishrobot Commented soylomass said: I pressed submit by error, continue here:
The previous seconds were just normal:
ADD MASS 0.5
ADD XP 1.0
10:54:07 SERVER ID #0 COUNTS BOX 3000 TANKS 15 PERSONS 1 SOCKETS 21 OPEN SOCKETS 21 ENDCOUNTS
ADD MASS 0.5
10:54:07 SERVER ID #1 COUNTS BOX 3000 TANKS 10 PERSONS 0 SOCKETS 12 OPEN SOCKETS 12 ENDCOUNTS
ADD XP 1.0
ADD XP 1.0
REMOVE HEALTH
ADD XP 1.0
Close connection for client:
{9de6dad0-36dc-4c4a-8a35-9aa7ac2cda7e}
ON CLOSE
ADD MASS 0.5
ADD XP 1.0
Aug 09, 2016 10:54:08 AM org.glassfish.tyrus.container.grizzly.client.GrizzlyWriter$WriterCondition$1 onError
WARNING: Connection closed
java.io.IOException: Connection closed
    at org.glassfish.grizzly.asyncqueue.TaskQueue.onClose(TaskQueue.java:331)
    at org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter.onClose(AbstractNIOAsyncQueueWriter.java:501)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.closeConnection(TCPNIOTransport.java:402)
    at org.glassfish.grizzly.nio.NIOConnection.doClose(NIOConnection.java:647)
    at org.glassfish.grizzly.nio.NIOConnection$6.run(NIOConnection.java:613)
    at org.glassfish.grizzly.nio.DefaultSelectorHandler$RunnableTask.run(DefaultSelectorHandler.java:495)
And then, 896 errors were thrown in just one second, and it continued like this for minutes.
This usually happens without consequences, as the connected users can still send and receive data, but sometimes it causes the server to stop communicating.
I am sending the data using getAsyncRemote(), and I usually send it from separate threads too. (I have a pool of CPU cores * 2 [8] threads that I use to handle incoming messages and to prepare the outgoing game data.)
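The setup described above (a fixed pool of CPU cores * 2 threads that prepares game data and hands it to getAsyncRemote()) can be sketched with plain JDK concurrency primitives. This is an illustrative reconstruction, not the poster's actual code; the Session and payload handling are placeholders:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GameSendPool {
    // Pool sized at CPU cores * 2, as described in the report.
    static final int POOL_SIZE = Runtime.getRuntime().availableProcessors() * 2;
    static final ExecutorService POOL = Executors.newFixedThreadPool(POOL_SIZE);

    // In the real application this task would end with
    // session.getAsyncRemote().sendText(payload), which only queues the
    // message in the container's write queue and returns immediately.
    static void sendAsync(String payload) {
        POOL.submit(() -> {
            // prepare the outgoing game state, then hand it to Tyrus here
        });
    }

    public static void main(String[] args) {
        System.out.println("pool size = " + POOL_SIZE);
        sendAsync("demo payload");
        POOL.shutdown();
    }
}
```

Because the async remote only enqueues the write, these worker threads never block on the network; the actual socket I/O happens on Grizzly's own threads, which is relevant to the discussion below.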
@glassfishrobot Commented @pavelbucek said: Can you share a reproducer?
@glassfishrobot Commented soylomass said: I wish I could. It happens randomly.
I've changed the async sends to basic (getBasicRemote()) sends to see if it stops happening. I'll report back.
@glassfishrobot Commented soylomass said: Basic, of course, performed even worse, as it isn't efficient for handling a lot of writes.
Pavel, do you have any idea what could be causing hundreds of these exceptions per second at random times? Is there anything I could try? My project is stuck until I fix this, and I've tried everything.
@glassfishrobot Commented @pavelbucek said: You could test with different settings related to selector and worker thread pools, see https://tyrus.java.net/apidocs/1.13/org/glassfish/tyrus/container/grizzly/server/GrizzlyServerContainer.html
But I'm not sure whether that will help. The problem seems to be dropped (closed) TCP connections, and the exceptions shown seem to be only a consequence, not the cause. Are you monitoring the number of open files/sockets on your server? Isn't it close to the limits? (See ulimit -a on *nix-based systems, plus sysctl -a, etc.)
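Besides checking ulimit -a from a shell, the descriptor usage can be watched from inside the JVM. This sketch uses the JDK-specific (non-portable) com.sun.management.UnixOperatingSystemMXBean, which is an assumption of a HotSpot-style JVM on a Unix system and is not part of Tyrus; on other platforms it simply reports the counters as unavailable:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdMonitor {
    // Returns {openDescriptors, maxDescriptors}, or null when the
    // JDK-specific Unix MXBean is not available on this JVM.
    static long[] fdUsage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] { unix.getOpenFileDescriptorCount(),
                                unix.getMaxFileDescriptorCount() };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] fd = fdUsage();
        if (fd != null) {
            System.out.println("open fds: " + fd[0] + " / limit: " + fd[1]);
        } else {
            System.out.println("fd counters unavailable on this JVM");
        }
    }
}
```

Logging this periodically would show whether the server is creeping toward the descriptor limit before the error storms start.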
@glassfishrobot Commented soylomass said: Is there any formula for the recommended number of selector/worker threads? The software is an MMO game, sending to and receiving from every client more than 10 times per second.
By the way, I've set a timeout in the threads that prepare the data for clients and then send it with getAsyncRemote().send, and while the exceptions keep occurring, the timeout stops them and saves the server from crashing. It's strange, as the exception should be occurring in the Grizzly thread, not in mine (which I'm terminating).
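The timeout guard described above can be sketched with only the JDK. Here a Future from an ExecutorService stands in for the pending write (in the real code the Future would come from getAsyncRemote().sendText(...)); get(timeout) bounds how long the preparing thread waits, and on timeout the write is abandoned instead of blocking the thread indefinitely:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedSend {
    // Returns true if the (simulated) write completed within timeoutMs,
    // false if it timed out or failed.
    static boolean sendWithTimeout(ExecutorService pool, Runnable send,
                                   long timeoutMs) {
        Future<?> f = pool.submit(send);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;                 // write completed in time
        } catch (TimeoutException e) {
            f.cancel(true);              // abandon this write, keep the thread alive
            return false;
        } catch (InterruptedException | ExecutionException e) {
            return false;                // write failed, e.g. connection closed
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        System.out.println("completed: " + sendWithTimeout(pool, () -> { }, 500));
        pool.shutdown();
    }
}
```

Note the trade-off: waiting on the Future at all makes the send partially synchronous, which is exactly the concern raised later in the thread.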
@glassfishrobot Commented @PetrJanouch said: Hi,
my humble opinion is that this has nothing to do with threads. Grizzly has a write queue whose content it feeds to a socket when the socket is ready for writing. The logged error means that Grizzly was not able to send the queued data because the peer (the client, a load balancer, or a proxy if there is one) had closed the TCP connection (look at the java.io.EOFException in the cause).
So I think you are looking at the wrong side of the connection, and the key to solving this is finding out why the connection was closed by the other side. By the way, by "closing the connection" I mean that it was closed at the TCP level, not gracefully via the WebSocket closing handshake.
The only thing that puzzles me is that you say the server becomes unresponsive after this happens. The only reason Grizzly can become unresponsive is that all its worker threads are blocked somewhere (the exception you posted above wouldn't cause a Grizzly thread to block). When the server becomes unresponsive, take a thread dump and see what the Grizzly worker threads are doing, or post it here if you cannot figure anything out.
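A thread dump can be taken with jstack <pid> from the command line, or programmatically via the standard ThreadMXBean as in this sketch; either way, the thing to look for is Grizzly worker threads stuck in BLOCKED or WAITING states:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpThreads {
    // Captures a snapshot of all live threads, including which monitors
    // and synchronizers each one holds or waits on.
    static ThreadInfo[] dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        return bean.dumpAllThreads(true, true);
    }

    public static void main(String[] args) {
        for (ThreadInfo t : dump()) {
            // In the scenario above, filter for names containing "Grizzly"
            // and check whether they are all BLOCKED or WAITING.
            System.out.println(t.getThreadName() + " -> " + t.getThreadState());
        }
    }
}
```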
@glassfishrobot Commented soylomass said: What puzzles me is why the exception, once it happens, repeats itself hundreds of times in a second, if I don't make that many writes to a single client.
The exception doesn't make the server crash anymore now that I've implemented a timeout in the threads where the getAsyncRemote().send calls occur, but the exceptions keep happening randomly.
@glassfishrobot Commented @PetrJanouch said: That is strange. The exception is logged once per closed connection, and it will also be returned by the Future/javax.websocket.SendResult of each uncompleted asynchronously written message.
Are you sure the hundreds of logged messages belong to one connection and many connections are not closed at the same time for whatever reason?
@glassfishrobot Commented soylomass said: I would like to check that. I should use Future.get() to find out, right? In that case, wouldn't the operation perform similarly to a synchronous write, since it would wait for the write operation to complete?
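For reference, blocking on Future.get() would indeed make the send effectively synchronous. The javax.websocket API also offers a non-blocking alternative: RemoteEndpoint.Async.sendText(String, SendHandler), where the SendHandler receives a SendResult reporting success or the failure cause without any waiting. The sketch below mimics that callback style using only the JDK, with a CompletableFuture standing in for the pending write:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

public class CallbackSend {
    // Attaches a completion callback to the pending write and reports its
    // outcome, mirroring what SendHandler.onResult(SendResult) receives.
    static String awaitResult(CompletableFuture<Void> pendingWrite)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        StringBuilder outcome = new StringBuilder();
        pendingWrite.whenComplete((ignored, err) -> {
            // err is non-null on failure, e.g. a closed connection
            outcome.append(err == null ? "ok" : "failed");
            done.countDown();
        });
        done.await(); // only so the demo can print; real code would not block
        return outcome.toString();
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> write = CompletableFuture.runAsync(() -> { });
        System.out.println("write " + awaitResult(write));
    }
}
```

With the callback style the sending thread is never parked, so per-connection failures can be counted without degrading throughput.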
@glassfishrobot Commented soylomass said: EDIT: I've checked the logs once more, and the exceptions always start after a user disconnects (as caught by the onClose callback):
Close connection for client:
{80c02da2-e649-4edf-9a1e-6fa70591aef0}
ON CLOSE
Aug 14, 2016 3:51:19 PM org.glassfish.tyrus.container.grizzly.client.GrizzlyWriter$WriterCondition$1 onError
WARNING: Connection closed
So I don't think it has to do with many connections being closed at the same time. But it's still strange that the exception happens more times than the number of writes made to that user.
@glassfishrobot Commented @PetrJanouch said: I have just figured out what happened. The message can indeed be logged an enormous number of times per closed connection. I will fix it tomorrow morning CET, and a snapshot with the fix should also be available tomorrow.
Thanks for reporting this. I will keep you updated about the fix.
@glassfishrobot Commented soylomass said: Nice! I'll look forward to it. Thanks a lot for the quick responses.
@glassfishrobot Commented @PetrJanouch said: OK. Fixed. You can try 1.14-SNAPSHOT or 2.0-SNAPSHOT from java.net maven repo https://maven.java.net/content/repositories/snapshots/org/glassfish/tyrus/ Please, let us know how it worked for you.
@glassfishrobot Commented soylomass said: I've updated to the 1.14 snapshot and will report the results. What's the difference in 2.0?
@glassfishrobot Commented @pavelbucek said: 2.0 is Java 8, 1.14 is Java 7.
But in practice, they are equal - same features, same code. Please use 1.14, since we are not yet releasing 2.x and it's not clear when that will happen.
@glassfishrobot Commented soylomass said: Ok.
Can confirm the 1.14 snapshot fixes the problem (the exception now appears only once). Thanks to both of you for the quick help and support.
@glassfishrobot Commented This issue was imported from java.net JIRA TYRUS-429