sockjs-node
SockJS is not emitting 'close' event
I have a chat (Facebook-style chat) that gets over 3000 open sockets (online users) during the day.
But during the night I have a maximum of 400 users online, yet the chat (with sockjs) reports that it still has, more or less, 2000 open sockets.
If I restart the server (browsers reconnect automatically), my chat server goes back to having only 400 sockets, which proves that the 2000-connection figure was incorrect.
In development this bug does not exist, only in production...
What can I do to find out what's going on?
How are you quantifying the number of open sockets / connected users? Have you checked the number of TCP connections open on the server level?
The reason it could only exist in production is you may not be replicating the user behavior / browser that is causing the incorrect socket count.
-Ryan
I'll check the number of TCP connections next time...
To quantify (a very small piece):

```javascript
var sockCount = {};

io.on('connection', function(conn) {
  var username = null;

  conn.on('data', function(jsonData) {
    var data = JSON.parse(jsonData || "{}");

    if (data.channel === 'aim:auth' && helper.validateUserToken(data.username, data.token)) {
      // only count the first successful auth per connection, otherwise
      // repeated auth messages would inflate the counter
      if (!username) {
        sockCount[data.username] = (sockCount[data.username] || 0) + 1;
      }
      username = data.username;
      conn.write(helper.toJSON({channel: 'ready', message: "ok"}));
      return;
    }

    if (!username) {
      conn.close(42, 'not authenticated');
      return;
    }

    if (data.channel !== 'aim:auth') {
      conn.emit(data.channel, {original: data, user: username});
    }
  });

  conn.on('close', function() {
    if (username && typeof sockCount[username] === 'number') {
      sockCount[username] -= 1;
      if (sockCount[username] < 1) delete sockCount[username];
    }
  });
});
```
So, when I print `sockCount`, I have something like:

```
user1: 1 open socket
user2: 10 open sockets
user3: 3 open sockets
...
total users: 1600, total sockets: 3000
```

...which, I know for sure, is not true.
I've found from my experiences using this in production that incrementing counters on connect / close events isn't the most accurate way.
On applications where I've wanted an accurate count of users online, I have always implemented a heartbeat system. Basically the user sends back an identification packet every x seconds and I store it in a database like MongoDB (which allows me to expire documents after x seconds) using an upsert. So whenever a user pings back, the server updates the heartbeat entry, or creates a new one if it has been cleaned up.
From there you can simply query the collection to get a count of users online, and even go further to get which types of users are online, etc.
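To make the idea concrete, here is a minimal in-memory sketch of that heartbeat approach. The `Presence` class and its method names are hypothetical; in production the map would be a MongoDB collection with a TTL index, updated by an upsert on each heartbeat.

```javascript
// In-memory model of the heartbeat/TTL presence idea.
class Presence {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.lastSeen = new Map(); // username -> timestamp of last heartbeat
  }

  // Called whenever a client sends its periodic identification packet.
  beat(username, now = Date.now()) {
    this.lastSeen.set(username, now); // acts like an upsert
  }

  // Count users whose heartbeat has not yet expired,
  // pruning stale entries (emulating MongoDB's TTL cleanup).
  onlineCount(now = Date.now()) {
    let count = 0;
    for (const [user, ts] of this.lastSeen) {
      if (now - ts <= this.ttlMs) count++;
      else this.lastSeen.delete(user);
    }
    return count;
  }
}

// Example: two users beat at t=0, only user1 keeps beating.
const presence = new Presence(30000); // 30s TTL
presence.beat('user1', 0);
presence.beat('user2', 0);
presence.beat('user1', 60000);
console.log(presence.onlineCount(60000)); // → 1 (user2 expired)
```

The advantage over connect/close counters is that a missed `close` event only leaves a stale entry for one TTL window, instead of forever.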
"close events isn't the most accurate way."
But what happens? Are there moments when the 'close' event doesn't execute? Is there some way to reproduce this issue on the development side?
I'm really not sure, but from what I've seen it doesn't always fire. Have you ever tried using a non-websocket-capable browser that has to fall back to xhr-polling?
OK, right now is a moment when my chat reports more sockets than real online users:

```
total users: 2304, total sockets: 4370
```
sysctl:

```
"sysctl" => {
  'net.core.somaxconn'      => 16384,
  'net.core.rmem_max'       => 16777216,
  'net.core.wmem_max'       => 16777216,
  'net.ipv4.tcp_rmem'       => "4096 87380 16777216",
  'net.ipv4.tcp_wmem'       => "4096 65536 16777216",
  'net.ipv4.tcp_syncookies' => 1,
  'file-max'                => 999999
}
```

```
netstat -a | wc -l  => 2110
lsof -p 346 | wc -l => 1850
```
AFTER A CHAT RESTART (browsers reconnect automatically within 10s... I waited 15 min):

```
total users: 1145, total sockets: 1331
netstat -a | wc -l   => 2717
lsof -p 3707 | wc -l => 1732
```
but anyway, this is something that SockJS SHOULD manage internally. It SHOULD manage this heartbeat.
Out of curiosity, can you do a breakdown by transport for all opened connections? Can you tell if it is a problem with a specific fallback protocol?
One possibility: polling transports use a server-side session, which is kept alive for a few seconds between requests. So there can be situations where there are fewer TCP connections than connected SockJS clients.
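A per-transport breakdown like the one requested can be collected from the server's connection map. This is a sketch: it assumes a `conns` object keyed by connection id (as in the gist shared later in this thread) and reads each connection's `protocol` property, which sockjs-node exposes as the transport name.

```javascript
// Tally open connections per transport to spot a misbehaving fallback.
function transportBreakdown(conns) {
  const counts = {};
  for (const id of Object.keys(conns)) {
    const proto = conns[id].protocol || 'unknown';
    counts[proto] = (counts[proto] || 0) + 1;
  }
  return counts;
}

// Example with stand-in connection objects:
const conns = {
  a: { protocol: 'websocket' },
  b: { protocol: 'websocket' },
  c: { protocol: 'xhr-polling' }
};
console.log(transportBreakdown(conns)); // → { websocket: 2, 'xhr-polling': 1 }
```

If the stuck connections cluster under one transport, that narrows the bug to a specific fallback implementation.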
It is 5 am in Brazil right now; this is not an issue of a few seconds, it is an issue over hours, or forever...
90% of my users are on an up-to-date Chrome version, so I'm pretty sure it is not a protocol-specific issue. But I'll do more tests.
But maybe it is an issue in my code; I'll share it in a gist.
No problem, just send your findings when you have them.
I used sockjs-node (and sockjs-tornado) and did not experience the problem you're having.
I'll try to help, but I don't even know why it happens.
Serge.
Here it is... I removed some code (not related to this issue): https://gist.github.com/maxguzenski/5946916
My input on this:
```
Users length: 1426
Connections count: 1909
{ 'xhr-streaming': 197,
  'xhr-polling': 83,
  websocket: 1626,
  'jsonp-polling': 3,
  eventsource: 0,
  htmlfile: 0 }
netstat ESTABLISHED sockets: 4066 (I'm behind HAProxy, so divide by 2)
```

The difference between the user count and the connection count is due to some users being connected in multiple tabs/browsers, which is OK in my scenario.
@maxguzenski what version of sockjs / node are you using? Have you seen #99? Do you have sockets stuck in some other states?
sockjs-node = 0.3.7 (up-to-date version), sockjs-client = 0.3.4 (up-to-date version)
But now that I think about it... I did have this issue a few months ago. In the last couple of months I updated nodejs (from 0.10.something to 0.10.12) and sockjs from 0.3.4 to 0.3.7... I'll try sockjs 0.3.4 again before downgrading nodejs.
[updated] Sorry, I have always used sockjs 0.3.7, since this chat is in production... The only thing that really changed was the nodejs version.
OK, I'm debugging it... now I'm logging everything. I have 1000 connections right now, and at least one is hanging in state 3 (closed), readable: false and writable: false (websocket protocol, IE10)... and it looks like conn.on('close', ...) was never executed.
[updated] My sockjs app has 4 sockets in state "closed" that didn't emit the "close" event. At this rate, it will have between 700 and 900 "hanging connections" by tomorrow morning.
Hi. At the end of the day, my SockJS webchat had more than 100 sockets in state 'closed' that never fired the 'close' event. This bug occurred on less than 5% of all the connections I had during the day.
To work around this problem, I just did:
```javascript
// every 10 minutes, drop connections stuck in readyState 3 (CLOSED)
setInterval(function() {
  var connsIds = Object.keys(conns);
  connsIds.forEach(function(connId) {
    if (conns[connId].readyState === 3) delete conns[connId];
  });
}, 10 * 60 * 1000);
```
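A slightly safer variant of that sweep also runs, by hand, whatever per-connection cleanup the missing 'close' event would have triggered (e.g. decrementing a `sockCount` entry). This is a sketch; `reapClosed` and the `onReaped` hook are hypothetical names, not sockjs APIs:

```javascript
// Reap connections stuck in readyState 3 (CLOSED) whose 'close'
// event never fired, invoking a cleanup callback for each one.
function reapClosed(conns, onReaped) {
  for (const id of Object.keys(conns)) {
    if (conns[id].readyState === 3) {
      const conn = conns[id];
      delete conns[id];
      if (onReaped) onReaped(conn); // e.g. decrement per-user counters
    }
  }
}

// Example with stand-in connections:
const conns = {
  a: { readyState: 1 },
  b: { readyState: 3 },
  c: { readyState: 3 }
};
const reaped = [];
reapClosed(conns, (c) => reaped.push(c));
console.log(Object.keys(conns).length, reaped.length); // → 1 2
```

Without the callback, the map is cleaned but any counters maintained in the 'close' handler still drift.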
After seeing this issue, I checked my app; it runs behind an SSL haproxy and all incoming connections go through it.
I killed haproxy with kill -9 just now and waited a few minutes. Even though all the TCP connections at the OS level were dropped, sockjs indeed did not emit close events correctly, and I had the following stats:

```
Connections count: 861
Protocol counters: { 'jsonp-polling': 0, 'xhr-streaming': 0, websocket: 861, 'xhr-polling': 0, htmlfile: 0, eventsource: 0 }
```
Might be a problem with faye implementation
This may be related to the Chrome issue https://code.google.com/p/chromium/issues/detail?id=76358, but it is likely part of a larger connectivity/network issue that is causing the close event to be emitted unreliably.
In any case, I have tested my application using Firefox, Safari and Chrome, and all of them display this symptom:
If I close the lid of my machine to put it to sleep, the connections are not properly closed, and in about 95% of cases they remain improperly terminated indefinitely.
It is really unpredictable... sometimes these connections remain in limbo in readyState 3 (the most common case) without emitting the close event; sometimes they remain in readyState 1, thus also not executing the close event; and occasionally, for some reason, they do emit the close event. But this is rare.
If I trace the contents of my connection map, these connections build up over time and, as mentioned, lead to a major memory leak.
Examining the transport.js code, it seems like there are potentially some conflicting operations occurring when a connection "times out". The session, as well as the event listeners for `close` and `end`, are removed by the time the methods that emit them are supposed to be executed.
My hunch is that some of the calls (i.e. `didClose`, `didAbort`, `tearDown`, `unregister`, etc.) are all being called linearly as if they were blocking, when they should be called asynchronously using callbacks to ensure a consistent order of operations during the connection-closing process.
Could someone confirm the reason why callbacks are not used at all in the transports.js script?
I still have this issue. I check the connections list and readyState is CLOSED. I can only reproduce it on the production server. I'm not using sockjs but http://npmjs.com/faye-websocket (0.11.0). As sockjs uses faye-websocket too, this issue may be somewhere inside that lib. I implemented the workaround from @maxguzenski and now it works.
@maxguzenski Did you stick with that workaround or find the actual issue?