WebSocket-Node
ping/pong keepalive fails for large transfers over slow links
If a large amount of data (say 10M) has been queued for transmission over a slow connection, an outgoing ping can be delayed so long that the grace period timer expires and the connection is closed before the remote end has a chance to respond.
One could of course increase the grace period timeout, but it is counter-intuitive that this should depend on the size of data transfers.
Instead we've tried sending a pong after each received binary frame. A received pong will keep the connection alive with negligible side effects, and this appears to fix the problem.
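For reference, the workaround can be sketched roughly as below. This is only an illustration, assuming the receiving side uses a WebSocket-Node style connection object that emits `message` events carrying a `type` field and exposes a `pong()` method; `pongOnBinaryMessage` is a hypothetical helper name, not part of the library.

```javascript
// Hypothetical helper (not part of WebSocket-Node): reply to every received
// binary message with an unsolicited pong, so the sender sees incoming
// traffic even while its own ping is stuck behind queued data.
function pongOnBinaryMessage(connection) {
  connection.on('message', (message) => {
    if (message.type === 'binary') {
      // Assumes the connection object exposes pong(), as WebSocket-Node's does.
      connection.pong();
    }
  });
  return connection;
}
```

Note that this hooks assembled messages rather than individual frames; if the receiver handles raw frames instead, the same idea would apply to whatever per-frame callback it uses.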
Comments welcome.
To clarify, are you saying that when you send a large amount of data to the remote peer, it can take so long just to send the ping that the keepalive timeout drops the connection before the remote peer can receive the ping and respond with a pong?
Currently, we do reset the keepalive timeout every time we receive incoming data of any sort, which seems to cover most cases, but I suppose it is plausible that the remote peer never sends any data at all while it's receiving what you're sending, in which case, the server might naïvely assume that the connection is dead.
Have you tried setting the option useNativeKeepalive: true? That will disable the default behavior of dropping the connection if a pong frame is not received, because it requests that the operating system use the native keepalive facilities of TCP instead of the heavier-weight WebSocket ping and pong frames.
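A minimal sketch of what that configuration might look like. The option names are the server config keys as I understand them; the interval value is just an example, and `keepaliveOptions` is a name made up for this snippet.

```javascript
// Keepalive-related options, intended to be merged into the config object
// passed to the WebSocketServer constructor alongside httpServer.
const keepaliveOptions = {
  keepalive: true,           // keep some form of keepalive enabled
  useNativeKeepalive: true,  // use TCP keepalive instead of ping/pong frames
  keepaliveInterval: 20000,  // milliseconds; example value only
};

// Usage (assumes the 'websocket' package and an existing http server):
//   const { server: WebSocketServer } = require('websocket');
//   new WebSocketServer(Object.assign({ httpServer }, keepaliveOptions));
```

With this setting, the grace-period timer and the drop-on-timeout behavior of the WebSocket-level keepalive should no longer come into play.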
I wonder if I should switch the default to be TCP keepalive instead, at this point?
Hi, regarding the failure mechanism, yes, that is exactly the problem. The ping is queued to the TCP stream at the right time, but there is so much data queued ahead of it that it is not seen and responded to in time. The bottleneck is probably on the receiving end of the TCP socket, since the end device is quite slow and is decrypting the data and streaming it to a flash file system. The problem can be solved at the WebSocket layer by sending pongs every so often (e.g. after every received binary frame). I haven't looked into whether this breaks WebSocket protocol conformance, but the cost seems low.
We've considered using the TCP keepalive, but this has the disadvantage that it won't work end-to-end across proxies, load balancers etc. All in all I think it would make sense to fix this weakness in the WebSocket layer keepalive, rather than to fall back to something else.
Ok. As for your approach, unsolicited pong frames are explicitly allowed in the WebSocket RFC, so there's no protocol violation.
Another possible approach would be to maintain our own buffer of outgoing packets and pay attention to the backpressure on the socket, so that we can allow certain control frames such as PONG to jump the line.
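To make the idea concrete, here is a rough sketch, not how the library currently works. It assumes a Node-style socket whose `write()` returns `false` under backpressure and which emits `drain` when it can accept more data; `PriorityFrameQueue` and its methods are hypothetical names. Each queued item is assumed to be one complete, already-serialized frame, so a control frame is only ever inserted between frames, never inside one.

```javascript
// Hypothetical outgoing queue: control frames (ping/pong/close) are written
// ahead of any data frames that are still waiting in user space.
class PriorityFrameQueue {
  constructor(socket) {
    this.socket = socket;
    this.controlFrames = [];
    this.dataFrames = [];
    this.blocked = false; // true while the socket reports backpressure
    socket.on('drain', () => {
      this.blocked = false;
      this.flush();
    });
  }

  enqueue(frameBuffer, isControl = false) {
    (isControl ? this.controlFrames : this.dataFrames).push(frameBuffer);
    this.flush();
  }

  flush() {
    while (!this.blocked) {
      // Control frames always go first; data frames keep their own order.
      const next = this.controlFrames.length > 0
        ? this.controlFrames.shift()
        : this.dataFrames.shift();
      if (next === undefined) return; // both queues are empty
      this.blocked = !this.socket.write(next);
    }
  }
}
```

This only helps for frames that have not yet been handed to the socket, so a large message would have to be split into reasonably small frames for a control frame to overtake it.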
Yes we considered that option too, but it seemed more complex to implement. For the moment we're using the pong fix to meet a tight deadline, but I'll be glad to return to this again so we can settle on the best fix.
On Tue, Jun 7, 2016 at 9:03 PM, Brian McKelvey wrote:
Another possible approach would be to maintain our own buffer of outgoing packets and pay attention to the backpressure on the socket, so that we can allow certain control frames such as PONG to jump the line.
Just a quick note that useNativeKeepalive is not mentioned in the API documentation - can it be added?