
Diagnostic mechanism for stream overrun

Open tkurki opened this issue 5 years ago • 14 comments

We recently encountered a situation where a websocket client was disconnecting periodically. This was traced back to the fact that the client was inadvertently subscribing to data and not reading the incoming stream at all, because the purpose of the client was to send data (a sensor node).

This was awkward to debug, because there was no indication of the congestion.

The same situation may arise when the websocket bandwidth is, on average over a longer period of time, less than what the server => client stream requires. Especially with remote connections this is a real scenario.

A stopgap measure on the server side is to detect this situation, log an error message and drop the connection once a threshold is exceeded. The server can alternatively just stop sending data and drop the connection once everything pending has been sent, with a safeguard timeout for dropping the connection in case the client is stalled or plain dysfunctional.

Resuming transmission after a pause to clear the backlog breaks the assumption that the stream connection (ws, tcp) is reliable and that we get all the data we have subscribed to, so dropping the connection eventually is probably the better alternative.
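
A minimal sketch of that threshold check, assuming the `ws` package; `MAX_BUFFERED_BYTES` and `sendDelta` are illustrative names, not part of any existing server:

```javascript
// Illustrative only: drop the connection when the outgoing buffer grows past a threshold.
// Assumes the 'ws' package; MAX_BUFFERED_BYTES and sendDelta are made-up example names.
const WebSocket = require('ws')

const MAX_BUFFERED_BYTES = 512 * 1024

const wss = new WebSocket.Server({ port: 3000 })

wss.on('connection', (ws) => {
  const sendDelta = (delta) => {
    if (ws.bufferedAmount > MAX_BUFFERED_BYTES) {
      // The client is not draining the stream fast enough: log and terminate.
      console.error('Server outgoing buffer overflow, terminating connection')
      ws.terminate()
      return
    }
    ws.send(JSON.stringify(delta))
  }
  // ...register sendDelta as the delta handler for this client's subscriptions
})
```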

From the client's perspective this would look like just a dropped connection and figuring out why is left to whoever oversees the system.

Instead we could have an early warning system and a standardised final message, indicating that there is trouble brewing, and in case the client manages to receive the final message it gets a clear indication of what happened.

One way to do this would be to send normal Signal K notifications, which might reach the user for an interactive client.

However a distinct message indicating connection health and error conditions would be much more explicit, allowing even a non-interactive client such as a sensor to log the message, flash an LED or do something explicit about the problem.

tkurki avatar Jan 05 '19 11:01 tkurki

Maybe related: if we need to reboot a server we can just kill the server and drop connections, or we can stop sending data and wait for all the outgoing data to be streamed. For an HA system we should be able to indicate that a client should gracefully disconnect and reconnect (assuming that the reconnect is routed to another server instance).

HTTP/2 has the GOAWAY mechanism to handle graceful shutdowns: https://http2.github.io/http2-spec/#GOAWAY

tkurki avatar Jan 05 '19 11:01 tkurki

For context, the node doc says:

net.Socket has the property that socket.write() always works. This is to help users get up and running quickly. The computer cannot always keep up with the amount of data that is written to a socket - the network connection simply might be too slow. Node.js will internally queue up the data written to a socket and send it out over the wire when it is possible. (Internally it is polling on the socket's file descriptor for being writable). The consequence of this internal buffering is that memory may grow. This property shows the number of characters currently buffered to be written. (Number of characters is approximately equal to the number of bytes to be written, but the buffer may contain strings, and the strings are lazily encoded, so the exact number of bytes is not known.) Users who experience large or growing bufferSize should attempt to "throttle" the data flows in their program with socket.pause() and socket.resume().

I believe in this case not reading the data was a bug in the client, not a server issue. Streaming is provided by lower level layers and should be relied on. We should not add new messages to try to work around what is working in this case.

@tkurki said:

The server can alternatively just stop sending data and drop the connection when everything is sent, with a safeguard timeout for dropping the connection in case the client is stalled or plain dysfunctional.

Sounds exactly like what is happening today, no?

Resuming transmission after a pause to clear the backlog breaks the assumption that the stream connection (ws, tcp) is reliable and that we get all the data we have subscribed to, so dropping the connection eventually is probably the better alternative. From the client's perspective this would look like just a dropped connection and figuring out why is left to whoever oversees the system.

👍

Instead we could have an early warning system and a standardised final message, indicating that there is trouble brewing, and in case the client manages to receive the final message it gets a clear indication of what happened. One way to do this would be to send normal Signal K notifications, which might reach the user for an interactive client.

In most cases the client will never get it because the send buffer is already full and the connection will be disconnected before the message gets through. In the situation that caused this discussion an additional message would not have helped.

However a distinct message indicating connection health and error conditions would be much more explicit, allowing even a non-interactive client such as a sensor to log the message, flash an LED or do something explicit about the problem.

Only if the client gets it. The sensors could already blink if they detect a failure of the connection and they have to reconnect (which they should do automatically I think).

I will strongly advocate in favor of keeping things as simple as possible, and this would add a lot of unneeded complexity in my opinion.

If we need to do one thing, besides fixing the client, based on this experience, I would suggest a log message on the server warning that the buffer has gotten big and the connection might be killed soon. We could even suggest subscribing to fewer keys or using some kind of filter/aggregation to reduce the volume. Purely cosmetic to help debug the issue - not a change to the protocol.

Happy new year @tkurki ;)

sarfata avatar Jan 05 '19 11:01 sarfata

I totally agree that this particular situation would not have been solved by a diagnostic message. It just got me thinking about other situations, mainly wifi and cellular connections, whose bandwidth may fluctuate wildly and where there is no way, other than tracking lag in timestamps, to figure out that you are trying to push more data than will flow. And there is no way to indicate this to the user - would you not want an explicit message about this problem?

The server can just stop sending data and wait for the outgoing diagnostic message to be sent out.

What exactly the server does to track and log the situation is not a specification issue. See the PR. Having a mechanism to indicate problems early might be one.

@sarfata what is the complexity you fear? I am thinking about just a simple message, something like

{
  "diagnostic": {
    "type": "BANDWIDTH",
    "message": "Excessive buffering, your connection may be too slow for all the data you are requesting."
  }
}

A server can choose not to implement this and a client can just ignore these, but then we would have an explicit mechanism to handle this situation if both support it.
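
On the receiving side, handling it could be as simple as this sketch, assuming `ws` is an already-open browser-style WebSocket and the (hypothetical) message shape above:

```javascript
// Sketch of a client reacting to the proposed diagnostic message (hypothetical shape).
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data)
  if (msg.diagnostic) {
    // A sensor node could log this, flash an LED or throttle its subscriptions.
    console.warn(`Server diagnostic (${msg.diagnostic.type}): ${msg.diagnostic.message}`)
    return
  }
  // ...normal delta handling
}
```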

No current client should (but might 😞 ) break on this.

tkurki avatar Jan 05 '19 12:01 tkurki

[...] to figure out that you are trying to push more data than will flow. And no way to indicate this to the user - would you not want an explicit message about this problem?

We should look at what other real time streaming systems are doing. The first use case that comes to mind is video: https://en.wikipedia.org/wiki/HTTP_Live_Streaming#Adaptability

To enable a player to adapt to the bandwidth of the network, the original video is encoded in several distinct quality levels. The server serves an index, called a "master playlist", of these encodings, called "variant streams". The player can then choose between the variant streams during playback, changing back and forth seamlessly as network conditions change.

To be honest, video is not exactly like our use case. In video it's "easy" to know when you are not getting the data fast enough. As you pointed out, in SignalK, detecting this requires the client to compare timestamps with real time and identify that the data is not being received fast enough.
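
A rough sketch of that client-side check (illustrative only; the threshold is arbitrary, the delta shape follows the usual Signal K update format, and it assumes client and server clocks are roughly in sync):

```javascript
// Illustrative lag check: compare delta timestamps against wall-clock time.
// LAG_THRESHOLD_MS is an arbitrary example value.
const LAG_THRESHOLD_MS = 10 * 1000

function checkLag (delta) {
  (delta.updates || []).forEach((update) => {
    const lagMs = Date.now() - new Date(update.timestamp).getTime()
    if (lagMs > LAG_THRESHOLD_MS) {
      console.warn(`Data is ${Math.round(lagMs / 1000)}s behind real time, the connection may be congested`)
    }
  })
}
```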

Still I think we should be inspired by the fact that they chose to do it client side - and not server side.

What exactly the server does to track and log the situation is not a specification issue.

I think that is where the complexity will come from. Different servers will behave differently. Some clients will rely on servers doing this and we will start having interop issues.

See the PR. Having a mechanism to indicate problems early might be one.

I had not seen the PR. I think the PR is fine. The only "message" I see in the PR is:

spark.end('Server outgoing buffer overflow, terminating connection')

I think it's totally fine to do it the way you did, outside of the SignalK spec.

I guess this is equivalent to Websocket.close(errorCode, reason).
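
For reference, with the standard WebSocket API that would look something like the following; codes 4000-4999 are reserved for application use by RFC 6455, and 4000 here is just an example:

```javascript
// Standard WebSocket close with an application-defined status code and reason.
ws.close(4000, 'Server outgoing buffer overflow, terminating connection')
```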

sarfata avatar Jan 05 '19 16:01 sarfata

I disagree with "we should be inspired by the fact that they chose to do it client side - and not server side". As far as I can see there is no way a client can know that it does not have the available bandwidth. An example to the contrary: I stream from my boat to a cloud server. You stream from the cloud server. My connection is congested and my data starts to fall behind anything approximating real time, yet the connection between you and the cloud server has plenty of bandwidth.

Only the server can see that there is more data than it can get to a client.

tkurki avatar Jan 05 '19 16:01 tkurki

You make good points.

I am still not a fan of mixing a very "network level" problem into SignalK. Maybe there are some other realtime websocket based protocols that already solve this. We are definitely not the first group to run into this issue.

At the very least, I feel this issue needs more research. I just looked around for a bit but cannot find great examples of how people using websocket are solving this issue (I was looking around games for example).

sarfata avatar Jan 05 '19 16:01 sarfata

Sure, I understand the mismatch.

I am no expert on gaming protocols, but I believe streaming updates there is mostly done over UDP.

I added this issue more as a discussion focal point, not something that needs urgent attention.

tkurki avatar Jan 05 '19 17:01 tkurki

The non-delivery problem here is a normal case for an asynchronous messaging system. All messages (that matter) are sent with the expectation of explicit message acknowledgements. They may also be sent as persistent messages that will survive the reboot of any of the servers.

The solution here is not to add to the signalk protocol, but to use an appropriate message delivery method.

See https://en.wikipedia.org/wiki/Message-oriented_middleware. Common message brokers are ActiveMQ (and ActiveMQ Artemis), RabbitMQ, Kafka, AWS SQS etc.

One of the issues I have with the reliance on REST and http is the lack of proper async message delivery semantics (e.g. deliver once and only once), which leads to the problems above.

The Artemis server doesn't suffer from these problems as it's a full messaging server with message buffering (queues), message acknowledgements, auto-throttling, and message time-to-live timeouts. It also has sophisticated message inspection, debugging and queue management, including alarms and consequent remedial actions. That's why it was built on the ActiveMQ Artemis message broker, which provides all this for free.

The solution is not to add functionality to the application layer (signalk, or node server in this case) as this inevitably leads to building your own message broker. Easier to incorporate an existing and proven messaging solution in the server implementation.

Once you start to extend signalk data beyond the vessel and into the cloud, data will often pass over several transports and through several intermediate servers. For control and critical messages reliable delivery is much more important.

rob42 avatar Jan 05 '19 20:01 rob42

BTW there is no way to send a reliable message at the application layer over ws (basically tcp). It's a classic messaging problem, e.g.

1. Bob sends a message to Sam
2. Sam receives the message
3. Sam sends an acknowledgement to Bob
4. Bob receives the acknowledgement

How does Sam know that Bob got the acknowledgement? Sure, the tcp was reliable, but did it actually get sent and delivered at the application level? What if the server crashed, or the connection dropped mid-process? So:

Bob sends an acknowledgement of the acknowledgement... and so on.

The obvious solution is for Bob to get explicit acknowledgement from his application layer that the message was accepted to be sent, then for every subsequent layer to explicitly get acknowledgement of the next step, AND to do retries and then send errors back up the chain. Congrats, you have just built a messaging server :-)

rob42 avatar Jan 05 '19 20:01 rob42

Sure, for reliable messaging you need the proper mechanism.

However this particular issue has nothing to do with reliable messaging, acknowledgements or guaranteed exactly once delivery. The problem I tried to outline above is not non-delivery, but a client falling behind when trying to stream real time data and whether or not the server can somehow inform the client about it. The problem would be mostly the same even with a messaging fabric, just that a broker would have better mechanisms for queuing the messages.

I agree that this issue may very well be something we don't want in the application layer.

As for messaging: I would really love to see SK specification cover how the different messaging standards should be used with Signal K so that not everybody needs to come up with a new way to put the pieces together. We are not making much progress there.

tkurki avatar Jan 05 '19 21:01 tkurki

The client falling behind is just one of the issues the messaging fabric solves. At the simplest level it auto-queues the messages, either dropping the oldest or paging them to disk, so it doesn't blow up memory. That's useful for intermittent connections, variable client speeds or reconnections on a different socket.

But the solution to your particular problem is message priority.

The server can see the outgoing queue growing and can control message priority, so critical messages go first, and it could inform the admin or the client (with a higher priority message) of the backlog. It could signal to throttle the producer, or apply a policy (e.g. time-to-live) to ensure messages are removed in a sensible fashion.
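
As a rough illustration of that queue policy idea (this is not Artemis's API; all names here are made up):

```javascript
// Sketch of a per-client outgoing queue applying priority and time-to-live.
class OutgoingQueue {
  constructor (maxLength = 1000) {
    this.maxLength = maxLength
    this.messages = []
  }

  enqueue (payload, { priority = 0, ttlMs = 30000 } = {}) {
    this.messages.push({ payload, priority, expires: Date.now() + ttlMs })
    // Highest priority first, so critical messages are sent before routine deltas.
    this.messages.sort((a, b) => b.priority - a.priority)
    // Drop expired messages, then the lowest-priority ones if the queue is still too long.
    this.messages = this.messages.filter((m) => m.expires > Date.now())
    while (this.messages.length > this.maxLength) {
      this.messages.pop()
    }
  }

  dequeue () {
    const next = this.messages.shift()
    return next ? next.payload : undefined
  }
}
```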

I'm also keen to see the messaging standards mapped. I've done some work on it in the past, need to find time and revisit. BTW the major lesson learned was that all signalk communication and processes need to be possible using just json text messages. While you can map various things in different systems, it's just too complex to get a full mapping to all.

rob42 avatar Jan 05 '19 22:01 rob42

FWIW, Node's definition is an oversimplification; network transport issues are an implementation concern, not a given. Phoenix (an Elixir/Erlang based web framework) is capable of handling over 2 million clients without issue. This can and should be solved at the socket implementation level, and as this is the Signal K specification, it is best left to implementations.

JonRowe avatar Jan 05 '19 22:01 JonRowe

@tkurki wrote:

This was awkward to debug, because there was no indication of the congestion

If the main reason is debugging, wouldn't a warning to the log do the trick? An issue like this would come up during client development, and a console log would then be picked up by the developer working to fix the bug - no?

If this issue doesn't assert itself until (much later during) runtime, there may be some value in informing the client. However, that would also increase complexity on the client side, as the client would have to implement solutions to (various?) failure cases that may not be known at development time. IMO that added complexity (of building a partly fault-tolerant system) should be avoided whenever possible, and in this case it seems to me that a console log plus dropping the connection (which signals to the debugging party that something's wrong anyway) is enough to fix such a bug.

fabdrol avatar Jan 07 '19 08:01 fabdrol

Signalling the client with a signalk message does add a lot of client complexity, and we should avoid it. (When it's part of the messaging protocol its use is more automatic, e.g. in a cluster of messaging-capable servers.)

Dropping the connection when the queue grows beyond a certain point is a nice simple way of notifying the client, we should make that best practice, and assume the client reconnects when it can.
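
For the client side of that, a reconnect loop with backoff is usually enough; a sketch, where the URL and delays are example values only:

```javascript
// Sketch: treat a dropped connection as the cue to reconnect, with exponential backoff.
function connect (url, attempt = 0) {
  const ws = new WebSocket(url)
  ws.onopen = () => { attempt = 0 }
  ws.onclose = () => {
    const delayMs = Math.min(30000, 1000 * 2 ** attempt)
    setTimeout(() => connect(url, attempt + 1), delayMs)
  }
  // ws.onmessage = ... normal delta handling
}

connect('ws://localhost:3000/signalk/v1/stream')
```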

rob42 avatar Jan 07 '19 19:01 rob42