Slow clients slow down the whole broker

Open alexsporn opened this issue 3 years ago • 5 comments

We are using the MQTT broker and publishing messages directly to all clients using the broker's Publish() func. This func adds a new publish packet to the inlineMessages.pub buffered channel (size 1024), and the inlineClient() loop publishes those packets to all subscribed clients. For each subscribed client this calls client.WritePacket(), which in the end calls Write() on the client's writer.

If a single subscribed client is too slow, that client's write buffer will fill up and the whole inlineClient() loop will hang until the buffer has space again (see awaitEmpty inside Write()). Shortly after, the inlineMessages.pub buffered channel will fill up and further calls to Publish() will hang.

This means a single slow client (even one using QoS 0 with no guarantees of receiving packets) can make the whole broker wait indefinitely and not deliver any more packets to any client.

A possible workaround would be to return a "client buffer full" error instead of waiting for the buffer to be freed, and to skip sending the packet to this client. If the client is using QoS 1/2, the inflight message retry mechanism should try to re-deliver the message.

What do you think? I can write a PR with these changes. Or do you have a better solution to this problem?

alexsporn avatar Sep 01 '22 13:09 alexsporn

Hi @alexsporn! This is very interesting - the possibility never occurred to me.

Currently I am inclined to think the best solution is the one you have described:

  1. If the buffer is full, then writing the message should fail with the error message to the embedding platform.
  2. The QOS of the inline-publisher is always 2 (exactly once), so we don't have to modify how this is handled.
  3. If the QOS of the receiving client subscription is 1/2, then the message should be added to the client's inflight messages queue.

Perhaps we should also make the buffer size for inline publish a value in server.Options. @alexsporn what's your use case which triggered this?

In the meantime I have increased the buffer to 4096 in v1.3.2 👍🏻

mochi-co avatar Sep 02 '22 20:09 mochi-co

Hi @mochi-co, thanks for looking into the issue.

We are using MQTT over WebSocket as a Pub/Sub mechanism to listen to messages processed by our node software. We faced some issues on one of the nodes running the MQTT broker, which has a JavaScript client (QoS 0) that is always connected and receives all unfiltered messages (between 50 and 300 a second). Because this client slowed down the broker and blocked the Publish() function from enqueueing any more messages, the node itself started to slow down and stopped processing messages.

Initially I thought it could be an issue in how we handle the incoming messages and publish them, so I went on to reproduce the bug. Using a JavaScript client (https://github.com/mqttjs/MQTT.js), publishing about 2000 packets a second and forcing the client to sleep between incoming messages to simulate slow processing of each packet, I could reproduce the MQTT broker lockup. Normally I'd say this would be no issue, but this can be used as a Denial-Of-Service attack on public brokers.

With the proposed change, a slow QoS 0 client will no longer affect other connected clients or slow down the broker. As soon as the slow client clears up enough buffer space, it will start receiving messages again.

If the slow client is using QoS 1/2 this opens up another "attack vector" to the broker. If a long InflightTTL is used (defaults to 24 hours), then you can force the memory usage of the broker to quickly go up by using a couple of slow clients. All pending packets will stay in the inflight messages queue.

I totally understand that QoS 1/2 give certain guarantees on how MQTT behaves, but a slow client should not influence the broker's performance. Maybe we need a max count of inflight messages per client?

What do you think?

alexsporn avatar Sep 06 '22 08:09 alexsporn

Hi @alexsporn, thanks for your comprehensive reply :) My apologies for not replying to this earlier, I have been very busy lately...

I absolutely agree with all of the issues you've highlighted here, and have been trying to think about the best way to handle this and ensure we don't create any unintended consequences.

I plan to look into it more thoroughly between now and the weekend if I get some time, but tentatively I think the correct (even expected) behaviour would be to drop the packet if the QOS is 0 and the client buffer is full, otherwise to add it to the inflight queue. This should apply to both inline-message publishing by the embedding service, and also when a client publishes to the broker and the message is delegated out to subscribing clients.
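The tentative behaviour described here could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the `subscriber` type, its channel-based buffer, the slice-based inflight queue, and the `outcome` values are all hypothetical, and acknowledgement and retry handling are omitted:

```go
package main

import "fmt"

// outcome describes what happened to a packet for one subscriber.
type outcome int

const (
	sent    outcome = iota // written to the client's buffer
	dropped                // QoS 0 and buffer full: discarded
	queued                 // QoS 1/2 and buffer full: kept for retry
)

func (o outcome) String() string {
	return [...]string{"sent", "dropped", "queued"}[o]
}

// subscriber is a hypothetical stand-in for a subscribed client.
type subscriber struct {
	qos      byte
	outbound chan []byte // models the client's write buffer
	inflight [][]byte    // unacknowledged QoS 1/2 packets
}

// deliver never blocks. QoS 1/2 packets are always recorded as inflight
// so they can be retried; QoS 0 packets are dropped when the buffer is full.
func (s *subscriber) deliver(pk []byte) outcome {
	if s.qos > 0 {
		s.inflight = append(s.inflight, pk)
	}
	select {
	case s.outbound <- pk:
		return sent
	default:
		if s.qos == 0 {
			return dropped
		}
		return queued
	}
}

func main() {
	q0 := &subscriber{qos: 0, outbound: make(chan []byte, 1)}
	q1 := &subscriber{qos: 1, outbound: make(chan []byte, 1)}
	for i := 0; i < 2; i++ {
		fmt.Println("qos0:", q0.deliver([]byte("p")), "qos1:", q1.deliver([]byte("p")))
	}
}
```

The same `deliver` path could serve both inline publishing and client-originated publishes, since neither caller would wait on a slow subscriber.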

A quick review of the code suggests that writing to clients is blocking (inasmuch as we wait to write to the client's buffer if it's full). This makes me suspect that a client publishing to a topic with many subscribers could theoretically block until all clients are iterated, which is not ideal. I will have a think about how we might alleviate this bottleneck.

mochi-co avatar Sep 07 '22 22:09 mochi-co

@alexsporn I merged your recent PR, can you try pulling down master and seeing if the problem still exists? :) Thank you!

mochi-co avatar Sep 10 '22 18:09 mochi-co

@alexsporn I've reverted #97 and reopened this issue as the solution for #97 causes the broker to stall (as per #101) under heavy load. I believe this may be related to the broker dropping acks if the queue is full rather than waiting.

mochi-co avatar Sep 11 '22 21:09 mochi-co

This issue has been resolved in v2.0.0

mochi-co avatar Dec 10 '22 22:12 mochi-co