nsqjs
Redistribute RDY among NSQD connections
I think it would be great to have an option for nsqjs that could redistribute the RDY count between multiple connected nsqds based on their channel depths. I frequently run into the problem that I need to set a hard "global" max-in-flight value, but I'm running multiple nsqds and sometimes just one of them will be bursting while the others don't get any messages.
The same thing was discussed in go-nsq, unfortunately without an implementation yet. Just food for thought!
https://github.com/nsqio/go-nsq/issues/179 https://github.com/nsqio/go-nsq/pull/277
RDY management as pointed out is tricky but doable. I agree that it's not entirely satisfactory and it would be nice to provide some options on how that's handled between connections.
Considering channel depths would be one heuristic for allocating RDY counts. The biggest issue is that the nsq protocol gives no information about channel depths; that would have to be requested from each nsqd over HTTP. Not exactly ideal.
An alternative consideration would be to allocate RDY count based on the rate of messages coming from nsqds. This would allow reallocating RDY count to a busy channel provided the other nsqds are essentially idle. I assume this would cover your situation?
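To make the rate-based idea concrete, here is a rough sketch of how a fixed max-in-flight budget could be split in proportion to observed message rates. `allocateRdy` and the `rates` map are hypothetical names and not part of nsqjs; how the rate is measured is left out, and every connection keeps a floor of 1 so an idle nsqd can still deliver.

```javascript
// Hypothetical helper, not part of the nsqjs API.
// rates: Map of connection id -> recently observed messages per second.
// Returns a Map of connection id -> RDY count summing to maxInFlight.
function allocateRdy(rates, maxInFlight) {
  const ids = [...rates.keys()];
  if (ids.length === 0) return new Map();
  if (maxInFlight < ids.length) {
    throw new RangeError('maxInFlight must be >= number of connections');
  }

  // Every connection keeps RDY 1 so an idle nsqd can still deliver.
  const alloc = new Map(ids.map((id) => [id, 1]));
  const spare = maxInFlight - ids.length;
  const total = ids.reduce((sum, id) => sum + rates.get(id), 0);

  if (total === 0) {
    // No traffic anywhere: spread the spare budget evenly.
    ids.forEach((id, i) => {
      const extra = i < spare % ids.length ? 1 : 0;
      alloc.set(id, 1 + Math.floor(spare / ids.length) + extra);
    });
    return alloc;
  }

  // Proportional split of the spare budget, using largest remainders
  // so the integer counts still add up to maxInFlight.
  const shares = ids.map((id) => {
    const exact = (spare * rates.get(id)) / total;
    return { id, whole: Math.floor(exact), frac: exact - Math.floor(exact) };
  });
  let left = spare - shares.reduce((sum, s) => sum + s.whole, 0);
  shares.sort((a, b) => b.frac - a.frac);
  for (const s of shares) {
    const extra = left > 0 ? 1 : 0;
    left -= extra;
    alloc.set(s.id, 1 + s.whole + extra);
  }
  return alloc;
}
```

With three connections, a budget of 10, and only one nsqd sending messages, this would give the busy connection 8 and leave 1 on each of the others.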
I'm open to making changes to allow different RDY strategies.
Thanks for the prompt reply, and thanks for explaining the intricacies involved. I wasn't aware that the nsq protocol doesn't expose channel depth. I wanted to bring this up, but unfortunately I won't be able to implement it.
The approach of measuring the rate of messages seems like a reasonable compromise, but I could also see it being problematic. If the entire RDY count is assigned to a single nsqd, you still have to cycle through the other "idle" nsqds frequently and set their RDY count to at least 1. I'm sure there are some scenarios where this will have unexpected / oscillating behaviour, but I can't really articulate those right now.
If the max_in_flight is less than the number of nsqd instances, then you have the situation where the client has to switch between nsqd connections. Usually, this should be considered a bad configuration/architecture.
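For illustration, switching between connections in that situation could look roughly like the round-robin below. This is only a sketch of the general idea, not how nsqjs actually does it; `rotateRdy` and the `tick` counter are made-up names.

```javascript
// Hypothetical sketch, not the nsqjs implementation.
// ids: array of connection ids; tick: counter incremented on every rotation.
// Returns a Map of connection id -> RDY count (0 or 1).
function rotateRdy(ids, maxInFlight, tick) {
  const alloc = new Map(ids.map((id) => [id, 0]));
  const n = ids.length;
  if (n === 0) return alloc;

  // Slide a window of maxInFlight connections around the ring on each tick.
  const start = (tick * maxInFlight) % n;
  for (let i = 0; i < Math.min(maxInFlight, n); i++) {
    alloc.set(ids[(start + i) % n], 1);
  }
  return alloc;
}
```

Any connection outside the current window sits at RDY 0 until the window comes back around, which is where the extra latency comes from.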
While thinking through alternate strategies for redistributing RDY across nsqd instances, it occurred to me that "idle" connections could be treated specially: no matter what, an "idle" connection would always have a RDY count of 1. If that connection receives a message, the client could requeue the message and then rebalance the RDY counts across non-idle nsqd connections. Effectively, it allows the client to peek for messages.
The benefit is that in both the "bad configuration" described above as well as other alternate RDY count allocation strategies, the client wouldn't have to burn RDY counts on idle connections or have to deal with added latency for connections temporarily set with RDY to 0.
There are a couple of downsides:
- requeue would increment the number of attempts on the message. This is problematic for clients who care about the number of attempts.
- requeue also places the message at the back of the channel queue, as I understand the nsqd implementation. Ordering is not guaranteed anyway, but this still seems undesirable.
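A minimal sketch of the bookkeeping behind the "peek" idea, assuming idle connections do not count against the max-in-flight budget because their messages are requeued right away. `PeekingBalancer` and its methods are hypothetical names, not existing nsqjs code; the actual requeue and RDY calls are left to the caller.

```javascript
// Hypothetical sketch of the "peek" strategy; not existing nsqjs code.
class PeekingBalancer {
  constructor(ids, maxInFlight) {
    this.maxInFlight = maxInFlight;
    this.active = new Set();
    this.idle = new Set(ids); // every connection starts out idle
  }

  // Called when a message arrives. Returns what the caller should do with it.
  onMessage(id) {
    if (this.idle.has(id)) {
      // First message from an idle connection: promote the connection and
      // hand the message back (this bumps its attempts count, as noted above).
      this.idle.delete(id);
      this.active.add(id);
      return 'requeue';
    }
    return 'process';
  }

  // Called when a connection has gone quiet (the criterion is up to the caller).
  markIdle(id) {
    this.active.delete(id);
    this.idle.add(id);
  }

  // Idle connections always get RDY 1; the budget is split across active ones.
  rdyCounts() {
    const counts = new Map();
    for (const id of this.idle) counts.set(id, 1);
    const active = [...this.active];
    active.forEach((id, i) => {
      const base = Math.floor(this.maxInFlight / active.length);
      const extra = i < this.maxInFlight % active.length ? 1 : 0;
      counts.set(id, base + extra);
    });
    return counts;
  }
}
```

The caller would apply `rdyCounts()` to the connections after every `'requeue'` result or `markIdle` call. If there are more active connections than `maxInFlight`, some of them end up with RDY 0 in this sketch.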
@mreiferson Thoughts?