nsq
nsq copied to clipboard
nsqd: support draining messages / removing nsqd from rotation
This adds support for a nsqd
mode where messages are drained to facilitate removing a nsqd
instance from a cluster.
New nsqd
API Endpoints
-
PUT /state/shutdown
-
PUT /state/drain
-
POST /topic/drain?topic=...
New nsqd
CLI options (also settable via config file)
-
--sigterm-mode=shutdown
(default - existing behavior) and--sigterm-mode=drain
(new option)
Resolves #1302
RFR @mreiferson @ploxiln - this is ready for a review pass.
I'm pretty happy with how this came out, but still have another pass to make to expand some test coverage.
It seems like it would be simpler to just dis-allow publishing messages. Allow creating channels like normal, and don't bother deleting any before regular shutdown. Just return error for any message publish request. The only other bit of logic needed is the check on each FIN if this channel is now completely empty, and then if it was the last one.
@ploxiln that functionality would be similar but i'm not sure it would be easier to implement because you then need a new method of propagating FIN's back up where this piggy backs on the delete functionality. - https://github.com/nsqio/nsq/issues/1302#issuecomment-733466899 discusses the tradeoffs between the two and this is the functionality i'm interested in (to remove a node from rotation) - Can you think of a compelling use case for the other, or where it becomes materially different?
https://github.com/nsqio/nsq/issues/1302#issuecomment-733466899
If you are trying to remove a nsqd instance from rotation where that nsqd instance had 10 different topics, but just one or two with notable backlogs, it would be desirable to have the topics that drain quickly deleted. Deleting promotes a better cluster hygiene where you don't have a nsqd instance which is no longer getting messages on a topic still getting consumer connections where it causes RDY to be spread thin. (i.e. think a topic that takes a day to drain in some odd circumstance.)
I see.
I didn't have a need for the other behavior, of keeping the topics/channels until exit, I just think it may be simpler to implement, and sufficient. Fewer global state checks, fewer new error cases. There was some mess with ephemeral topics/channels being deleted ... though it looks like this case may not be so bad.
@mreiferson PTAL - i think this is the main outstanding discussion to resolve - https://github.com/nsqio/nsq/pull/1305#discussion_r550717925
(Looks like this needs a rebase)
I was debating the rebase mid-review to pickup upstream go-svc changes 🤷
As @ploxiln mentioned, there are a lot of global state checks. I'm still trying to map it all out, but it feels like we're missing something.
Open to ideas
I am also still unhappy with the nil checks in the revised Get*Topic paths, especially the implicit knowledge that call sites now have that it means draining (i.e. why not return that error?).
ok; updated.