nsq icon indicating copy to clipboard operation
nsq copied to clipboard

nsqd: support draining messages / removing nsqd from rotation

Open jehiah opened this issue 3 years ago • 6 comments

This adds support for a nsqd mode where messages are drained to facilitate removing a nsqd instance from a cluster.

New nsqd API Endpoints

  • PUT /state/shutdown
  • PUT /state/drain
  • POST /topic/drain?topic=...

New nsqd CLI options (also settable via config file)

  • --sigterm-mode=shutdown (default - existing behavior) and --sigterm-mode=drain (new option)

Resolves #1302

jehiah avatar Nov 25 '20 14:11 jehiah

RFR @mreiferson @ploxiln - this is ready for a review pass.

I'm pretty happy with how this came out, but still have another pass to make to expand some test coverage.

jehiah avatar Dec 22 '20 22:12 jehiah

It seems like it would be simpler to just dis-allow publishing messages. Allow creating channels like normal, and don't bother deleting any before regular shutdown. Just return error for any message publish request. The only other bit of logic needed is the check on each FIN if this channel is now completely empty, and then if it was the last one.

ploxiln avatar Dec 29 '20 01:12 ploxiln

@ploxiln that functionality would be similar but i'm not sure it would be easier to implement because you then need a new method of propagating FIN's back up where this piggy backs on the delete functionality. - https://github.com/nsqio/nsq/issues/1302#issuecomment-733466899 discusses the tradeoffs between the two and this is the functionality i'm interested in (to remove a node from rotation) - Can you think of a compelling use case for the other, or where it becomes materially different?

jehiah avatar Dec 29 '20 02:12 jehiah

https://github.com/nsqio/nsq/issues/1302#issuecomment-733466899

If you are trying to remove a nsqd instance from rotation where that nsqd instance had 10 different topics, but just one or two with notable backlogs, it would be desirable to have the topics that drain quickly deleted. Deleting promotes a better cluster hygiene where you don't have a nsqd instance which is no longer getting messages on a topic still getting consumer connections where it causes RDY to be spread thin. (i.e. think a topic that takes a day to drain in some odd circumstance.)

I see.

I didn't have a need for the other behavior, of keeping the topics/channels until exit, I just think it may be simpler to implement, and sufficient. Fewer global state checks, fewer new error cases. There was some mess with ephemeral topics/channels being deleted ... though it looks like this case may not be so bad.

ploxiln avatar Dec 29 '20 03:12 ploxiln

@mreiferson PTAL - i think this is the main outstanding discussion to resolve - https://github.com/nsqio/nsq/pull/1305#discussion_r550717925

jehiah avatar Jan 19 '21 15:01 jehiah

(Looks like this needs a rebase)

I was debating the rebase mid-review to pickup upstream go-svc changes 🤷

As @ploxiln mentioned, there are a lot of global state checks. I'm still trying to map it all out, but it feels like we're missing something.

Open to ideas

I am also still unhappy with the nil checks in the revised Get*Topic paths, especially the implicit knowledge that call sites now have that it means draining (i.e. why not return that error?).

ok; updated.

jehiah avatar Mar 01 '21 04:03 jehiah