dragonfly icon indicating copy to clipboard operation
dragonfly copied to clipboard

Consider implementing a dragonfly stop command

Open ashotland opened this issue 2 years ago • 12 comments

Doing version upgrade failover tests we encountered the following scenario

Calling shutdown on the master it restarts quickly and the client reconnects to it

We'd like a stop command that will make dragonfly stop accepting request and respond with stopped error to all commands.

Alternatives:

  • Develop a way for the controlplane to systemctl stop dragonfly.service
  • Stop the host

If we go for this, replica take over should also terminate in stop rather than shutdown.

ashotland avatar Jun 20 '23 11:06 ashotland

After discussing with @ashotland the stop command will stop all open connections and will reject new connections

adiholden avatar Jun 20 '23 12:06 adiholden

I'm stating the obvious here, but there won't be any way to remotely kill the server. I guess that's acceptable.

chakaz avatar Jun 20 '23 12:06 chakaz

Or do you plan to keep the admin port active?

chakaz avatar Jun 20 '23 12:06 chakaz

Redis has CLIENT PAUSE [WRITE] with established semantics. Do you think they're good enough or do we have a separate use case?

royjacobson avatar Jun 21 '23 11:06 royjacobson

I discussed it with @ashotland offline. He is primarily worried about a scenario where a server becomes unresponsive.

  1. First, we agreed that it's an unrealistic scenario, because Dragonfly does not have any bugs.
  2. Secondly, even if it had a tiny bug, that for an improbable reason would render the server unresponsive, we could not rely on the server to obey this "STOP" command being unresponsive that is. So in any case it's not very useful.

romange avatar Jun 21 '23 13:06 romange

@romange I think you are mixing between the HA/fail-over scenario and version upgrade (see the description).

This issue is about the later, where we need to proactively stop the master. How do you suggest that we disconnect and prevent connections in this context?

ashotland avatar Jun 21 '23 14:06 ashotland

What's the final condition for "replica takover" command @royjacobson ? We do close all the connections to maintain consistency. Do we allow them to be reopened again? I would expect that the master reaches a limbo state where it does not allow new connections on the primary port or alternatively send a custom "MOVED" error even for non-cluster mode.

romange avatar Jun 21 '23 14:06 romange

@romange, According to the design doc the master performs shutdown.

But with our systemd settings clients can reconnect quickly after restart.

We should also consider what to do until replica take over is ready for prime time. Or may have other scenarios where we'll want to stop a master.

@royjacobson - I am not sure that https://redis.io/commands/client-pause/ is suitable as it doesn't seem to close connections or block new connections (but maybe I misunderstand the documentation).

ashotland avatar Jun 21 '23 19:06 ashotland

I just do not think that STOP is the right tool, for example for systemd restarts: now you need to race with other clients to configure your restarted instance. Also, replica takeover should be ready soon (right @royjacobson ? ) In any case, as Roy said, "CLIENT PAUSE" is the way to do it but I still doubt it covers your cases reliably.

romange avatar Jun 21 '23 19:06 romange

Now I read your last sentence. yes, Redis behavior does not cover new connections but we could change these semantics to fit our needs. In any case, I am still not persuaded that a command is the way to go.

romange avatar Jun 21 '23 20:06 romange

In the MVP and the current implementation of replica takover which is soon to be finished master shuts down. The problem is that after it will shut down the systemd will restart the server. So I believe we should instead shutting down the server just close all connections and not allow new connections, this should be the final step of the replica takeover in master side and not a command to be issued from the control plane, as you said there will be a race between clients that will reconnect to the server when its loading and us calling the stop command. In additions we could add a command that does this logic, but I dont think it should be part of the replica takover flow.

adiholden avatar Jun 22 '23 06:06 adiholden

OK, I am fine with punting on the new command for now and just changing the replica take over end state.

ashotland avatar Jun 22 '23 06:06 ashotland

per our discussion, closing the issue

romange avatar Jul 10 '23 03:07 romange