Wait for Ra servers to exit in `delete_cluster/{1,2}`
So far, when the delete_cluster/{1,2} function returned, it meant that the request to stop the cluster was committed to the Ra log and applied. However, the actual stop of the Ra servers and the related supervision tree was handled asynchronously.
This patch uses Erlang monitors to watch the exit of the Ra servers. This way, we ensure that when the function returns, the Ra servers are really gone.
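Roughly, the idea looks like this (a simplified sketch, not the actual patch; the wrapper name and the way the return value is matched are illustrative):

%% Monitor the Ra servers by their {Name, Node} ids before submitting the
%% delete command, then block until each monitored process has exited.
delete_cluster_and_wait(ServerIds, Timeout) ->
    Refs = [erlang:monitor(process, ServerId) || ServerId <- ServerIds],
    case ra:delete_cluster(ServerIds, Timeout) of
        {ok, _} = Ok ->
            [receive
                 {'DOWN', Ref, process, _Pid, _Reason} -> ok
             after Timeout ->
                     exit({timeout, waiting_for_ra_servers_exit})
             end || Ref <- Refs],
            Ok;
        Error ->
            %% The delete command was not accepted; drop the monitors.
            [erlang:demonitor(Ref, [flush]) || Ref <- Refs],
            Error
    end.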
I'm thinking this should be an option rather than the default but happy to discuss that further.
I'm not super sure making Ra cluster deletions a log command was ever strictly necessary in the first place.
It can be an option but I think most would expect it to be on by default.
I added a wait_for_servers_exit_in_delete_cluster application environment variable to toggle the new behavior I propose, plus a delete_cluster/3 function to let the caller choose the behavior in the arguments.
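For reference, this is how I picture it being used; the options map shape for delete_cluster/3 is just a guess at this point, nothing is settled:

%% Globally, through the application environment variable:
show_proposed_usage(ServerIds, Timeout) ->
    ok = application:set_env(ra, wait_for_servers_exit_in_delete_cluster, true),
    {ok, _} = ra:delete_cluster(ServerIds, Timeout).

%% Or per call, through the proposed delete_cluster/3 (hypothetical options
%% shape):
show_proposed_usage_per_call(ServerIds, Timeout) ->
    {ok, _} = ra:delete_cluster(ServerIds, Timeout,
                                #{wait_for_servers_exit => true}).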
I forgot to ask before: do you prefer an application environment variable, a function argument, or both for this kind of setting?
In fact, I'm not so sure about the behavior of the main branch and what this patch should do. I mean, currently the function blocks until the command is applied, but the servers' exit is asynchronous. If we introduce an option to toggle the wait for the servers' exit, does it mean the non-wait behavior should treat the Ra command itself as asynchronous too? As a user, if there is an option to wait or not, I would expect the function to be either entirely synchronous (including the servers' exit) or entirely asynchronous (not waiting for the Ra command either). Providing three states (synchronous, synchronous for the command only, or entirely asynchronous) seems too complicated for such a function.
A function option sounds sufficient to me. I would rename wait_for_servers_exit_in_delete_cluster to wait_for_server_termination_when_deleting_clusters if we are to keep it.
The Ra leader will not exit until all followers have received the cluster delete command, so with this option the call will never return when trying to delete a cluster where a single member is down. Is this really the behaviour you want?
I think an options argument (wait_for_exits) would be better so that the choice can be made per cluster.
I should have added more context to the issue's description. Khepri's testsuite did the following:
_ = ra:delete_cluster(ServerIds),
_ = supervisor:terminate_child(ra_systems_sup, RaSystem),
_ = supervisor:delete_child(ra_systems_sup, RaSystem),
I tried replacing the supervisor calls with an application:stop(ra). In the end, I just want any Ra state to go away so I can start a fresh cluster for the next testcase.
In both cases, supervisor logs an error with Erlang 23:
=SUPERVISOR REPORT==== 7-Apr-2022::08:37:55.154657 ===
supervisor: {<0.2764.0>,ra_server_sup}
errorContext: shutdown_error
reason: {shutdown,delete}
offender: [{pid,<0.2765.0>},
{id,async_unset_in_put_test_},
{mfargs,
{ra_server_proc,start_link,
...
(Erlang 24's supervisor kind of ignores the situation and doesn't log anything)
The workaround I use in Khepri's testsuite is what I put in this patch (monitor and wait for Ra servers to exit). To me, it makes sense that when ra:delete_cluster/2 returns, the cluster is gone (including its processes).
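For completeness, the testsuite workaround boils down to something like this (simplified; the helper name is mine, not Khepri's actual code):

%% Monitor each server before deleting the cluster, so we can tell when the
%% processes are actually gone, then tear the Ra system down for the next
%% testcase.
cleanup_ra_cluster(ServerIds, RaSystem) ->
    Refs = [erlang:monitor(process, ServerId) || ServerId <- ServerIds],
    _ = ra:delete_cluster(ServerIds),
    [receive {'DOWN', Ref, process, _, _} -> ok end || Ref <- Refs],
    _ = supervisor:terminate_child(ra_systems_sup, RaSystem),
    _ = supervisor:delete_child(ra_systems_sup, RaSystem),
    ok.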
If a cluster member is down, I don't know. As you said, perhaps this shouldn't go through an applied command. Perhaps processes should simply delete their state (memory and disk) and terminate.
On the other hand, I don't know if the current behavior is incorrect from a supervision tree point of view: Erlang 23 did log an error, but Erlang 24 doesn't anymore. Perhaps it's not worth the effort. I can keep the workaround in Khepri's testsuite and prepare another patch to give more details in the delete_cluster/2 documentation.