Dropped pipeline command during leadership transfer
[Edit: I am using ra v1.1.9]
While transferring leadership, if the 'leaving' leader receives a command in await_condition state, the command is dropped. In my trace, I identified ra_server:transfer_leadership_condition/2 as the end of the line for the command message.
Here is an excerpt of custom logging I added to ra to debug this issue:
13:34:57.674 [error] ra_server_proc:await_condition cast {command,low,{'$usr',{claim,"leadership_transfer_while_async_claiming_0",<0.452.0>,false,1627061697674263447},{notify,-576460752303423422,<0.359.0>}}}
13:34:57.674 [error] ra_server:transfer_leadership_condition {command,low,{'$usr',{claim,"leadership_transfer_while_async_claiming_0",<0.452.0>,false,1627061697674263447},{notify,-576460752303423422,<0.359.0>}}}
13:34:57.674 [error] ra_server:handle_await_condition (false clause)
There are no logs after this point that include the details from my command nor the correlation id.
To produce this result, I issued a ra:pipeline_command with a correlation id directly after a call to transfer_leadership. The command is not processed by the ra server, so my calling process never receives an ra_event message. I think a rejected ra_event message would be preferable in this case, as my calling process would then be able to react accordingly.
Note: As far as I know, this behavior is only possible using the ra:transfer_leadership/2 function. I don't have any reason to believe a similar bug would exist during automated leader election routines.
Thanks!
Thank you for the details.
@JesseStimpson if you received the rejected notification, what would you do?
@kjnilsson In our application, we're using ra as a distributed process registry. Upon receiving a rejected event the process to-be-registered is stopped, and other routines in the application would restart it at a future time, not unlike the standard supervisor pattern.
Presumably this restart would happen when the transfer leadership is complete, and the process would be able to register successfully at that time.
When there is no rejected event received, the process to-be-registered is in an unknown state and therefore must implement a timeout.
I'd argue that any caller using a pipelined command with a correlation needs to implement a timeout. How else would you know to ever retry?
That said, we could reject commands in this state, but for your use case I am not convinced it will be enough for all possible states the system can be in.
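For illustration, the timeout-based handling discussed here could look roughly like the sketch below. The module name ra_await_sketch and the function await/2 are hypothetical and not part of ra. The ra_event shapes are the ones described in the ra documentation for pipelined commands, and should be checked against the ra version in use.

```erlang
-module(ra_await_sketch).
-export([await/2]).

%% Hypothetical helper, not part of ra.
%% Waits for the outcome of one pipelined command identified by Corr.
%% Assumed event shapes (per the ra docs; verify for your version):
%%   {ra_event, Leader, {applied, [{Corr, Reply}]}}
%%   {ra_event, From, {rejected, {not_leader, LeaderHint, Corr}}}
-spec await(term(), non_neg_integer()) ->
    {applied, term()} | {rejected, term()} | timeout.
await(Corr, TimeoutMs) ->
    %% Use one overall deadline so unrelated events cannot extend the wait.
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    loop(Corr, Deadline).

loop(Corr, Deadline) ->
    Remaining = max(0, Deadline - erlang:monotonic_time(millisecond)),
    receive
        {ra_event, _From, {applied, Applied}} ->
            case lists:keyfind(Corr, 1, Applied) of
                {Corr, Reply} ->
                    {applied, Reply};
                false ->
                    %% Batch for other correlations: this sketch discards it.
                    loop(Corr, Deadline)
            end;
        {ra_event, _From, {rejected, {not_leader, LeaderHint, Corr}}} ->
            %% Corr is already bound, so only our own rejection matches.
            {rejected, {not_leader, LeaderHint}}
    after Remaining ->
        %% No event arrived: the command may have been dropped or applied.
        timeout
    end.
```

A caller might issue ra:pipeline_command(ServerId, Command, Corr, low) and then call ra_await_sketch:await(Corr, 5000), retrying on timeout or on a rejected result. Because a timeout does not prove the command was never applied, the command should be safe to apply twice.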
That's a fair point. A retry on timeout is still the safest practice. This behavior isn't impacting our use of ra, so we are not blocked in any way.
When we were testing explicit leadership transfers, this seemed like a nice case to use a 'rejected' message since the previous leader is still up to reject it, but I agree it's not a blocker to using ra.
Hello, I am going to close this issue. I see that there have been several changes in between 1.1.9 and 2.7.0 related to leadership transfer and the await_condition state. It seems possible that the issue has been fixed -- at the very least my report is probably no longer correct for the latest code.
(For the record, we have been using ra in production with great results and will continue to do so!)