
Dropped pipeline command during leadership transfer

Open JesseStimpson opened this issue 2 years ago • 5 comments

[Edit: I am using ra v1.1.9]

While transferring leadership, if the 'leaving' leader receives a command while in the await_condition state, the command is dropped. In my trace, I identified ra_server:transfer_leadership_condition/2 as the end of the line for the command message.

Here is an excerpt of custom logging I added to ra to debug this issue:

13:34:57.674  [error] ra_server_proc:await_condition cast {command,low,{'$usr',{claim,"leadership_transfer_while_async_claiming_0",<0.452.0>,false,1627061697674263447},{notify,-576460752303423422,<0.359.0>}}}
13:34:57.674  [error] ra_server:transfer_leadership_condition {command,low,{'$usr',{claim,"leadership_transfer_while_async_claiming_0",<0.452.0>,false,1627061697674263447},{notify,-576460752303423422,<0.359.0>}}}
13:34:57.674  [error] ra_server:handle_await_condition (false clause)

There are no logs after this point that include the details from my command nor the correlation id.

To produce this result, I called ra:pipeline_command with a correlation id directly after a call to transfer_leadership. The command is not processed by the ra server, so my calling process never receives an ra_event message. I think a rejected ra_event message would be preferable in this case, as my calling process would then be able to react accordingly.

Note: As far as I know, this behavior is only possible using the ra:transfer_leadership/2 function. I don't have any reason to believe a similar bug would exist during automated leader election routines.

Thanks!

JesseStimpson avatar Jul 23 '21 17:07 JesseStimpson

Thank you for the details.

michaelklishin avatar Jul 23 '21 18:07 michaelklishin

@JesseStimpson if you received the rejected notification what would you do?

kjnilsson avatar Jul 27 '21 10:07 kjnilsson

@kjnilsson In our application, we're using ra as a distributed process registry. Upon receiving a rejected event, the process to-be-registered is stopped, and other routines in the application restart it at a later time, not unlike the standard supervisor pattern.

Presumably this restart would happen after the leadership transfer is complete, and the process would be able to register successfully at that time.

When there is no rejected event received, the process to-be-registered is in an unknown state and therefore must implement a timeout.

JesseStimpson avatar Jul 28 '21 01:07 JesseStimpson

I'd argue that any caller using a pipelined command with a correlation needs to implement a timeout. How else would you know to ever retry?

That said, we could reject commands in this state, but for your use case I am not convinced it will be enough for all possible states the system can be in.
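The timeout approach described above can be sketched roughly as follows. This is an illustration, not code from ra or from this thread: the module name `ra_corr_wait` and the function `await/2` are hypothetical, and the `ra_event` message shapes (`{applied, [{Corr, Reply}]}` and `{rejected, {not_leader, Leader, Corr}}`) are assumed from the ra documentation and may differ between versions. The caller would invoke it right after `ra:pipeline_command/4` and retry on `timeout` or `rejected`.

```erlang
-module(ra_corr_wait).
-export([await/2]).

%% Hypothetical helper: wait for the ra_event carrying correlation Corr,
%% or give up after TimeoutMs milliseconds.
%% Returns {applied, Reply} | {rejected, {not_leader, Leader}} | timeout.
await(Corr, TimeoutMs) ->
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    loop(Corr, Deadline).

loop(Corr, Deadline) ->
    %% Recompute the remaining time so unrelated events cannot extend the wait.
    Remaining = max(0, Deadline - erlang:monotonic_time(millisecond)),
    receive
        %% Assumed shape: one applied event may carry several correlations.
        %% Note: this sketch discards applied events for other correlations.
        {ra_event, _From, {applied, Applied}} ->
            case lists:keyfind(Corr, 1, Applied) of
                {Corr, Reply} -> {applied, Reply};
                false -> loop(Corr, Deadline)
            end;
        %% Assumed shape: Corr is already bound, so only our rejection matches.
        {ra_event, _From, {rejected, {not_leader, Leader, Corr}}} ->
            {rejected, {not_leader, Leader}}
    after Remaining ->
        %% No event arrived (e.g. the command was dropped): caller should retry.
        timeout
    end.
```

With a helper like this, a dropped command and an explicit rejection both lead to the same retry path; the rejection only lets the caller react sooner.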

kjnilsson avatar Mar 23 '22 16:03 kjnilsson

That's a fair point. A retry on timeout is still the safest practice. This behavior isn't impacting our use of ra, so we are not blocked in any way.

When we were testing explicit leadership transfers, this seemed like a nice case to use a 'rejected' message since the previous leader is still up to reject it, but I agree it's not a blocker to using ra.

JesseStimpson avatar Mar 23 '22 17:03 JesseStimpson

Hello, I am going to close this issue. I see that there have been several changes in between 1.1.9 and 2.7.0 related to leadership transfer and the await_condition state. It seems possible that the issue has been fixed -- at the very least my report is probably no longer correct for the latest code.

(For the record, we have been using ra in production with great results and will continue to do so!)

JesseStimpson avatar Nov 28 '23 19:11 JesseStimpson