khepri
khepri copied to clipboard
khepri_machine: Add command deduplication mechanism
Why
The "client" side of khepri_machine
implemented in process_sync_command/3
have a retry mechanism if ra:process_command/3
returns an error such as noproc
, nodedown
or shutdown
.
However, this retry mechanism can't tell if the state machine already received the command and just couldn't reply, for instance because there is a node stopping or a change of leadership.
Therefore, it's possible that the same command is submitted twice and thus processed twice.
That's ok for idempotent commands, but it may not be alright for all transactions for example. That's why we need a deduplication mechanism that ensures the same command is not applied multiple times.
How
Two new commands are introduced to implement the deduplication system:
-
#dedup{}
which is used to wrap the command to protect and assign a unique reference to it -
#dedup_ack{}
which is used at the end of the retry loop to let the state machine know that the "client" side received the reply
When the state machine receives a command wrapped into a #dedup{}
command, it will remember the reply for the initial processing of that command. For any subsequent copies of the same #dedup{}
(based on the unique reference), the state machine will not apply the wrapped command and will simply returned the reply it remembered from the first application.
Later when the state machine receives a #dedup_ack{}
, it will drop the cached reply for that reference.
Just in case the client never sends a #dedup_ack{}
, the state machine will drop any expired cached entries. The expiration time is based on the command timeout. If it's infinity, it defaults to 15 minutes.
This whole deduplication mechanism can be enabled or disabled through the new protect_against_dups
command option which takes a boolean. This option is off by default, except for R/W transactions.
Thus if the caller knows the transation is idempotent, it can decide to turn the dedup mechanism off.
V2: We now use the effective_machine_version
counter provided by ra_counters:counters/2
if it is available as it is faster than querying the Ra server. If the counter is unavailable, we fall back to the query. The new counter is added by rabbitmq/ra#426 and will be used once a Ra release contains this change.
Codecov Report
Attention: Patch coverage is 90.98361%
with 11 lines
in your changes are missing coverage. Please review.
Project coverage is 89.64%. Comparing base (
a478a9e
) to head (c6465c2
).
Files | Patch % | Lines |
---|---|---|
src/khepri_machine.erl | 92.30% | 8 Missing :warning: |
src/khepri_machine_v0.erl | 82.35% | 3 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #250 +/- ##
==========================================
- Coverage 89.70% 89.64% -0.06%
==========================================
Files 20 21 +1
Lines 3002 3091 +89
==========================================
+ Hits 2693 2771 +78
- Misses 309 320 +11
Flag | Coverage Δ | |
---|---|---|
erlang-24 | 88.67% <90.16%> (+<0.01%) |
:arrow_up: |
erlang-25 | 88.64% <90.98%> (-0.03%) |
:arrow_down: |
erlang-26 | 89.29% <90.98%> (-0.05%) |
:arrow_down: |
os-ubuntu-latest | 89.58% <90.98%> (+<0.01%) |
:arrow_up: |
os-windows-latest | 89.38% <90.98%> (-0.06%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I need to bump the state machine version and convert the state.
The pull request is ready for review. The machine version is taken into account, so if the machine is too old, the public API won’t use the dedup mechanism.
I tested this patch successfully with rolling_upgrade_SUITE
in the shovel plugin (the testsuite was failing quite easily without the deduplication).