[Crashed] trigger repl_curr_off!= master_repl_offset assert failure when having pending write cases
It is a master/slave use case, when there is a pending write situation, we are seeing the following assert failure:
replication.cpp:5714 'replica->repl_curr_off != g_pserver->master_repl_offset' is not true
This problem only occurs when there are pending client writes. If there are pending write operations, and when repl_curr happens reset to have the same value as repl_offset at a moment, it triggers the assert failure. This was discovered when we ran a migration use case. The details are as follows:
- Load some cache on a slave/master server
- Configure a second slave (a python script is executed for configuring and starting a second slave), etc.)
- A master server crash occurs since there are pending writes when the code is validating the assertion above.
The fix is to wait until all pending writes have been completed before performing "ProcessPendingAsyncWrites". (see the MR).
Here is the part of the stacktrace collected from the master server:
:16:M 24 Oct 2022 11:00:11.086 audit: receive "psync ea81fbf568320eb1bfbf5219940390f363c8cceb 1" from trusted network, client: 172.31.45.11:30385(:0), db:0 1:16:M 24 Oct 2022 11:00:11.086 * Replica 172.31.45.11:6379 asks for synchronization 1:16:M 24 Oct 2022 11:00:11.086 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'ea81fbf568320eb1bfbf5219940390f363c8cceb', my replication IDs are '3637bc8d4bedc4c5fe3e118da629b05e1552230c' and '0000000000000000000000000000000000000000') 1:16:M 24 Oct 2022 11:00:11.086 * Starting BGSAVE for SYNC with target: disk 1:16:M 24 Oct 2022 11:00:11.088 * Background saving started 1:16:M 24 Oct 2022 11:00:11.088 * flushReplBacklogToClients: replica->repl_curr_off=11473417814; g_pserver->master_repl_offset=11473417814
=== KEYDB BUG REPORT START: Cut & paste starting from here === 1:16:M 24 Oct 2022 11:00:11.088 # === ASSERTION FAILED === 1:16:M 24 Oct 2022 11:00:11.088 # ==> replication.cpp:5714 'replica->repl_curr_off != g_pserver->master_repl_offset' is not true
------ STACK TRACE ------
Backtrace: /opt/redis/redis-server 0.0.0.0:6379(flushReplBacklogToClients()+0x2bf) [0x62f5cf] /opt/redis/redis-server 0.0.0.0:6379(processCommand(client*, int)+0xde5) [0x67dac5] /opt/redis/redis-server 0.0.0.0:6379(processCommandAndResetClient(client*, int)+0x65) [0x5d9685] /opt/redis/redis-server 0.0.0.0:6379(processInputBuffer(client*, bool, int)+0x174) [0x5dc2a4] /opt/redis/redis-server 0.0.0.0:6379(processClients()+0xc9) [0x5dd799] /opt/redis/redis-server 0.0.0.0:6379() [0x6a27b2] /opt/redis/redis-server 0.0.0.0:6379(beforeSleep(aeEventLoop*)+0x16e) [0x59bc4e] /opt/redis/redis-server 0.0.0.0:6379(aeProcessEvents+0xe8) [0x597858] /opt/redis/redis-server 0.0.0.0:6379(aeMain+0x37) [0x59b947] /opt/redis/redis-server 0.0.0.0:6379(workerThreadMain(void*)+0x64) [0x6a2844] /usr/lib64/libpthread.so.0(+0x8f4b) [0x7fab63752f4b] /usr/lib64/libc.so.6(clone+0x3f) [0x7fab6368a7ef]
@JohnSully @benschermel John and Ben, i found a crash which happens only where there are pending writes on the client, and it triggers an assert failure. Please review the fix. The detail description can be found in the MR. Thank you.
@JohnSully @benschermel John and Ben, based on our discussion, I changed to using LL_DEBUG. Please merge it. Thank you.
@JohnSully @benschermel John and Ben, a reminder please review and merge this fix. Thank you.
commits: https://github.com/Snapchat/KeyDB/commit/560d8a839cb2c9c37cc8c919dee0435122e746ee https://github.com/Snapchat/KeyDB/commit/496bc38fda87face2601173af9357275476b0657