valkey icon indicating copy to clipboard operation
valkey copied to clipboard

Replica flush the old data after RDB file is ok in disk-based replication

Open enjoy-binbin opened this issue 1 year ago • 5 comments

Call emptyData right before rdbLoad to prevent errors in the middle and we drop the replication stream and leaving an empty database.

enjoy-binbin avatar Aug 20 '24 10:08 enjoy-binbin

Codecov Report

Attention: Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.

Project coverage is 70.63%. Comparing base (7795152) to head (f4316a0). Report is 80 commits behind head on unstable.

Files with missing lines Patch % Lines
src/replication.c 57.14% 3 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #926      +/-   ##
============================================
+ Coverage     70.37%   70.63%   +0.25%     
============================================
  Files           112      114       +2     
  Lines         61505    61670     +165     
============================================
+ Hits          43286    43559     +273     
+ Misses        18219    18111     -108     
Files with missing lines Coverage Δ
src/replication.c 87.29% <57.14%> (+0.14%) :arrow_up:

... and 41 files with indirect coverage changes

codecov[bot] avatar Aug 20 '24 10:08 codecov[bot]

@enjoy-binbin did we have a test failure here, or should we introduce new test for this case?

ranshid avatar Aug 20 '24 12:08 ranshid

did we have a test failure here, or should we introduce new test for this case?

emmm, i don't think we have a test failure here, we don't have test that will cover this case, i test it in local. We should actually add a new test to verify this, but that will require us to add some debug code. (or i can do try to make the rename fail, i can take a try)

enjoy-binbin avatar Aug 20 '24 15:08 enjoy-binbin

It's not a breaking change? The behavior change is that if the sync fails, the old data is still there, instead of deleted. It seems better. Nobody relies on the old behaviour I think...?

I don't think it is a breaking change (but somehow it can be?). Yes, actually it is, if the following fails, we can still keep the data in memory.

        /* Make sure the new file (also used for persistence) is fully synced
         * (not covered by earlier calls to rdb_fsync_range). */
        if (fsync(server.repl_transfer_fd) == -1) {
            ...
            cancelReplicationHandshake(1);
            return;
        }

        /* Rename rdb like renaming rewrite aof asynchronously. */
        int old_rdb_fd = open(server.rdb_filename, O_RDONLY | O_NONBLOCK);
        if (rename(server.repl_transfer_tmpfile, server.rdb_filename) == -1) {
            ...
            cancelReplicationHandshake(1);
            if (old_rdb_fd != -1) close(old_rdb_fd);
            return;
        }
        /* Close old rdb asynchronously. */
        if (old_rdb_fd != -1) bioCreateCloseJob(old_rdb_fd, 0, 0);

        /* Sync the directory to ensure rename is persisted */
        if (fsyncFileDir(server.rdb_filename) == -1) {
            ...
            cancelReplicationHandshake(1);
            return;
        }

enjoy-binbin avatar Aug 20 '24 15:08 enjoy-binbin

@valkey-io/core-team Do you think this is a breaking change?

zuiderkwast avatar Aug 20 '24 15:08 zuiderkwast

I do not think it is a break change, but it changes some functions call position, but I can not find the reason why we should do this way.

hwware avatar Aug 22 '24 14:08 hwware

i am going to merge this one, i don't think it is a breaking change, does any of you have other concerns?

enjoy-binbin avatar Sep 13 '24 02:09 enjoy-binbin