Save cluster file in bio to avoid the stuck io latency

Open enjoy-binbin opened this issue 4 months ago • 3 comments

When the cluster changes, we need to persist the cluster configuration. If I/O is delayed or blocked, possibly by disk contention, this can cause large latencies on the main thread.

We should avoid synchronous I/O on the main thread. So in this commit, we use bio to save the config file. We add a bio job and pass it an sds copy of the config file contents; the bio thread then does the synchronous save, so an eventually consistent version is always stored on disk.

This may break our previous assumption that nodes.conf is always in sync and strongly consistent. For shutdown and cluster saveconfig, we wait for the pending bio jobs to drain and then trigger a new save synchronously.

Closes #2424.

enjoy-binbin avatar Aug 27 '25 12:08 enjoy-binbin

Codecov Report

:x: Patch coverage is 96.00000% with 2 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 72.49%. Comparing base (04d0bba) to head (8dec4e0). :warning: Report is 1 commits behind head on unstable.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/bio.c | 85.71% | 1 Missing :warning: |
| src/cluster_legacy.c | 97.67% | 1 Missing :warning: |
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2555      +/-   ##
============================================
+ Coverage     72.41%   72.49%   +0.07%     
============================================
  Files           129      129              
  Lines         70528    70548      +20     
============================================
+ Hits          51076    51146      +70     
+ Misses        19452    19402      -50     
| Files with missing lines | Coverage Δ |
|---|---|
| src/bio.c | 85.10% <85.71%> (-0.19%) :arrow_down: |
| src/cluster_legacy.c | 87.60% <97.67%> (+0.10%) :arrow_up: |

... and 17 files with indirect coverage changes

codecov[bot] avatar Aug 27 '25 12:08 codecov[bot]

Core team meeting:

  1. Added some reviewers to make sure this makes progress, since this seems to have been forgotten. Some concerns raised about double voting.

madolson avatar Nov 24 '25 17:11 madolson

Can we target this fix for the upcoming patch release?

cherukum-Amazon avatar Nov 26 '25 20:11 cherukum-Amazon

One concern about shutdown: I think we should call bioDrainWorker in finishShutdown to wait for the cluster config write to finish.

soloestoy avatar Dec 04 '25 06:12 soloestoy

@cherukum-Amazon sorry for the delay, I somehow lost the context a while ago; I will try to refresh it this week. Let's start working on #1032 first and try to push it forward.

> One concern about shutdown: I think we should call bioDrainWorker in finishShutdown to wait for the cluster config write to finish.

yes, we do call bioDrainWorker in finishShutdown.

enjoy-binbin avatar Dec 08 '25 04:12 enjoy-binbin