Log cluster state periodically to capture transient state for debuggability

Open hpatro opened this issue 8 months ago • 2 comments

This PR logs the CLUSTER INFO / CLUSTER NODES output to the log file every 5 seconds when the loglevel is verbose or debug.

Occasionally a few nodes do not converge with the rest of the cluster, and no logs capture the divergence. This logging would help us analyze such cases in test setups, where we can afford to log cluster information more aggressively.
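For readers skimming the thread, the gating described above can be sketched roughly as follows. This is a minimal illustration of "log every 5 seconds, only at verbose/debug", not the code in this PR: the `log_level` enum, the `cluster_state_logger` struct, and `cluster_state_log_due` are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical log levels, ordered from most to least verbose. */
enum log_level { LOG_DEBUG = 0, LOG_VERBOSE = 1, LOG_NOTICE = 2, LOG_WARNING = 3 };

/* "every 5 seconds", taken from the PR description. */
#define CLUSTER_STATE_LOG_PERIOD_MS 5000

/* Remembers when the cluster state was last dumped (hypothetical helper). */
struct cluster_state_logger {
    long long last_log_ms; /* time of the last dump in ms, or -1 if never */
};

/* Returns true when a dump is due: the configured level is verbose or debug
 * and at least one full period has elapsed since the previous dump.
 * On true, the timestamp is updated so the next dump waits a full period. */
bool cluster_state_log_due(struct cluster_state_logger *l,
                           enum log_level configured,
                           long long now_ms) {
    assert(l != NULL);
    if (configured > LOG_VERBOSE) return false; /* notice/warning: stay quiet */
    if (l->last_log_ms >= 0 &&
        now_ms - l->last_log_ms < CLUSTER_STATE_LOG_PERIOD_MS) {
        return false; /* still inside the current period */
    }
    l->last_log_ms = now_ms;
    return true;
}
```

A periodic timer callback could call `cluster_state_log_due` with the current time and emit the CLUSTER INFO / CLUSTER NODES text only when it returns true.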

hpatro, Apr 26 '25 07:04

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 70.99%. Comparing base (0b94ca6) to head (4e7f83c). Report is 8 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2011      +/-   ##
============================================
- Coverage     71.01%   70.99%   -0.03%     
============================================
  Files           123      123              
  Lines         66033    66125      +92     
============================================
+ Hits          46892    46944      +52     
- Misses        19141    19181      +40     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.19% <100.00%> (+0.10%) :arrow_up:

... and 22 files with indirect coverage changes

codecov[bot], Apr 26 '25 07:04

Is the main purpose debugging? I.e., someone notices the cluster is not behaving normally, sets the loglevel to verbose, and catches it?

enjoy-binbin, May 08 '25 07:05

> Is the main purpose debugging? I.e., someone notices the cluster is not behaving normally, sets the loglevel to verbose, and catches it?

Yes. Even when investigating an incident that occurred in the past, it is quite difficult for operators to figure out what went wrong with the current state of logging. I would like this to be active at NOTICE level, limited to information about failed nodes, which is what is actually relevant: https://github.com/valkey-io/valkey/pull/2011#discussion_r2067059493
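To make "failed nodes information" concrete, here is a rough sketch of how failing nodes could be picked out of CLUSTER NODES-style text. It assumes the usual layout where the third space-separated field is a comma-separated flags list, and it treats the `fail` and `fail?` flags as "failed"; that interpretation is an assumption for illustration. `count_failing_nodes` and `flags_indicate_failure` are hypothetical helpers, not functions from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the comma-separated flags field [flags, flags + len)
 * contains the exact token "fail" or "fail?", else 0. */
static int flags_indicate_failure(const char *flags, size_t len) {
    size_t i = 0;
    while (i < len) {
        size_t j = i;
        while (j < len && flags[j] != ',') j++;
        size_t tok = j - i;
        if ((tok == 4 && memcmp(flags + i, "fail", 4) == 0) ||
            (tok == 5 && memcmp(flags + i, "fail?", 5) == 0)) {
            return 1;
        }
        i = j + 1; /* skip the comma */
    }
    return 0;
}

/* Counts the lines whose third field (the flags) marks the node as
 * failing or possibly failing. Lines with fewer than three fields
 * are ignored. */
int count_failing_nodes(const char *nodes) {
    assert(nodes != NULL);
    int count = 0;
    const char *line = nodes;
    while (*line) {
        const char *eol = strchr(line, '\n');
        const char *end = eol ? eol : line + strlen(line);
        /* Skip the first two fields (node id and address). */
        const char *p = line;
        int spaces = 0;
        while (p < end && spaces < 2) {
            if (*p == ' ') spaces++;
            p++;
        }
        if (spaces == 2) {
            const char *q = p;
            while (q < end && *q != ' ') q++;
            count += flags_indicate_failure(p, (size_t)(q - p));
        }
        line = eol ? eol + 1 : end;
    }
    return count;
}
```

A NOTICE-level log line could then be emitted only when the count is non-zero or changes, which keeps the output small on healthy clusters.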

hpatro, Jun 16 '25 17:06