
Logs consuming significant compute during cluster node failure detection

Open sarthakaggarwal97 opened this issue 7 months ago • 2 comments

The problem/use-case that the feature addresses

This issue is to discuss the server log emitted during cluster failover in large clusters. This particular log appears to take about 3-4% of the total compute (which is already around 100%).

serverLog(LL_NOTICE, "Node %.40s (%s) reported node %.40s (%s) as not reachable.", sender->name,
     sender->human_nodename, node->name, node->human_nodename);

Sharing the profile over here:

Image

Description of the feature

I would like to discuss whether we can reduce the severity of this log.
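To illustrate why a lower severity could help, here is a minimal, self-contained sketch of level-gated logging. It is not Valkey's code: `LOG_AT`, `log_verbosity`, `log_emitted`, and `reportUnreachable` are hypothetical names, and the level ordering is an assumption modeled on the usual debug < verbose < notice < warning scheme. The point is that when the level check happens before formatting, a message below the configured verbosity costs only an integer comparison.

```c
#include <stdio.h>

/* Hypothetical levels for this sketch, ordered from least to most severe. */
enum { LOG_DEBUG = 0, LOG_VERBOSE, LOG_NOTICE, LOG_WARNING };

static int log_verbosity = LOG_NOTICE; /* assumed default threshold */
static int log_emitted = 0;            /* counts messages actually written */

/* Check the level first, so filtered messages skip formatting and I/O. */
#define LOG_AT(level, ...)                  \
    do {                                    \
        if ((level) < log_verbosity) break; \
        log_emitted++;                      \
        fprintf(stderr, __VA_ARGS__);       \
        fputc('\n', stderr);                \
    } while (0)

/* Same message shape as the log under discussion, with the level passed in
 * so both severities can be compared. */
static void reportUnreachable(int level, const char *sender, const char *node) {
    LOG_AT(level, "Node %.40s reported node %.40s as not reachable.", sender, node);
}
```

With the threshold at `LOG_NOTICE`, calling `reportUnreachable(LOG_VERBOSE, ...)` returns without formatting or writing anything, while operators who still want the message can lower the threshold.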

sarthakaggarwal97 May 13 '25 06:05

One of the issues is that we are constantly opening and closing the log file for each write. There might be ways to buffer the write specifically for this.
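A rough sketch of what buffering could look like, using only standard C stdio. This is an illustration under assumptions, not Valkey's logging code: the `bufferedLog` type and the `bufferedLog*` functions are hypothetical names. The file is opened once, kept open, and given a full stdio buffer so that many small log lines are coalesced into fewer writes.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical buffered logger for illustration only. */
typedef struct {
    FILE *fp;            /* stays open across writes */
    char buf[64 * 1024]; /* stdio buffer; must outlive the stream */
} bufferedLog;

/* Open the file once in append mode and enable full buffering.
 * Returns 0 on success, -1 on failure. */
static int bufferedLogOpen(bufferedLog *l, const char *path) {
    l->fp = fopen(path, "a");
    if (l->fp == NULL) return -1;
    /* setvbuf must be called before any other operation on the stream. */
    if (setvbuf(l->fp, l->buf, _IOFBF, sizeof(l->buf)) != 0) {
        fclose(l->fp);
        l->fp = NULL;
        return -1;
    }
    return 0;
}

/* Format one line into the buffer; no open/close per message.
 * Returns the number of characters written, or -1 on error. */
static int bufferedLogWrite(bufferedLog *l, const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int n = vfprintf(l->fp, fmt, ap);
    va_end(ap);
    if (n < 0 || fputc('\n', l->fp) == EOF) return -1;
    return n + 1;
}

/* Push buffered lines to the file, e.g. periodically or after a warning. */
static int bufferedLogFlush(bufferedLog *l) {
    return fflush(l->fp);
}

/* Flush and close the file. */
static int bufferedLogClose(bufferedLog *l) {
    int rc = fclose(l->fp);
    l->fp = NULL;
    return rc;
}
```

The trade-off is that buffered lines can be lost if the process crashes before a flush, so a real design would likely flush immediately for high-severity messages and would need to handle log rotation while the file stays open.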

madolson May 13 '25 22:05

For discussion about keeping the log fd open, see

  • #906

zuiderkwast May 19 '25 20:05

I think LTTng is the way to go for generic logging when possible. #2135

PingXie Jun 08 '25 22:06