Logs consuming significant compute during cluster node failure detection

Open sarthakaggarwal97 opened this issue 7 months ago • 2 comments

The problem/use-case that the feature addresses

This issue is to discuss the server log emitted during cluster failover in large clusters. It looks like this particular log line accounts for about 3-4% of total compute, at a time when CPU utilization is already around 100%.

serverLog(LL_NOTICE, "Node %.40s (%s) reported node %.40s (%s) as not reachable.", sender->name,
     sender->human_nodename, node->name, node->human_nodename);

Sharing the profile over here:

Description of the feature

I would like to discuss if we can reduce the severity of this log.
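To illustrate why lowering the severity could help, here is a minimal, self-contained sketch. It assumes the logger drops messages below the configured verbosity before doing any formatting or file I/O. The `DEMO_LL_*` levels, `demo_verbosity`, and `demo_log` are hypothetical stand-ins and not the server's actual code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-ins for the server's log levels (illustration only). */
enum { DEMO_LL_DEBUG = 0, DEMO_LL_VERBOSE = 1, DEMO_LL_NOTICE = 2, DEMO_LL_WARNING = 3 };

/* Hypothetical configured verbosity: messages below this level are dropped. */
static int demo_verbosity = DEMO_LL_NOTICE;

/* Counts messages that were actually formatted and written. */
static unsigned long demo_emitted = 0;

/* Returns 1 if the message was written, 0 if it was filtered out.
 * The level check happens first, so a filtered message costs only
 * one comparison: no formatting and no I/O. */
int demo_log(int level, const char *fmt, ...) {
    char msg[1024];
    va_list ap;

    assert(fmt != NULL);
    if (level < demo_verbosity) return 0;

    va_start(ap, fmt);
    vsnprintf(msg, sizeof(msg), fmt, ap);
    va_end(ap);

    fprintf(stderr, "%s\n", msg);
    demo_emitted++;
    return 1;
}
```

With `demo_verbosity` left at `DEMO_LL_NOTICE`, a call such as `demo_log(DEMO_LL_VERBOSE, "Node %s reported node %s as not reachable.", a, b)` returns before any formatting happens. The trade-off is that operators who want the message must raise the verbosity.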

May 13 '25 06:05 sarthakaggarwal97

One of the issues is that we are constantly opening and closing the log file for each write. There might be ways to buffer the write specifically for this.
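A rough sketch of what buffering could look like, assuming the log file is kept open and lines are accumulated in memory until the buffer fills or the log is closed. All names here (`buffered_log`, `buffered_log_write`, and so on) are hypothetical and not taken from the codebase.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical buffered logger (illustration only). */
typedef struct {
    FILE *fp;              /* opened once and kept open across writes */
    char buf[4096];        /* pending bytes not yet written to fp */
    size_t used;           /* number of valid bytes in buf */
    unsigned long flushes; /* how many times buf was written out */
} buffered_log;

/* Opens the log file in append mode. Returns 0 on success, -1 on failure. */
int buffered_log_open(buffered_log *bl, const char *path) {
    assert(bl != NULL && path != NULL);
    bl->fp = fopen(path, "a");
    bl->used = 0;
    bl->flushes = 0;
    return bl->fp ? 0 : -1;
}

/* Writes pending bytes to the file. If the file is not open,
 * the pending bytes are discarded so the buffer cannot overflow. */
void buffered_log_flush(buffered_log *bl) {
    if (bl->used == 0) return;
    if (bl->fp != NULL) {
        fwrite(bl->buf, 1, bl->used, bl->fp);
        fflush(bl->fp);
        bl->flushes++;
    }
    bl->used = 0;
}

/* Formats one line (truncated to 511 bytes) and appends it to the buffer,
 * flushing first if the line would not fit. */
void buffered_log_write(buffered_log *bl, const char *fmt, ...) {
    char line[512];
    va_list ap;
    int n;
    size_t len;

    va_start(ap, fmt);
    n = vsnprintf(line, sizeof(line), fmt, ap);
    va_end(ap);
    if (n < 0) return;

    len = (size_t)n < sizeof(line) ? (size_t)n : sizeof(line) - 1;
    if (bl->used + len + 1 > sizeof(bl->buf)) buffered_log_flush(bl);

    memcpy(bl->buf + bl->used, line, len);
    bl->used += len;
    bl->buf[bl->used++] = '\n';
}

/* Flushes any pending bytes and closes the file. */
void buffered_log_close(buffered_log *bl) {
    buffered_log_flush(bl);
    if (bl->fp != NULL) fclose(bl->fp);
    bl->fp = NULL;
}
```

The obvious trade-off is that buffered lines are delayed and would be lost on a crash, so a real version would likely also flush on a timer and bypass the buffer for warnings.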

May 13 '25 22:05 madolson

For discussion about keeping the log fd open, see #906.

May 19 '25 20:05 zuiderkwast

I think LTTng is the way to go for generic logging when possible. #2135

Jun 08 '25 22:06 PingXie