Logs consuming significant compute during cluster node failure detection
The problem/use-case that the feature addresses
This issue is to discuss the server log emitted during cluster failover in large clusters. It looks like this particular log takes about 3-4% of the total compute, at a time when CPU usage is already around 100%.
serverLog(LL_NOTICE, "Node %.40s (%s) reported node %.40s (%s) as not reachable.", sender->name,
sender->human_nodename, node->name, node->human_nodename);
Sharing the profile over here:
Description of the feature
I would like to discuss whether we can reduce the severity of this log.
One of the issues is that we are constantly opening and closing the log file for each write. There might be ways to buffer the write specifically for this.
For discussion about keeping the log fd open, see
- #906
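To make the "keep the log fd open" idea concrete, here is a minimal standalone sketch. It is not the server's logging code: `cached_log` and its functions are hypothetical names invented for illustration. It assumes a single-threaded writer and relies on stdio line buffering, so each log line still reaches the file when the newline is written, but without an open/close pair per message.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical logger state for illustration only; not the server's API. */
typedef struct {
    FILE *fp;        /* stays open across writes */
    char path[256];  /* remembered so the file can be reopened later */
} cached_log;

/* Open the log once in append mode and switch the stream to line buffering. */
static int cached_log_open(cached_log *l, const char *path) {
    l->fp = NULL;
    if (strlen(path) >= sizeof(l->path)) return -1;
    strcpy(l->path, path);
    l->fp = fopen(path, "a");
    if (l->fp == NULL) return -1;
    /* Line buffered: stdio flushes when a newline is written. */
    setvbuf(l->fp, NULL, _IOLBF, BUFSIZ);
    return 0;
}

/* Append one line using the already-open stream (no fopen/fclose here). */
static int cached_log_write(cached_log *l, const char *msg) {
    if (l->fp == NULL) return -1;
    if (fputs(msg, l->fp) == EOF) return -1;
    if (fputc('\n', l->fp) == EOF) return -1;
    return 0;
}

/* Close and reopen the same path, e.g. after an external tool rotated
 * the file. How and when to trigger this is left open in this sketch. */
static int cached_log_reopen(cached_log *l) {
    if (l->fp != NULL) fclose(l->fp);
    l->fp = fopen(l->path, "a");
    if (l->fp == NULL) return -1;
    setvbuf(l->fp, NULL, _IOLBF, BUFSIZ);
    return 0;
}

/* Flush pending data and release the stream. */
static void cached_log_close(cached_log *l) {
    if (l->fp != NULL) {
        fclose(l->fp);
        l->fp = NULL;
    }
}
```

The trade-off is log rotation: if the file is renamed or removed externally, a cached handle keeps writing to the old inode until something like `cached_log_reopen` is called, which is one reason opening per write can be attractive despite the cost.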
I think LTTng is the way to go for generic logging when possible. #2135