[Feat]: RHEL 7.9 kernel versions < 3.10.0-1062 are buggy and incompatible with Netdata

Open ktsaou opened this issue 11 months ago • 2 comments

Problem

We found that Red Hat Enterprise Linux 7.9 running any of the following kernel versions has a bug that corrupts outbound sockets after some time.

  • 3.10.0-693
  • 3.10.0-862
  • 3.10.0-957

RHEL 7.9 with kernel version 3.10.0-1062 or later is not affected, so we concluded that all versions below 3.10.0-1062 are likely affected by the kernel bug.
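To make the threshold concrete, here is a minimal Python sketch of the comparison, assuming the kernel release string has the usual RHEL 7 form printed by `uname -r` (for example `3.10.0-957.el7.x86_64`). The names `FIRST_GOOD_BUILD`, `kernel_build` and `is_likely_affected` are illustrative only and are not part of Netdata.

```python
import re

# First RHEL 7 kernel build reported as unaffected in this issue.
FIRST_GOOD_BUILD = (3, 10, 0, 1062)


def kernel_build(release: str) -> tuple:
    """Extract (major, minor, patch, build) from e.g. '3.10.0-957.el7.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(g) for g in m.groups())


def is_likely_affected(release: str) -> bool:
    """True when the build is below 3.10.0-1062.

    Only meaningful for RHEL 7.9 hosts; the OS check is left to the caller.
    """
    return kernel_build(release) < FIRST_GOOD_BUILD
```

Comparing the numeric build as an integer matters here: a plain string comparison would rank `957` above `1062`.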

What is the kernel bug

The kernel bug is briefly described in the Red Hat advisory https://access.redhat.com/errata/RHSA-2019:0512 as:

kernel: Memory corruption due to incorrect socket cloning (CVE-2018-9568)

The effect for Netdata is that outbound connections stall. The Netdata agent detects this and re-opens the socket, but the destination of the socket (Netdata Parent, Netdata Cloud) sees the same agent attempting to connect twice.

  • For Netdata Parents, this results in delayed reconnections, constant re-initiation of replication, and missing nodes and charts.
  • For Netdata Cloud, this results in delays before the agent is allowed to reconnect, delays in rendering dashboards, and missing nodes and charts on the dashboards.

How can users see the problem

The problem is visible in the logs of both Netdata agents and parents: both should log frequent reconnects.

Users with Netdata Parents can see this problem in the Netdata / Agent / Streaming section of the Parent's dashboard:

Image

The first chart above is available on all agent versions above 2.2.1. When the problem is present, this chart constantly shows agents in replicating and stale disconnected.

The second chart is available only when [pulse].extended = yes is set in netdata.conf. When the problem is present, this chart shows regular non-zero values for disconnected stale receiver and already connected.
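For reference, the [pulse].extended setting would look roughly like this in netdata.conf, assuming the usual section and key layout of that file:

```
[pulse]
    extended = yes
```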

Solution

We think we should proactively inform users about this, similar to what we do for obsolete versions of the agents (the red ribbon at the top of the dashboard).

Image

Informing users should allow them to deal with the problem proactively, instead of thinking that Netdata is problematic in some sense and does not work properly.

Description

as above

Importance

really want

Value proposition

as described

Proposed implementation

as described

ktsaou avatar Jan 28 '25 13:01 ktsaou

cc @ralphm

ktsaou avatar Jan 28 '25 13:01 ktsaou

So, I'm going to check for RHEL 7.9 (not other versions) with a kernel version below 3.10.0-1062 and, if found, show the banner with a message like this (it contains the link to the bug report):

Image

I guess we don't want to check offline nodes, but we do want to check stale nodes, right?

@netdata/product

kapantzak avatar Feb 06 '25 10:02 kapantzak
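The check proposed in the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Node` shape, its field names, the OS name string and the state names ("live", "stale", "offline") are all hypothetical and not taken from the Netdata Cloud API. Whether stale nodes are included is still an open question in this thread, so it is left as a parameter.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node record; not the real Netdata Cloud schema."""
    name: str
    os_name: str
    os_version: str
    kernel: str
    state: str  # assumed values: "live", "stale" or "offline"


RHEL_NAME = "Red Hat Enterprise Linux"  # assumed spelling of the OS name field
FIRST_GOOD_BUILD = 1062                 # from this issue: 3.10.0-1062 is unaffected


def rhel7_build(kernel: str):
    """Return the build number of a 3.10.0-<build> kernel, or None if it does not match."""
    m = re.match(r"3\.10\.0-(\d+)", kernel)
    return int(m.group(1)) if m else None


def nodes_to_warn(nodes, checked_states=("live", "stale")):
    """Names of RHEL 7.9 nodes in the checked states whose kernel is below 3.10.0-1062."""
    hits = []
    for node in nodes:
        if node.state not in checked_states:
            continue  # e.g. offline nodes are skipped
        if node.os_name != RHEL_NAME or node.os_version != "7.9":
            continue  # only RHEL 7.9 is checked, not other versions
        build = rhel7_build(node.kernel)
        if build is not None and build < FIRST_GOOD_BUILD:
            hits.append(node.name)
    return hits
```

The banner would then be shown whenever `nodes_to_warn(...)` returns a non-empty list. Passing `checked_states=("live",)` would exclude stale nodes as well.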