[Feat]: RHEL 7.9 kernel versions < 3.10.0-1062 are buggy and incompatible with Netdata
Problem
We found that Red Hat Enterprise Linux 7.9 running any of the following kernel versions hits a kernel bug that corrupts outbound sockets after some time.
- 3.10.0-693
- 3.10.0-862
- 3.10.0-957
RHEL 7.9 with kernel version 3.10.0-1062 or later is not affected, so we concluded that all versions below 3.10.0-1062 are likely affected by the kernel bug.
What is the kernel bug
The kernel bug is briefly described in the Red Hat advisory https://access.redhat.com/errata/RHSA-2019:0512 as:
kernel: Memory corruption due to incorrect socket cloning (CVE-2018-9568)
The effect for Netdata is that outbound connections stall. The Netdata agent detects this and re-opens the socket, but the destination of the socket (Netdata Parent or Netdata Cloud) then sees the same agent attempting to connect twice.
- For Netdata Parents, this results in delayed reconnection, constant re-initiation of replication, and missing nodes and charts.
- For Netdata Cloud, this results in delays before the agent is allowed to reconnect, delays in rendering dashboards, and missing nodes and charts on the dashboards.
How can users see the problem
The problem is logged by both Netdata agents and parents: both should log frequent reconnects.
Users using Netdata Parents can view this problem in the Netdata / Agent / Streaming section of the dashboard (of the Parent):
The first chart above is available on all agent versions above 2.2.1. When the problem occurs, this chart constantly shows agents as replicating and stale disconnected.
The second chart is available only when [pulse].extended = yes is set in netdata.conf. When the problem occurs, this chart shows regular non-zero values for disconnected stale receiver and already connected.
Solution
We think we should proactively inform users about this, similar to what we do for obsolete versions of the agents (the red ribbon at the top of the dashboard).
Informing users should allow them to deal with the problem proactively, instead of thinking that Netdata itself is faulty and does not work properly.
Description
as above
Importance
really want
Value proposition
as described
Proposed implementation
as described
cc @ralphm
So, I'm going to check for RHEL 7.9 (not other versions) with a kernel version below 3.10.0-1062 and, if found, show the banner with a message like this (it contains a link to the bug report):
I guess we don't want to check offline nodes, but we do want to check stale nodes, right?
@netdata/product
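For discussion, the check proposed above could be sketched roughly as follows. This is only an illustration, not the actual dashboard code: the node field names (`os_id`, `os_version`, `kernel_version`, `state`) and the helper names are hypothetical, and skipping offline nodes while still checking stale ones is an assumption taken from the open question above.

```python
import re
from typing import Optional, Tuple

# First kernel build reported as unaffected in this issue.
FIXED_KERNEL = (3, 10, 0, 1062)


def parse_kernel_release(release: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch, build) from e.g. '3.10.0-957.el7.x86_64'."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        return None  # unknown format: do not guess
    return tuple(int(part) for part in match.groups())


def is_affected(node: dict) -> bool:
    """Return True if the banner should be shown for this node.

    `node` is a hypothetical dict of node metadata; the real field
    names in Netdata may differ.
    """
    # Assumption: offline nodes are skipped, stale nodes are still checked.
    if node.get("state") == "offline":
        return False
    # Only RHEL 7.9 is checked, not other RHEL versions.
    if node.get("os_id") != "rhel" or node.get("os_version") != "7.9":
        return False
    kernel = parse_kernel_release(node.get("kernel_version", ""))
    # Tuples compare element by element, so 957 < 1062 is a numeric comparison.
    return kernel is not None and kernel < FIXED_KERNEL
```

Comparing parsed integer tuples rather than raw strings matters here, because a plain string comparison would rank "3.10.0-957" above "3.10.0-1062".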