netdata
netdata copied to clipboard
[Bug]: Servers appear stale or offline, but are in fact online and working
Bug description
It seems as if the netdata agent dies unexspected and does not recover. If the netdata process is monitored and automatically restarted with systemctl restart netdata, the server again appears as online in the dashboard.
This produces false "server unreachable" alerts.
Expected behavior
Netdata agent should never stop by itself.
Steps to reproduce
- Install the netdata agent via the dashboard supplied script without any modifications.
Installation method
kickstart.sh
System info
Linux target 5.4.0-105-generic #119-Ubuntu SMP Mon Mar 7 18:49:24 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04.6 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04.6 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal
Netdata build info
Packaging:
Netdata Version ____________________________________________ : v1.44.3
Installation Type __________________________________________ : kickstart-static
Package Architecture _______________________________________ : x86_64
Package Distro _____________________________________________ : unknown
Configure Options __________________________________________ : '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-math' '--with-user=netdata' '--disable-exporting-mongodb' '--enable-cloud' '--without-bundled-protobuf' '--disable-dependency-tracking' '--enable-lto' 'CFLAGS=-ffunction-sections -fdata-sections -static -O2 -funroll-loops -I/openssl-static/include -I/libnetfilter-acct-static/include/libnetfilter_acct -I/curl-local/include/curl -I/usr/include/libmnl -pipe' 'LDFLAGS=-Wl,--gc-sections -static -L/openssl-static/lib64 -L/libnetfilter-acct-static/lib -lnetfilter_acct -L/usr/lib -lmnl -L/usr/lib -lzstd -L/curl-local/lib' 'PKG_CONFIG=pkg-config --static' 'PKG_CONFIG_PATH=/openssl-static/lib64/pkgconfig:/libnetfilter-acct-static/lib/pkgconfig:/usr/lib/pkgconfig:/curl-local/lib/pkgconfig'
Default Directories:
User Configurations ________________________________________ : /opt/netdata/etc/netdata
Stock Configurations _______________________________________ : /opt/netdata/usr/lib/netdata/conf.d
Ephemeral Databases (metrics data, metadata) _______________ : /opt/netdata/var/cache/netdata
Permanent Databases ________________________________________ : /opt/netdata/var/lib/netdata
Plugins ____________________________________________________ : /opt/netdata/usr/libexec/netdata/plugins.d
Static Web Files ___________________________________________ : /opt/netdata/usr/share/netdata/web
Log Files __________________________________________________ : /opt/netdata/var/log/netdata
Lock Files _________________________________________________ : /opt/netdata/var/lib/netdata/lock
Home _______________________________________________________ : /opt/netdata/var/lib/netdata
Operating System:
Kernel _____________________________________________________ : Linux
Kernel Version _____________________________________________ : 5.4.0-105-generic
Operating System ___________________________________________ : Ubuntu
Operating System ID ________________________________________ : ubuntu
Operating System ID Like ___________________________________ : debian
Operating System Version ___________________________________ : 20.04.6 LTS (Focal Fossa)
Operating System Version ID ________________________________ : none
Detection __________________________________________________ : /etc/os-release
Hardware:
CPU Cores __________________________________________________ : 4
CPU Frequency ______________________________________________ : 3400000000
RAM Bytes __________________________________________________ : 16629256192
Disk Capacity ______________________________________________ : 5000991916032
CPU Architecture ___________________________________________ : x86_64
Virtualization Technology __________________________________ : none
Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
Container __________________________________________________ : none
Container Detection ________________________________________ : systemd-detect-virt
Container Orchestrator _____________________________________ : none
Container Operating System _________________________________ : none
Container Operating System ID ______________________________ : none
Container Operating System ID Like _________________________ : none
Container Operating System Version _________________________ : none
Container Operating System Version ID ______________________ : none
Container Operating System Detection _______________________ : none
Features:
Built For __________________________________________________ : Linux
Netdata Cloud ______________________________________________ : YES
Health (trigger alerts and send notifications) _____________ : YES
Streaming (stream metrics to parent Netdata servers) _______ : YES
Back-filling (of higher database tiers) ____________________ : YES
Replication (fill the gaps of parent Netdata servers) ______ : YES
Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
Contexts (index all active and archived metrics) ___________ : YES
Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
Machine Learning ___________________________________________ : YES
Database Engines:
dbengine ___________________________________________________ : YES
alloc ______________________________________________________ : YES
ram ________________________________________________________ : YES
map ________________________________________________________ : YES
save _______________________________________________________ : YES
none _______________________________________________________ : YES
Connectivity Capabilities:
ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
static (Netdata internal web server) _______________________ : YES
h2o (web server) ___________________________________________ : YES
WebRTC (experimental) ______________________________________ : NO
Native HTTPS (TLS Support) _________________________________ : YES
TLS Host Verification ______________________________________ : YES
Libraries:
LZ4 (extremely fast lossless compression algorithm) ________ : YES
ZSTD (fast, lossless compression algorithm) ________________ : YES
zlib (lossless data-compression library) ___________________ : YES
Judy (high-performance dynamic arrays and hashtables) ______ : YES (bundled)
dlib (robust machine learning toolkit) _____________________ : YES (bundled)
protobuf (platform-neutral data serialization protocol) ____ : YES (system)
OpenSSL (cryptography) _____________________________________ : YES
libdatachannel (stand-alone WebRTC data channels) __________ : NO
JSON-C (lightweight JSON manipulation) _____________________ : YES
libcap (Linux capabilities system operations) ______________ : NO
libcrypto (cryptographic functions) ________________________ : YES
libm (mathematical functions) ______________________________ : YES
jemalloc ___________________________________________________ : NO
TCMalloc ___________________________________________________ : NO
Plugins:
apps (monitor processes) ___________________________________ : YES
cgroups (monitor containers and VMs) _______________________ : YES
cgroup-network (associate interfaces to CGROUPS) ___________ : YES
proc (monitor Linux systems) _______________________________ : YES
tc (monitor Linux network QoS) _____________________________ : YES
diskspace (monitor Linux mount points) _____________________ : YES
freebsd (monitor FreeBSD systems) __________________________ : NO
macos (monitor MacOS systems) ______________________________ : NO
statsd (collect custom application metrics) ________________ : YES
timex (check system clock synchronization) _________________ : YES
idlejitter (check system latency and jitter) _______________ : YES
bash (support shell data collection jobs - charts.d) _______ : YES
debugfs (kernel debugging metrics) _________________________ : YES
cups (monitor printers and print jobs) _____________________ : NO
ebpf (monitor system calls) ________________________________ : YES
freeipmi (monitor enterprise server H/W) ___________________ : NO
nfacct (gather netfilter accounting) _______________________ : YES
perf (collect kernel performance events) ___________________ : YES
slabinfo (monitor kernel object caching) ___________________ : YES
Xen ________________________________________________________ : NO
Xen VBD Error Tracking _____________________________________ : NO
Logs Management ____________________________________________ : YES
Exporters:
AWS Kinesis ________________________________________________ : NO
GCP PubSub _________________________________________________ : NO
MongoDB ____________________________________________________ : NO
Prometheus (OpenMetrics) Exporter __________________________ : YES
Prometheus Remote Write ____________________________________ : YES
Graphite ___________________________________________________ : YES
Graphite HTTP / HTTPS ______________________________________ : YES
JSON _______________________________________________________ : YES
JSON HTTP / HTTPS __________________________________________ : YES
OpenTSDB ___________________________________________________ : YES
OpenTSDB HTTP / HTTPS ______________________________________ : YES
All Metrics API ____________________________________________ : YES
Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
Trace All Netdata Allocations (with charts) ________________ : NO
Developer Mode (more runtime checks, slower) _______________ : NO
Additional info
No response
@CRPrinzler to confirm on this case, the issue you are reporting is this? "netdata agent dies unexspected and does not recover." the bug title mentions "Servers appear stale or offline, but are in fact online and working" which based on the discussion on Discord was sorted with restarting the agent but this is after they had died unexpectedly?
On the, reachability notifications being triggered this is an after-effect of the above right?
it just seems weird because netdata agent seems to lose all connection and vanishes. System Status monitor, which I have set up, then automatically restarts via systemctl restart netdata.
@CRPrinzler could you look into the logs when this happens and see if something there could help give some hint into what is happening?
also, would it be possible to test this on the latest stable version v1.45.1
? there have been a lot of bug fixes released
@CRPrinzler were you able to try this using the latest stable version?