netdata
[Bug]: Stream compression - Compressor buffer overflow causes stream corruption (limit 16384 bytes)
Bug description
A compressor buffer overflow occurs when the message being compressed exceeds 16384 bytes. When the compressor's buffer is full, the compressor function reports an error and simply skips the transmission of the data. The impact on the parent depends on how important the untransmitted information is.
- In the production environment where the bug was found, the parent's error.log file reported the following error message,
STREAM_RECEIVER[gke-production-main-xxxx-xxxx, [0.0.0.0]:0000] : requested a BEGIN on chart 'k8s_kubelet.kubelet_pods_log_filesystem_used_bytes', which does not exist on host 'gke-production-main-xxxx-xxxx'. Disabling it. (errno 22, Invalid argument)
This means that the parent received a BEGIN command for an unknown chart, which caused a parser error that would result in a reconnection. The chart definition was probably never streamed to the parent because the k8s_kubelet.kubelet_pods_log_filesystem_used_bytes chart, together with its 12 dimensions, seemed to exceed the size of 16 kB.
- When trying to reproduce the same behavior with a go.d example plugin and one chart with 1000 dimensions, the compressor buffer overflow was detected, and the stream corruption showed up as the following error message being reported continuously,
Compression error - data discarded
Message size above limit:
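The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `send` and `receive` functions, the simplified `CHART`/`DIMENSION`/`BEGIN` messages, and the `LIMIT` constant are hypothetical and are not taken from the Netdata code base. It shows how silently skipping one oversized chart definition leaves the receiver with a BEGIN for a chart it has never seen.

```python
LIMIT = 16 * 1024  # assumption: stands in for the roughly 16 kB limit from the report


def send(messages, limit=LIMIT):
    """Hypothetical sender: silently skips any message larger than `limit` bytes."""
    return [m for m in messages if len(m.encode()) <= limit]


def receive(messages):
    """Hypothetical receiver: returns the charts that got a BEGIN before any definition."""
    known, errors = set(), []
    for m in messages:
        parts = m.split()
        keyword, chart = parts[0], parts[1]
        if keyword == "CHART":
            known.add(chart)
        elif keyword == "BEGIN" and chart not in known:
            errors.append(chart)
    return errors


# One chart definition with many dimensions, large enough to exceed LIMIT.
big_definition = "CHART big.chart\n" + "\n".join(f"DIMENSION dim{i}" for i in range(2000))
stream = [big_definition, "BEGIN big.chart", "CHART small.chart", "BEGIN small.chart"]

print(receive(send(stream)))  # charts the receiver would complain about
```

In this model only the oversized definition is lost, yet every later BEGIN for that chart becomes an error on the receiving side, which matches the symptom seen in the parent's log.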
Credits to @stelfrag and @MrZammler for reporting and helping to identify this issue.
Expected behavior
The stream between parent <-> child must never be corrupted. Possible solutions include,
- Maintain the stream between parent <-> child and downgrade to protocol version 4.
- Increase the compressor buffer size to increase robustness of stream compression.
- Split the messages into smaller blocks that fit the compressor's buffer.
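The last option can be sketched roughly as follows. This is a minimal illustration, not the actual sender code: `split_on_lines` is a hypothetical helper, and the `DIMENSION dimN` lines are simplified placeholders. The idea is to group whole lines into blocks that each stay under a byte limit, so that no single command is cut in half.

```python
def split_on_lines(message: str, max_bytes: int) -> list[str]:
    """Group whole lines into blocks of at most `max_bytes` (UTF-8 encoded) each."""
    blocks, current, size = [], [], 0
    for line in message.splitlines(keepends=True):
        n = len(line.encode())
        if n > max_bytes:
            # A single line that cannot fit would need a different strategy.
            raise ValueError("a single line exceeds the block limit")
        if current and size + n > max_bytes:
            blocks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        blocks.append("".join(current))
    return blocks


# Placeholder payload: 1000 simplified dimension lines.
definition = "".join(f"DIMENSION dim{i}\n" for i in range(1000))
blocks = split_on_lines(definition, 16 * 1024)
print(len(blocks), [len(b.encode()) for b in blocks])
```

Splitting on line boundaries is one possible design choice; splitting at arbitrary byte offsets would also work as long as the receiver reassembles the blocks before parsing.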
Steps to reproduce
- Set up a simple parent <-> child connection with the master branch.
- Enable stream compression in the stream.conf file for both agents.
[stream]
enable compression = yes
In the child Netdata agent,
- cd into /etc/netdata and run sudo ./edit-config go.d.conf.
- Enable the example go-plugin
# dockerhub: yes
# elasticsearch: yes
example: yes
# filecheck: yes
# fluentd: yes
- Create a chart with many dimensions in sudo ./edit-config go.d/example.conf
jobs:
  - name: stress
    charts:
      num: 2
      dimensions: 300
- Restart both agents
- Look in the child error.log for the message,
Compression error - data discarded
Message size above limit:
- The child <-> parent stream should now be corrupted.
Installation method
from source
System info
Linux server2 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04.3 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04.3 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal
Netdata build info
Version: netdata v1.32.1-114
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK Next Generation: YES
ACLK-NG New Cloud Protocol: YES
ACLK Legacy: NO
TLS Host Verification: YES
Machine Learning: YES
Stream Compression: YES
Libraries:
protobuf: NO
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: NO
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: NO
Additional info
No response
@odynik please don't forget to add the Netdata Agent version to the "Netdata build info" section. It is expected to have the whole "buildinfo" output in there.
Done. Thanks for spotting this @ilyam8. From now on, I will.
Update
The current working solution is solution 1: maintain the stream between parent <-> child and downgrade to protocol version 4 for this release.
If the sender thread experiences a compressor buffer overflow, it will deactivate stream compression and re-establish a fresh link with protocol version 4.
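The fallback behavior described here could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Sender` class, its attributes, the starting protocol version of 5, and the use of `zlib` from the Python standard library as a stand-in for the real compressor. It is not the Netdata implementation.

```python
import zlib


class Sender:
    """Hypothetical sender that falls back to uncompressed streaming on overflow."""

    def __init__(self, limit: int):
        self.limit = limit
        self.compression = True
        self.protocol_version = 5  # assumption: some version above 4 that supports compression
        self.reconnects = 0

    def send(self, message: bytes) -> bytes:
        if self.compression:
            if len(message) <= self.limit:
                return zlib.compress(message)
            # Overflow: disable compression and start over on the older protocol.
            self.compression = False
            self.protocol_version = 4
            self.reconnects += 1
        # From here on, data goes out uncompressed.
        return message


sender = Sender(limit=16 * 1024)
sender.send(b"x" * 100)      # small message, still compressed
sender.send(b"y" * 20000)    # too large, triggers the fallback
print(sender.compression, sender.protocol_version, sender.reconnects)
```

The trade-off in this approach is that a single oversized message costs the link its compression, but the data itself is no longer dropped.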
I have tested solution 2 (increase the compressor buffer size to increase robustness of stream compression), and it seems that changing the compressor buffer size is not that simple. LZ4 streaming works with a 64 KB block size, so the sender's build buffer needs to be split into smaller chunks that are then fed to the compressor.
Solution 3 (split the messages into smaller blocks that fit the compressor's buffer) will follow as an improvement, tracked in a new issue. In this solution, the sender's build buffer should be split into blocks that are compatible with the LZ4 stream buffer constraints.
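As a rough illustration of feeding a streaming compressor in bounded blocks, the sketch below slices a payload into pieces of at most 64 KB and pushes each one through a single compressor object. Note the assumptions: it uses `zlib` from the Python standard library as a stand-in, not LZ4, and `compress_in_blocks` and `decompress_blocks` are hypothetical helper names. The real constraints of the LZ4 streaming API are not modeled here.

```python
import zlib

BLOCK = 64 * 1024  # the block size mentioned in the update


def compress_in_blocks(payload: bytes, block_size: int = BLOCK) -> list[bytes]:
    """Feed `payload` to one streaming compressor in slices of at most `block_size`."""
    compressor = zlib.compressobj()
    out = []
    for start in range(0, len(payload), block_size):
        chunk = payload[start:start + block_size]
        # Z_SYNC_FLUSH emits everything buffered so far, so the receiver
        # can decode each block as soon as it arrives.
        out.append(compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH))
    return out


def decompress_blocks(blocks: list[bytes]) -> bytes:
    """Decode the blocks in order with one streaming decompressor."""
    decompressor = zlib.decompressobj()
    return b"".join(decompressor.decompress(b) for b in blocks)


payload = b"DIMENSION dim\n" * 20000  # placeholder data, larger than one block
blocks = compress_in_blocks(payload)
print(len(blocks), decompress_blocks(blocks) == payload)
```

Keeping one compressor and one decompressor alive across blocks lets later blocks reuse the history of earlier ones, which is the general idea behind streaming compression.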