
[Bug]: Stream compression - Compressor buffer overflow causes a stream corruption (limit 16384 bytes)

Open odynik opened this issue 3 years ago • 3 comments

Bug description

A compressor buffer overflow occurs when the message under compression exceeds 16384 bytes. When the compressor's buffer is full, the compressor function reports an error and simply skips transmitting the data. The impact of this untransmitted data on the parent depends on the importance of the information that was dropped.

  1. In the production environment, the parent's error.log reported the following error message: STREAM_RECEIVER[gke-production-main-xxxx-xxxx, [0.0.0.0]:0000] : requested a BEGIN on chart 'k8s_kubelet.kubelet_pods_log_filesystem_used_bytes', which does not exist on host 'gke-production-main-xxxx-xxxx'. Disabling it. (errno 22, Invalid argument). This means the parent received a BEGIN command for an 'unknown' chart, which caused a parser error and resulted in a reconnection. The chart definition was probably never streamed to the parent, because the k8s_kubelet.kubelet_pods_log_filesystem_used_bytes chart with its 12 dimensions appears to exceed the 16KB limit.

  2. When trying to reproduce the same behavior with the go.d example plugin and one chart with 1000 dimensions, the compressor buffer overflow occurred and the stream corruption was confirmed by the continuous reporting of the following error messages,

Compression error - data discarded
Message size above limit:

Credits to @stelfrag and @MrZammler for reporting and helping to identify this issue.

Expected behavior

Definitely do not corrupt the stream between parent <-> child. Possible solutions include:

  1. Maintain the stream between parent <-> child and downgrade to protocol version 4.
  2. Increase the compressor buffer size to make stream compression more robust.
  3. Split the messages into smaller blocks that fit the compressor's buffer.

Steps to reproduce

  1. Set-up a simple parent <-> child connection with the master branch.
  2. Enable stream compression in the stream.conf file for both agents.
[stream]
enable compression = yes

In the child Netdata agent,

  1. cd in /etc/netdata and run sudo ./edit-config go.d.conf.
  2. Enable the example go.d plugin,
#  dockerhub: yes
#  elasticsearch: yes
  example: yes
#  filecheck: yes
#  fluentd: yes
  3. Create a chart with many dimensions in sudo ./edit-config go.d/example.conf
jobs:
  - name: stress
    charts:
      num: 2
      dimensions: 300

  4. Restart both agents.
  5. Look in the child's error.log for the message,
Compression error - data discarded
Message size above limit:
  6. The child <-> parent stream should now be corrupted.

Installation method

from source

System info

Linux server2 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04.3 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04.3 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal

Netdata build info

Version: netdata v1.32.1-114
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                NO
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: NO

Additional info

No response

odynik avatar Jan 20 '22 20:01 odynik

@odynik please don't forget to add the Netdata Agent version to the "Netdata build info" section. It is expected to have the whole "buildinfo" output in there.

ilyam8 avatar Jan 20 '22 20:01 ilyam8

> @odynik please don't forget to add the Netdata Agent version to the "Netdata build info" section. It is expected to have the whole "buildinfo" output in there.

Done. Thanks for spotting this @ilyam8. From now on, I will.

odynik avatar Jan 20 '22 21:01 odynik

Update

Current working solution - 1. Maintain the stream between parent <-> child and downgrade to protocol version 4 for this release.

If the sender thread experiences a compressor buffer overflow, it will deactivate stream compression and re-establish a fresh link using protocol version 4.

I have tested solution 2 (increase the compressor buffer size to make stream compression more robust), and it turns out changing the compressor buffer size is not that simple: the LZ4 streaming API works with a 64KB block size, so the sender's build buffer would need to be split into smaller chunks to feed the compressor.

Solution 3 (split the messages into smaller blocks that fit the compressor's buffer) will follow up as an improvement in a new issue. In this solution, the sender's build buffer should be split into blocks compatible with the LZ4 stream buffer constraints.

odynik avatar Jan 21 '22 14:01 odynik