[error] [opentelemetry] snappy decompression failed issue
Bug Report
Describe the bug
Actual Behavior: Fluent Bit throws a "snappy decompression failed" error, and metrics are missing in Grafana, likely because the incoming data fails to decompress.
To Reproduce
fluent-bit pod
The following error is logged at random times:
[error] [opentelemetry] snappy decompression failed
Screenshots
Any/all queries in Grafana return no metrics.
Your Environment
- prometheus-server: latest
- fluent-bit: latest
- opentelemetry-collector: latest
configmap

[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    debug
    Config_Watch On
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020
    Health_Check On

[INPUT]
    name   prometheus_remote_write
    listen 0.0.0.0
    port   8080

[OUTPUT]
    name   stdout
    match  *

[OUTPUT]
    Name   opentelemetry
    Match  *
    Host   opentelemetry-collector.default.svc.cluster.local
    Port   4318
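A minimal sketch for exercising the prometheus_remote_write input above from outside the cluster: it POSTs a snappy block-compressed body to 127.0.0.1:8080 (the address from the [INPUT] section). The /api/v1/write path and the placeholder body are assumptions, not taken from this report; even a successful snappy decode should then fail at the protobuf stage, so the only question being tested is whether the decompression step accepts a well-formed block-format payload.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/golang/snappy"
)

func main() {
	// Prometheus remote write uses snappy *block* format (snappy.Encode),
	// not the framed/stream format. The body below is a placeholder, not a
	// real prompb.WriteRequest, so a protobuf error afterwards is expected.
	raw := []byte("placeholder payload, not a real WriteRequest")
	compressed := snappy.Encode(nil, raw)

	// Host/port come from the [INPUT] section above; the /api/v1/write path
	// is an assumption (the conventional remote-write path), not confirmed
	// against the plugin's defaults.
	req, err := http.NewRequest(http.MethodPost,
		"http://127.0.0.1:8080/api/v1/write", bytes.NewReader(compressed))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```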
Additional context
Request: Could you please help investigate the root cause of the "snappy decompression failed" error? If there are any configuration changes or updates to the OpenTelemetry exporter or Fluent Bit that might resolve this issue, that would also be helpful.
Please share your full config. If there is a way to get the payload that is generating the issue, that would be very helpful.
@edsiper Is there a specific file or configuration you need? All configurations are the defaults from the latest version of the Helm chart; the only modifications are the settings required for the integration.
configmap
prometheus

remote_write:
  - url: "http://fluent-bit-metric.default.svc.cluster.local:8080"
    write_relabel_configs:
      - action: drop
        regex: (~~~)
        source_labels: [__name__]
      - action: drop
        regex: kubernetes-apiservers
        source_labels: [job]
    queue_config:
      capacity: 4000             # default 2500
      max_shards: 50             # default 200
      min_shards: 10             # default 1
      max_samples_per_send: 2000 # default 500
      batch_send_deadline: 15s   # default 5s
      min_backoff: 30ms          # default 30ms
      max_backoff: 100ms         # default 100ms
    metadata_config:
      send_interval: 10s         # default 1m
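To capture the exact bodies Prometheus is sending (the payload requested above), the remote_write url could temporarily be pointed at a small dump server. This is only a sketch under that assumption; the server and the payload file names are hypothetical, not part of the original setup.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Accept any remote-write POST and dump the raw (still snappy-compressed)
	// body to a file, so the failing payload can be shared or inspected offline.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		name := fmt.Sprintf("payload-%d.snappy", time.Now().UnixNano())
		if err := os.WriteFile(name, body, 0o644); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Printf("wrote %s (%d bytes, Content-Encoding=%q)\n",
			name, len(body), r.Header.Get("Content-Encoding"))
		w.WriteHeader(http.StatusNoContent)
	})

	// Same port as the Fluent Bit input, so only the host part of the
	// remote_write url needs to change for the capture run.
	if err := http.ListenAndServe(":8080", nil); err != nil {
		panic(err)
	}
}
```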
fluent-bit: same configmap as above
opentelemetry

processors:
  batch: {}
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:14250
      thrift_compact:
        endpoint: ${env:MY_POD_IP}:6831
      thrift_http:
        endpoint: ${env:MY_POD_IP}:14268
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        endpoint: ${env:MY_POD_IP}:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: opentelemetry-collector
          scrape_interval: 10s
          static_configs:
            - targets:
                - ${env:MY_POD_IP}:8888
  zipkin:
    endpoint: ${env:MY_POD_IP}:9411
service:
  extensions:
    - health_check
  pipelines:
    logs:
      exporters:
        - debug
      processors:
        - memory_limiter
        - batch
      receivers:
        - otlp
    metrics:
      exporters:
        - debug
      processors:
        - memory_limiter
        - batch
      receivers:
        - otlp
        - prometheus
    traces:
      exporters:
        - debug
      processors:
        - memory_limiter
        - batch
      receivers:
        - otlp
        - jaeger
        - zipkin
  telemetry:
    metrics:
      address: ${env:MY_POD_IP}:8888
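Since the Fluent Bit [OUTPUT] forwards to this collector's OTLP/HTTP receiver on port 4318, a quick sanity check of that leg is to POST an empty OTLP JSON metrics payload to /v1/metrics. This sketch is not from the report and assumes the port has been made reachable locally, for example via a port-forward.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// An empty resourceMetrics list is a valid OTLP/HTTP JSON request body,
	// so a 2xx here only shows that the receiver on 4318 is up and parsing
	// requests, nothing more.
	body := []byte(`{"resourceMetrics":[]}`)

	// Assumes the collector's 4318 port has been forwarded to localhost.
	resp, err := http.Post("http://127.0.0.1:4318/v1/metrics",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```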
This symptom also appears in the latest version.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
That's a typo; the plugin that's actually failing is the prometheus_remote_write input plugin. Could you explain what the data source is so I can try to set it up locally?
@leonardo-albertovich For operational reasons, we are using Fluent Bit with the Prometheus remote write input plugin. The data sources being scraped are fairly standard: kube-state-metrics, node-exporter, and blackbox-exporter.
This issue does not occur in a normal setup; it only appears when Fluent Bit is in the path. To verify this, we ran the same configuration locally with Fluent Bit and were able to reproduce the issue.
Test scenario: Prometheus (remote write) -> Fluent Bit (input: prometheus_remote_write, output: prometheus) -> new Prometheus <- Grafana (data source)
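Given a captured body (for example from the dump-server sketch earlier), a quick offline check of whether it is valid snappy block format, which is what Prometheus remote write produces, might look like the following; the file argument is hypothetical.

```go
package main

import (
	"fmt"
	"os"

	"github.com/golang/snappy"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: decode-check <captured payload file>")
		os.Exit(1)
	}
	body, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}

	// Prometheus remote write compresses with the snappy block format, so a
	// valid capture should decode with snappy.Decode. If this fails, the
	// payload itself is malformed; if it succeeds, the failure is more likely
	// on the receiving side.
	decoded, err := snappy.Decode(nil, body)
	if err != nil {
		fmt.Printf("snappy block decode failed: %v (%d byte body)\n", err, len(body))
		os.Exit(1)
	}
	fmt.Printf("snappy block decode ok: %d -> %d bytes\n", len(body), len(decoded))
}
```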
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.