opentelemetry-collector
Suspected memory leak in batch processor
While using the batch processor for logs, the collector works fine for some time, but then memory suddenly begins rising exponentially, which looks like a memory leak. Nothing abnormal is recorded in the collector's debug logs when this occurs. It would be helpful if anyone could point to a possible cause for this in the batch processor.
Can you share your configuration file?
@fatsheep9146
apiVersion: v1
kind: ConfigMap
metadata:
  name: collector-config
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    exporters:
      googlecloud:
        retry_on_failure:
          enabled: true
        project: codenation-186008
        log:
          default_log_name: app
      logging:
        loglevel: debug
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 500
        spike_limit_mib: 100
      batch:
        send_batch_size: 50000
        timeout: 60s
      attributes:
        actions:
          - key: gcp.trace_sampled
            value: true
            action: upsert
      transform/1:
        logs:
          queries:
            - set(attributes["traceId"],trace_id.string)
            - set(attributes["service.instance.id"],resource.attributes["service.instance.id"])
            - set(attributes["service.name"],resource.attributes["service.name"])
            - set(attributes["k8s-pod/run"],resource.attributes["k8s-pod/run"])
            - set(attributes["k8s.cluster.name"],resource.attributes["k8s.cluster.name"])
            - set(attributes["host.name"],resource.attributes["host.name"])
            - set(attributes["container.id"],resource.attributes["container.id"])
            - set(attributes["cloud.region"],resource.attributes["cloud.region"])
      groupbyattrs:
        keys:
          - traceId
      transform/2:
        logs:
          queries:
            - keep_keys(resource.attributes, "")
      transform/3:
        logs:
          queries:
            - set(resource.attributes["severity"],"ERROR") where severity_text=="ERROR"
            - set(resource.attributes["severity"],"ERROR") where severity_text=="Error"
      filter:
        logs:
          include:
            match_type: strict
            resource_attributes:
              - Key: severity
                Value: ERROR
      tail_sampling:
        decision_wait: 20s
        policies:
          - name: error_otel_status
            type: status_code
            status_code:
              status_codes:
                - ERROR
          - name: error_http_status
            type: numeric_attribute
            numeric_attribute:
              key: http.status_code
              min_value: 400
    service:
      telemetry:
        logs:
          level: debug
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch, attributes, tail_sampling]
          exporters: [googlecloud]
        logs:
          receivers: [otlp]
          processors: [batch, attributes, transform/1, transform/2, groupbyattrs, transform/3, filter]
          exporters: [logging, googlecloud]
Why do you believe the problem is caused by the batch processor? Does removing it from the processors list solve the problem? Did you try moving it to the end?
@dmitryax yes, removing the batch processor solved the problem.
@davidgargti20 can you try moving it further down the list of processors and see if the issue goes away?
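For illustration, the suggested reordering of the logs pipeline might look roughly like the following, reusing only components already defined in the config above (the memory_limiter that is defined but never referenced in a pipeline could also be placed first, which is what the collector documentation recommends):

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, transform/1, transform/2, groupbyattrs, transform/3, filter, batch]
      exporters: [logging, googlecloud]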
@dmitryax it didn't go away; I tried that earlier.
While using the batch processor for logs, the collector works fine for some time, but then memory suddenly begins rising exponentially, which looks like a memory leak.
Does this mean that the memory rise only happens in the logs pipeline, not the metrics pipeline?
Does the amount of log input to the collector change rapidly when the memory rise happens?
BTW, what language SDK do you use to export log data to the collector over OTLP? As far as I know, the Go SDK does not implement logs yet.
I have encountered a similar issue. Not only the batch processor but also the hostmetrics receiver/scraperhelper seems to have a memory leak. I am using hostmetrics with a 0.1 second collection interval to exaggerate the issue.
See the attached pictures for my pprof heap results.
Could you show your collector config file? @gen-xu
@fatsheep9146
We had some secrets in the config, but the following should reproduce the unbounded increase in memory usage.
extensions:
  health_check:
  pprof:
    endpoint: "0.0.0.0:1777"
    block_profile_fraction: 3
    mutex_profile_fraction: 5
receivers:
  hostmetrics:
    collection_interval: 0.1s
    scrapers:
      cpu:
      load:
      memory:
      paging:
      process:
      processes:
      network:
      disk:
      filesystem:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
    timeout: 1s
exporters:
  kafka:
    brokers:
      - localhost:9092
    protocol_version: "3.0.0"
    producer:
      max_message_bytes: 10000000
      flush_max_messages: 16
    metadata:
      retry:
        max: 30
        backoff: 3s
  logging:
    loglevel: info
service:
  pipelines:
    logs:
      receivers:
        - otlp
      processors:
        - batch
      exporters:
        - kafka
        - logging
    traces:
      receivers:
        - otlp
      processors:
        - batch
      exporters:
        - kafka
        - logging
    metrics:
      receivers:
        - hostmetrics
        - otlp
      processors:
        - batch
      exporters:
        - kafka
        - logging
  extensions:
    - health_check
    - pprof
It is worth noting that there are many errored metrics; I'm not sure whether that could be causing some dangling objects.
Aug 15 03:02:08 ubuntu otelcol-contrib[3140977]: 2022-08-15T03:02:08.697Z error scraperhelper/scrapercontroller.go:197 Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "pipeline": "metrics", "error": "error reading process name for pid 2: readlink /proc/2/exe: no such file or directory; error reading process name for pid 3: readlink /proc/3/exe: no such file or directory; error reading process name for pid 4: readlink /proc/4/exe: no such file or directory; error reading process name for pid 5: readlink /proc/5/exe: no such file or directory; error reading process name for pid 7: readlink /proc/7/exe: no such file or directory; error reading process name for pid 9: readlink /proc/9/exe: no such file or directory; error reading process name for pid 10: readlink /proc/10/exe: no such file or directory;
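If those readlink failures are just noise from kernel threads and short-lived processes, it should be possible to rule them out as a factor by muting them in the process scraper. A minimal sketch, assuming the contrib build in use supports the mute_process_name_error option:

receivers:
  hostmetrics:
    collection_interval: 0.1s
    scrapers:
      process:
        # suppress "error reading process name" scrape errors (assumed option)
        mute_process_name_error: true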
And the collector's own metrics might also be helpful here:
# HELP otelcol_exporter_enqueue_failed_log_records Number of log records failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_log_records counter
otelcol_exporter_enqueue_failed_log_records{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_exporter_enqueue_failed_log_records{exporter="logging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
# HELP otelcol_exporter_enqueue_failed_metric_points Number of metric points failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_metric_points counter
otelcol_exporter_enqueue_failed_metric_points{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_exporter_enqueue_failed_metric_points{exporter="logging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
# HELP otelcol_exporter_enqueue_failed_spans Number of spans failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_spans counter
otelcol_exporter_enqueue_failed_spans{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_exporter_enqueue_failed_spans{exporter="logging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
# HELP otelcol_exporter_queue_capacity Fixed capacity of the retry queue (in batches)
# TYPE otelcol_exporter_queue_capacity gauge
otelcol_exporter_queue_capacity{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 5000
# HELP otelcol_exporter_queue_size Current size of the retry queue (in batches)
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 16
# HELP otelcol_exporter_sent_metric_points Number of metric points successfully sent to destination.
# TYPE otelcol_exporter_sent_metric_points counter
otelcol_exporter_sent_metric_points{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 2.167837e+06
otelcol_exporter_sent_metric_points{exporter="logging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 2.400797e+06
# HELP otelcol_process_cpu_seconds Total CPU user and system time in seconds
# TYPE otelcol_process_cpu_seconds counter
otelcol_process_cpu_seconds{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 151.60000000000002
# HELP otelcol_process_memory_rss Total physical memory (resident set size)
# TYPE otelcol_process_memory_rss gauge
otelcol_process_memory_rss{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 3.44559616e+08
# HELP otelcol_process_runtime_heap_alloc_bytes Bytes of allocated heap objects (see 'go doc runtime.MemStats.HeapAlloc')
# TYPE otelcol_process_runtime_heap_alloc_bytes gauge
otelcol_process_runtime_heap_alloc_bytes{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 1.9367716e+08
# HELP otelcol_process_runtime_total_alloc_bytes Cumulative bytes allocated for heap objects (see 'go doc runtime.MemStats.TotalAlloc')
# TYPE otelcol_process_runtime_total_alloc_bytes counter
otelcol_process_runtime_total_alloc_bytes{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 4.7210411432e+10
# HELP otelcol_process_runtime_total_sys_memory_bytes Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys')
# TYPE otelcol_process_runtime_total_sys_memory_bytes gauge
otelcol_process_runtime_total_sys_memory_bytes{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 3.19738952e+08
# HELP otelcol_process_uptime Uptime of the process
# TYPE otelcol_process_uptime counter
otelcol_process_uptime{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 96.713802677
# HELP otelcol_processor_batch_batch_send_size Number of units in the batch
# TYPE otelcol_processor_batch_batch_send_size histogram
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="10"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="25"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="50"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="75"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="100"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="250"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="500"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="750"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="1000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="2000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="3000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="4000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="5000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="6000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="7000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="8000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="9000"} 263
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="10000"} 268
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="20000"} 268
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="30000"} 268
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="50000"} 268
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="100000"} 268
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="+Inf"} 268
otelcol_processor_batch_batch_send_size_sum{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 2.400797e+06
otelcol_processor_batch_batch_send_size_count{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 268
# HELP otelcol_processor_batch_batch_size_trigger_send Number of times the batch was sent due to a size trigger
# TYPE otelcol_processor_batch_batch_size_trigger_send counter
otelcol_processor_batch_batch_size_trigger_send{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 268
# HELP otelcol_receiver_accepted_metric_points Number of metric points successfully pushed into the pipeline.
# TYPE otelcol_receiver_accepted_metric_points counter
otelcol_receiver_accepted_metric_points{receiver="hostmetrics",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",transport=""} 2.405277e+06
# HELP otelcol_receiver_refused_metric_points Number of metric points that could not be pushed into the pipeline.
# TYPE otelcol_receiver_refused_metric_points counter
otelcol_receiver_refused_metric_points{receiver="hostmetrics",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",transport=""} 0
# HELP otelcol_scraper_errored_metric_points Number of metric points that were unable to be scraped.
# TYPE otelcol_scraper_errored_metric_points counter
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="cpu",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="disk",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="filesystem",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="load",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="memory",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="network",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="paging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="process",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 278613
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="processes",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
# HELP otelcol_scraper_scraped_metric_points Number of metric points successfully scraped.
# TYPE otelcol_scraper_scraped_metric_points counter
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="cpu",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 1074
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="disk",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 7518
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="filesystem",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 2148
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="load",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 3225
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="memory",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 1074
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="network",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 5375
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="paging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 3222
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="process",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 1.159584e+06
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="processes",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 2148
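One mitigation worth noting while the leak itself is investigated: neither configuration above places a memory_limiter in front of the batch processor (the first config defines one but never adds it to a pipeline), and the collector documentation recommends memory_limiter as the first processor in every pipeline to cap memory growth. A minimal sketch for the hostmetrics reproduction, with illustrative rather than tuned limits:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500       # illustrative limit, not tuned
    spike_limit_mib: 300  # illustrative spike headroom, not tuned
  batch:
    timeout: 1s
service:
  pipelines:
    metrics:
      receivers: [hostmetrics, otlp]
      processors: [memory_limiter, batch]
      exporters: [kafka, logging]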