Possible memory leak in Alloy
What's wrong?
I tried migrating from grafana-agent to alloy, keeping the flow configuration. Where grafana-agent used ~1.1GB of memory, alloy needs much, much more, and the pod regularly gets OOM-killed.
The instances between 15:00 and 16:30 ran without GOMEMLIMIT and with higher memory limits. It is currently running with the following settings:
extraEnv:
  - name: GOMEMLIMIT
    value: 2000MiB
resources:
  requests:
    memory: 2500Mi
    cpu: 1
  limits:
    memory: 2500Mi
We use grafana-agent / alloy only for downsampling traces and for generating span metrics and a service graph; the config is attached below. This config has worked stably with grafana-agent, with minimal differences: there, we used a batch processor in front of the servicegraph connector to work around the missing metrics_flush_interval option.
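For reference, the grafana-agent flow workaround looked roughly like this (a sketch from memory; the pre_graph component name and the batch timeout value are illustrative, not our exact settings):

otelcol.processor.batch "pre_graph" {
  // Flush batches on a fixed timer so the servicegraph connector emits its
  // metrics periodically, standing in for the missing metrics_flush_interval.
  timeout = "60s"
  output {
    traces = [otelcol.connector.servicegraph.graph.input]
  }
}

otelcol.connector.servicegraph "graph" {
  dimensions = ["http.method", "db.system"]
  store {
    ttl = "30s"
  }
  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}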
Steps to reproduce
1. Deploy alloy in Kubernetes with Helm, as a drop-in replacement for grafana-agent.
2. Feed it traces from mimir/tempo/loki/grafana/prometheus.
3. Watch memory usage grow until the pod is OOM-killed.
System information
Linux 5.10.209 aarch64 on EKS
Software version
Grafana Alloy v1.0.0
Configuration
tracing {
  sampling_fraction = 1.0
  write_to = [
    otelcol.processor.k8sattributes.enrich.input,
    otelcol.processor.transform.prepare_spanmetrics.input,
    otelcol.connector.servicegraph.graph.input,
  ]
}

otelcol.receiver.otlp "main" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    traces = [
      otelcol.processor.k8sattributes.enrich.input,
      otelcol.processor.transform.prepare_spanmetrics.input,
      otelcol.connector.servicegraph.graph.input,
    ]
  }
}

otelcol.receiver.jaeger "main" {
  protocols {
    grpc {
      endpoint = "0.0.0.0:14250"
    }
    thrift_http {
      endpoint = "0.0.0.0:14268"
    }
    thrift_compact {
      endpoint = "0.0.0.0:6831"
    }
  }
  output {
    traces = [
      otelcol.processor.k8sattributes.enrich.input,
      otelcol.processor.transform.prepare_spanmetrics.input,
      otelcol.connector.servicegraph.graph.input,
    ]
  }
}

otelcol.processor.k8sattributes "enrich" {
  output {
    traces = [otelcol.processor.tail_sampling.trace_downsample.input]
  }
}

otelcol.connector.servicegraph "graph" {
  dimensions = ["http.method", "db.system"]
  metrics_flush_interval = "60s"
  store {
    ttl = "30s"
  }
  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.transform "prepare_spanmetrics" {
  error_mode = "ignore"
  trace_statements {
    context = "resource"
    statements = [
      `keep_keys(attributes, ["service.name"])`,
    ]
  }
  output {
    traces = [otelcol.connector.spanmetrics.stats.input]
  }
}

otelcol.connector.spanmetrics "stats" {
  histogram {
    unit = "ms"
    exponential {
    }
  }
  exemplars {
    enabled = true
  }
  aggregation_temporality = "CUMULATIVE"
  namespace = "traces_spanmetrics_"
  metrics_flush_interval = "60s"
  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.tail_sampling "trace_downsample" {
  policy {
    name = "include-all-errors"
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }
  policy {
    name = "include-all-slow-traces"
    type = "latency"
    latency {
      threshold_ms = 5000
    }
  }
  policy {
    name = "include-all-diagnostic-mode-traces"
    type = "boolean_attribute"
    boolean_attribute {
      key = "diagnostics_mode"
      value = true
    }
  }
  policy {
    name = "downsample-all-others"
    type = "probabilistic"
    probabilistic {
      sampling_percentage = 1
    }
  }
  output {
    traces = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.attributes "manual_tagging" {
  action {
    action = "insert"
    key = "garden"
    value = "global"
  }
  action {
    action = "insert"
    key = "generated_by"
    value = "grafana_alloy"
  }
  action {
    action = "insert"
    key = "grafana_alloy_hostname"
    value = constants.hostname
  }
  output {
    traces = [otelcol.processor.batch.main.input]
    metrics = [
      otelcol.processor.batch.main.input,
    ]
  }
}

otelcol.processor.batch "main" {
  output {
    metrics = [otelcol.exporter.prometheus.global.input]
    traces = [otelcol.exporter.otlphttp.tempo.input]
  }
}

otelcol.exporter.otlphttp "tempo" {
  client {
    endpoint = "http://tempo-distributor:4318"
  }
  sending_queue {
    queue_size = 20000
  }
  retry_on_failure {
    enabled = true
    max_elapsed_time = "2h"
    max_interval = "1m"
  }
}

otelcol.exporter.prometheus "global" {
  include_target_info = false
  resource_to_telemetry_conversion = true
  gc_frequency = "1h"
  forward_to = [prometheus.remote_write.global.receiver]
}

prometheus.remote_write "global" {
  endpoint {
    url = "http://mimir-distributor:8080/api/v1/push"
    send_native_histograms = true
  }
}
Logs
Containers:
  alloy:
    Container ID:   containerd://543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84
    Image:          docker.io/grafana/alloy:v1.0.0
    Image ID:       docker.io/grafana/alloy@sha256:21248ad12831ad8f7279eb40ecd161b2574c2194ca76e7413996666d09beef6c
    Ports:          12345/TCP, 4317/TCP, 4318/TCP, 14250/TCP, 14268/TCP, 6831/UDP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/UDP
    Args:
      run
      /etc/alloy/config.alloy
      --storage.path=/tmp/alloy
      --server.http.listen-addr=0.0.0.0:12345
      --server.http.ui-path-prefix=/
      --stability.level=generally-available
    State:          Running
      Started:      Wed, 24 Apr 2024 20:46:12 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 24 Apr 2024 20:37:28 +0200
      Finished:     Wed, 24 Apr 2024 20:45:39 +0200
    Ready:          True
    Restart Count:  3
    Limits:
      memory:  2500Mi
    Requests:
      cpu:     1
      memory:  2500Mi
    Readiness:      http-get http://:12345/-/ready delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ALLOY_DEPLOY_MODE:  helm
      HOSTNAME:           (v1:spec.nodeName)
      GOMEMLIMIT:         2000MiB
    Mounts:
      /etc/alloy from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4b49j (ro)
  config-reloader:
    Container ID:   containerd://ebf2f2409762dccace3af437344667fc73318507e59f6ab38f217813a91224eb
    Image:          ghcr.io/jimmidyson/configmap-reload:v0.12.0
    Image ID:       ghcr.io/jimmidyson/configmap-reload@sha256:a7c754986900e41fc47656bdc8dfce33227112a7cce547e0d9ef5d279f4f8e99
    Port:           <none>
    Host Port:      <none>
    Args:
      --volume-dir=/etc/alloy
      --webhook-url=http://localhost:12345/-/reload
    State:          Running
      Started:      Wed, 24 Apr 2024 20:20:47 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  5Mi
    Environment:    <none>
    Mounts:
      /etc/alloy from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4b49j (ro)
Additional logs from the oom killer:
[13349.644248] alloy invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=679
[13349.645913] CPU: 1 PID: 77501 Comm: alloy Not tainted 5.10.209-198.858.amzn2.aarch64 #1
[13349.647382] Hardware name: Amazon EC2 m7g.large/, BIOS 1.0 11/1/2018
[13349.648527] Call trace:
[13349.648976] dump_backtrace+0x0/0x204
[13349.649715] show_stack+0x1c/0x24
[13349.650387] dump_stack+0xe4/0x12c
[13349.651052] dump_header+0x4c/0x1f0
[13349.651769] oom_kill_process+0x24c/0x250
[13349.652594] out_of_memory+0xdc/0x344
[13349.653326] mem_cgroup_out_of_memory+0x130/0x148
[13349.654279] try_charge+0x55c/0x5cc
[13349.654967] mem_cgroup_charge+0x80/0x240
[13349.655772] do_anonymous_page+0xb8/0x574
[13349.656580] handle_pte_fault+0x1a0/0x218
[13349.657379] __handle_mm_fault+0x1e0/0x380
[13349.658210] handle_mm_fault+0xcc/0x230
[13349.659000] do_page_fault+0x14c/0x410
[13349.659723] do_translation_fault+0xac/0xd0
[13349.660564] do_mem_abort+0x44/0xa0
[13349.661275] el0_da+0x40/0x78
[13349.661888] el0_sync_handler+0xd8/0x120
[13349.662744] memory: usage 2560000kB, limit 2560000kB, failcnt 0
[13349.663826] memory+swap: usage 2560000kB, limit 2560000kB, failcnt 22450
[13349.665033] kmem: usage 8736kB, limit 9007199254740988kB, failcnt 0
[13349.666153] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope:
[13349.666998] anon 2612391936
[13349.666998] file 135168
[13349.666998] kernel_stack 147456
[13349.666998] percpu 0
[13349.666998] sock 0
[13349.666998] shmem 0
[13349.666998] file_mapped 0
[13349.666998] file_dirty 0
[13349.666998] file_writeback 0
[13349.666998] anon_thp 0
[13349.666998] inactive_anon 2612256768
[13349.666998] active_anon 0
[13349.666998] inactive_file 98304
[13349.666998] active_file 0
[13349.666998] unevictable 0
[13349.666998] slab_reclaimable 1049680
[13349.666998] slab_unreclaimable 0
[13349.666998] slab 1049680
[13349.666998] workingset_refault_anon 0
[13349.666998] workingset_refault_file 0
[13349.666998] workingset_activate_anon 0
[13349.666998] workingset_activate_file 0
[13349.666998] workingset_restore_anon 0
[13349.666998] workingset_restore_file 0
[13349.666998] workingset_nodereclaim 0
[13349.666998] pgfault 3747711
[13349.666998] pgmajfault 0
[13349.666998] pgrefill 0
[13349.687594] Tasks state (memory values in pages):
[13349.688440] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[13349.690105] [ 77488] 0 77488 1440016 668293 7532544 0 679 alloy
[13349.691581] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,task=alloy,pid=77488,uid=0
[13349.701478] Memory cgroup out of memory: Killed process 77488 (alloy) total-vm:5760064kB, anon-rss:2550676kB, file-rss:122496kB, shmem-rss:0kB, UID:0 pgtables:7356kB oom_score_adj:679
Can you share a heap pprof dump (curl http://localhost:12345/debug/pprof/heap -o heap.pprof), either here or by DM on the community Slack?
I was just collecting pprof outputs and trying to compare them (this is the first time I'm looking at pprof). Unfortunately I've overwritten the raw dumps; here are two PNGs generated from them:
- from alloy, a few seconds before it was OOM-killed
- from grafana-agent, in the prod environment, with the same type of feed but higher traffic
Turning exemplars off in the spanmetrics connector seems to have stabilized the memory consumption.
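For anyone hitting the same issue, this is the only change relative to the config above (a sketch; every other component stays the same):

otelcol.connector.spanmetrics "stats" {
  histogram {
    unit = "ms"
    exponential {
    }
  }
  // Exemplars disabled; with them enabled, memory kept growing until OOM.
  exemplars {
    enabled = false
  }
  aggregation_temporality = "CUMULATIVE"
  namespace = "traces_spanmetrics_"
  metrics_flush_interval = "60s"
  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}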
I don't see the spanmetrics connector in the pprof of the grafana-agent. Are you running the exact same configs?
This looks related to an issue in the OTel repo, which is fixed in v0.99. We will need to update the OTel dependency in Alloy to pick up the fix.
> I don't see the spanmetrics connector in the pprof of the grafana-agent. Are you running the exact same configs?
Regarding the spanmetrics pipeline: yes, it's the same config. I assume it was among the nodes dropped from the pprof graph due to its low memory impact.
This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
This issue should now be resolved, as Alloy now uses an OTel version that includes the bugfix mentioned above.