Possible memory leak in Alloy
What's wrong?
I tried migrating from grafana-agent to alloy, keeping the flow configuration. Where grafana-agent used ~1.1GB of memory, alloy needs much, much more, and the pod regularly gets OOM-killed.
The instances between 15:00 and 16:30 ran without GOMEMLIMIT and with higher memory limits. It is currently running with the following settings:
extraEnv:
  - name: GOMEMLIMIT
    value: 2000MiB
resources:
  requests:
    memory: 2500Mi
    cpu: 1
  limits:
    memory: 2500Mi
We use grafana-agent / alloy only for downsampling traces and for generating span metrics and a service graph; the config is attached below. This config has worked stably with grafana-agent, with minimal differences: there, we used a batch processor in front of the servicegraph connector to work around the missing metrics_flush_interval option.
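For reference, the grafana-agent flow workaround looked roughly like this (a sketch from memory; the pre_graph component name and the batch timeout value are illustrative, not our exact settings):

otelcol.processor.batch "pre_graph" {
  // Flush batches on a fixed timer so the servicegraph connector emits its
  // metrics periodically, standing in for the missing metrics_flush_interval.
  timeout = "60s"
  output {
    traces = [otelcol.connector.servicegraph.graph.input]
  }
}

otelcol.connector.servicegraph "graph" {
  dimensions = ["http.method", "db.system"]
  store {
    ttl = "30s"
  }
  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}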
Steps to reproduce
1. Deploy alloy in Kubernetes with Helm, as a drop-in replacement for grafana-agent.
2. Feed it traces from mimir/tempo/loki/grafana/prometheus.
3. Watch memory usage grow until the pod is OOM-killed.
System information
Linux 5.10.209 aarch64 on EKS
Software version
Grafana Alloy v1.0.0
Configuration
tracing {
  sampling_fraction = 1.0
  write_to = [
    otelcol.processor.k8sattributes.enrich.input,
    otelcol.processor.transform.prepare_spanmetrics.input,
    otelcol.connector.servicegraph.graph.input,
  ]
}

otelcol.receiver.otlp "main" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    traces = [
      otelcol.processor.k8sattributes.enrich.input,
      otelcol.processor.transform.prepare_spanmetrics.input,
      otelcol.connector.servicegraph.graph.input,
    ]
  }
}

otelcol.receiver.jaeger "main" {
  protocols {
    grpc {
      endpoint = "0.0.0.0:14250"
    }
    thrift_http {
      endpoint = "0.0.0.0:14268"
    }
    thrift_compact {
      endpoint = "0.0.0.0:6831"
    }
  }
  output {
    traces = [
      otelcol.processor.k8sattributes.enrich.input,
      otelcol.processor.transform.prepare_spanmetrics.input,
      otelcol.connector.servicegraph.graph.input,
    ]
  }
}

otelcol.processor.k8sattributes "enrich" {
  output {
    traces = [otelcol.processor.tail_sampling.trace_downsample.input]
  }
}

otelcol.connector.servicegraph "graph" {
  dimensions = ["http.method", "db.system"]
  metrics_flush_interval = "60s"
  store {
    ttl = "30s"
  }
  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.transform "prepare_spanmetrics" {
  error_mode = "ignore"
  trace_statements {
    context = "resource"
    statements = [
      `keep_keys(attributes, ["service.name"])`,
    ]
  }
  output {
    traces = [otelcol.connector.spanmetrics.stats.input]
  }
}

otelcol.connector.spanmetrics "stats" {
  histogram {
    unit = "ms"
    exponential {
    }
  }
  exemplars {
    enabled = true
  }
  aggregation_temporality = "CUMULATIVE"
  namespace = "traces_spanmetrics_"
  metrics_flush_interval = "60s"
  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.tail_sampling "trace_downsample" {
  policy {
    name = "include-all-errors"
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }
  policy {
    name = "include-all-slow-traces"
    type = "latency"
    latency {
      threshold_ms = 5000
    }
  }
  policy {
    name = "include-all-diagnostic-mode-traces"
    type = "boolean_attribute"
    boolean_attribute {
      key = "diagnostics_mode"
      value = true
    }
  }
  policy {
    name = "downsample-all-others"
    type = "probabilistic"
    probabilistic {
      sampling_percentage = 1
    }
  }
  output {
    traces = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.attributes "manual_tagging" {
  action {
    action = "insert"
    key = "garden"
    value = "global"
  }
  action {
    action = "insert"
    key = "generated_by"
    value = "grafana_alloy"
  }
  action {
    action = "insert"
    key = "grafana_alloy_hostname"
    value = constants.hostname
  }
  output {
    traces = [otelcol.processor.batch.main.input]
    metrics = [
      otelcol.processor.batch.main.input,
    ]
  }
}

otelcol.processor.batch "main" {
  output {
    metrics = [otelcol.exporter.prometheus.global.input]
    traces = [otelcol.exporter.otlphttp.tempo.input]
  }
}

otelcol.exporter.otlphttp "tempo" {
  client {
    endpoint = "http://tempo-distributor:4318"
  }
  sending_queue {
    queue_size = 20000
  }
  retry_on_failure {
    enabled = true
    max_elapsed_time = "2h"
    max_interval = "1m"
  }
}

otelcol.exporter.prometheus "global" {
  include_target_info = false
  resource_to_telemetry_conversion = true
  gc_frequency = "1h"
  forward_to = [prometheus.remote_write.global.receiver]
}

prometheus.remote_write "global" {
  endpoint {
    url = "http://mimir-distributor:8080/api/v1/push"
    send_native_histograms = true
  }
}
Logs
Containers:
  alloy:
    Container ID:   containerd://543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84
    Image:          docker.io/grafana/alloy:v1.0.0
    Image ID:       docker.io/grafana/alloy@sha256:21248ad12831ad8f7279eb40ecd161b2574c2194ca76e7413996666d09beef6c
    Ports:          12345/TCP, 4317/TCP, 4318/TCP, 14250/TCP, 14268/TCP, 6831/UDP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/UDP
    Args:
      run
      /etc/alloy/config.alloy
      --storage.path=/tmp/alloy
      --server.http.listen-addr=0.0.0.0:12345
      --server.http.ui-path-prefix=/
      --stability.level=generally-available
    State:          Running
      Started:      Wed, 24 Apr 2024 20:46:12 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 24 Apr 2024 20:37:28 +0200
      Finished:     Wed, 24 Apr 2024 20:45:39 +0200
    Ready:          True
    Restart Count:  3
    Limits:
      memory:  2500Mi
    Requests:
      cpu:     1
      memory:  2500Mi
    Readiness:      http-get http://:12345/-/ready delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ALLOY_DEPLOY_MODE:  helm
      HOSTNAME:           (v1:spec.nodeName)
      GOMEMLIMIT:         2000MiB
    Mounts:
      /etc/alloy from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4b49j (ro)
  config-reloader:
    Container ID:   containerd://ebf2f2409762dccace3af437344667fc73318507e59f6ab38f217813a91224eb
    Image:          ghcr.io/jimmidyson/configmap-reload:v0.12.0
    Image ID:       ghcr.io/jimmidyson/configmap-reload@sha256:a7c754986900e41fc47656bdc8dfce33227112a7cce547e0d9ef5d279f4f8e99
    Port:           <none>
    Host Port:      <none>
    Args:
      --volume-dir=/etc/alloy
      --webhook-url=http://localhost:12345/-/reload
    State:          Running
      Started:      Wed, 24 Apr 2024 20:20:47 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  5Mi
    Environment:    <none>
    Mounts:
      /etc/alloy from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4b49j (ro)
Additional logs from the oom killer:
[13349.644248] alloy invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=679
[13349.645913] CPU: 1 PID: 77501 Comm: alloy Not tainted 5.10.209-198.858.amzn2.aarch64 #1
[13349.647382] Hardware name: Amazon EC2 m7g.large/, BIOS 1.0 11/1/2018
[13349.648527] Call trace:
[13349.648976] dump_backtrace+0x0/0x204
[13349.649715] show_stack+0x1c/0x24
[13349.650387] dump_stack+0xe4/0x12c
[13349.651052] dump_header+0x4c/0x1f0
[13349.651769] oom_kill_process+0x24c/0x250
[13349.652594] out_of_memory+0xdc/0x344
[13349.653326] mem_cgroup_out_of_memory+0x130/0x148
[13349.654279] try_charge+0x55c/0x5cc
[13349.654967] mem_cgroup_charge+0x80/0x240
[13349.655772] do_anonymous_page+0xb8/0x574
[13349.656580] handle_pte_fault+0x1a0/0x218
[13349.657379] __handle_mm_fault+0x1e0/0x380
[13349.658210] handle_mm_fault+0xcc/0x230
[13349.659000] do_page_fault+0x14c/0x410
[13349.659723] do_translation_fault+0xac/0xd0
[13349.660564] do_mem_abort+0x44/0xa0
[13349.661275] el0_da+0x40/0x78
[13349.661888] el0_sync_handler+0xd8/0x120
[13349.662744] memory: usage 2560000kB, limit 2560000kB, failcnt 0
[13349.663826] memory+swap: usage 2560000kB, limit 2560000kB, failcnt 22450
[13349.665033] kmem: usage 8736kB, limit 9007199254740988kB, failcnt 0
[13349.666153] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope:
[13349.666998] anon 2612391936
[13349.666998] file 135168
[13349.666998] kernel_stack 147456
[13349.666998] percpu 0
[13349.666998] sock 0
[13349.666998] shmem 0
[13349.666998] file_mapped 0
[13349.666998] file_dirty 0
[13349.666998] file_writeback 0
[13349.666998] anon_thp 0
[13349.666998] inactive_anon 2612256768
[13349.666998] active_anon 0
[13349.666998] inactive_file 98304
[13349.666998] active_file 0
[13349.666998] unevictable 0
[13349.666998] slab_reclaimable 1049680
[13349.666998] slab_unreclaimable 0
[13349.666998] slab 1049680
[13349.666998] workingset_refault_anon 0
[13349.666998] workingset_refault_file 0
[13349.666998] workingset_activate_anon 0
[13349.666998] workingset_activate_file 0
[13349.666998] workingset_restore_anon 0
[13349.666998] workingset_restore_file 0
[13349.666998] workingset_nodereclaim 0
[13349.666998] pgfault 3747711
[13349.666998] pgmajfault 0
[13349.666998] pgrefill 0
[13349.687594] Tasks state (memory values in pages):
[13349.688440] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[13349.690105] [ 77488] 0 77488 1440016 668293 7532544 0 679 alloy
[13349.691581] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,task=alloy,pid=77488,uid=0
[13349.701478] Memory cgroup out of memory: Killed process 77488 (alloy) total-vm:5760064kB, anon-rss:2550676kB, file-rss:122496kB, shmem-rss:0kB, UID:0 pgtables:7356kB oom_score_adj:679
Can you share a heap pprof dump (curl http://localhost:12345/debug/pprof/heap -o heap.pprof), either here or by DM on the community Slack?
I was just collecting pprof outputs and trying to compare them (this is the first time I'm looking at pprof). Unfortunately I've overwritten the raw dumps; here are two PNGs generated from them:
- from alloy, a few seconds before it was OOM-killed
- from grafana-agent, in the prod environment, with the same type of feed but higher traffic
Turning exemplars off in the spanmetrics connector seems to have stabilized the memory consumption.
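For anyone hitting the same issue, this is the only change relative to the config above (a sketch; every other component stays the same):

otelcol.connector.spanmetrics "stats" {
  histogram {
    unit = "ms"
    exponential {
    }
  }
  // Exemplars disabled; with them enabled, memory kept growing until OOM.
  exemplars {
    enabled = false
  }
  aggregation_temporality = "CUMULATIVE"
  namespace = "traces_spanmetrics_"
  metrics_flush_interval = "60s"
  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}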
I don't see the spanmetrics connector in the pprof of the grafana-agent. Are you running the exact same configs?
This looks related to an issue in the OTel repo, which is fixed in v0.99. We will need to update the OTel dependency in Alloy to pick up the fix.
> I don't see the spanmetrics connector in the pprof of the grafana-agent. Are you running the exact same configs?
Regarding the spanmetrics pipeline: yes, it's the same config. I assume it was among the nodes dropped from the pprof graph due to its low memory impact.
This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
This issue should now be resolved, as Alloy now uses an OTel version that includes the bugfix mentioned above.