Consumes a lot of CPU resources when idling

Luchanso opened this issue 3 years ago • 14 comments

Hi, I want to use the Grafana Agent on a low-CPU VM ($5 DigitalOcean) with other applications running on the same machine. I worry about resources and don't understand why the Grafana Agent consumes so much CPU without any load. Maybe I can fix my configuration?

config.yml:

prometheus:
  configs:
    - name: integrations
      remote_write:
        - basic_auth:
            password: bla
            username: 123
          url: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
      scrape_configs:
        - job_name: apollo-server-integration
          static_configs:
            - targets: ['backend:80']
  global:
    scrape_interval: 60s

and I run it with docker-compose. docker-compose.yml:

version: "3.9"
services:
  grafana_agent:
    container_name: grafana-agent
    image: grafana/agent:v0.21.1
    volumes:
      - ./tmp/agent:/etc/agent/dat
      - ./config.yml:/etc/agent/agent.yaml
    networks:
      - app
networks:
  app:
    external: true
    name: app

Luchanso avatar Dec 05 '21 11:12 Luchanso

A little remark: I ran the Grafana Agent on my local machine with a Core i5-8600 @ 3.1 GHz and checked from inside the container (I installed htop).

Luchanso avatar Dec 05 '21 11:12 Luchanso

Hey! 6% does seem a bit high for the config you posted, and we'll need to do a little bit of digging to figure out why your deployment uses that much CPU.

Can you generate a 29-second CPU profile for us and upload it here? (29 seconds because >=30 is rejected by default). You can do that with a wget of the agent's /debug/pprof/profile?seconds=29 endpoint. If you haven't changed the http_listen_port of the agent, it will be listening on port 80 by default.
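
For example, something along these lines should work from another container on the same Docker network (the container name, network name, and port are taken from your compose file above and are assumptions on my part; adjust them if your setup differs):

docker run --rm --network app -v "$PWD:/out" busybox \
  wget -O /out/profile.pb.gz "http://grafana-agent:80/debug/pprof/profile?seconds=29"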

Also: have you noticed any errors or warnings in the agent's log messages that might point to something being misconfigured?

rfratto avatar Dec 06 '21 14:12 rfratto

@rfratto thank you for the answer. I don't have time to check right now, but you can reproduce it yourself:

git clone https://github.com/Luchanso/grafana-agent-example
cd grafana-agent-example
docker-compose up
docker exec -it grafana-agent bash

Maybe I will check later.

Luchanso avatar Dec 07 '21 09:12 Luchanso

I'm also quite busy :) We can leave this ticket open until you have some time for us to work together on this.

rfratto avatar Dec 07 '21 11:12 rfratto

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 09 '22 03:01 stale[bot]

The high CPU usage is impacting us too. Please let me know if there are any details I can provide to help with it.

edinardo avatar Jan 26 '22 17:01 edinardo

At the moment, we'd still benefit from a CPU profile to be able to track down what might be causing the high CPU usage.

rfratto avatar Jan 26 '22 18:01 rfratto

I am uploading 2 CPU profiles. idle.pb.gz is from an idle instance of the agent based on this configuration, resulting in CPU usage similar to what was described above. load.pb.gz adds logs and traces being ingested, where the CPU usage is higher compared with other agents (DD, for instance) under a similar load.

edinardo avatar Jan 26 '22 18:01 edinardo

Thanks! We'll take a first look at this tomorrow.

rfratto avatar Jan 27 '22 01:01 rfratto

We looked into this and haven't yet found anything that would lead to unusual CPU usage.

I'm marking this as keepalive for now. I think this is going to need quite a bit of investigative work, and it seems like the minimum required CPU utilization might be unexpectedly high for low metric workloads.

rfratto avatar Feb 02 '22 15:02 rfratto

Hey! I'm experiencing the same issue. You can find the profile attached.

And the agent config is in the attached screenshot.

I wonder if the CPU usage could be due to the distance to the Grafana servers. My server runs in France...

profile.gz

I see the same thing when I launch a Prometheus instance in agent mode with remote_write. However, when I don't enable remote_write, CPU usage is fine.

ThomasLachaux avatar Feb 10 '22 22:02 ThomasLachaux

Same here... many times more CPU than nxlog and vmtoolsd (about 100 MHz). RHEL 7.9, ESXi 6.7 build 19898906.

PS: running as an unprivileged user; using Cribl in front of Prometheus (sending to New Relic, Elastic, etc.).


server:
  log_level: error
metrics:
  wal_directory: /tmp/wal
  global:
    scrape_interval: 60s
    remote_write:
    - url: http://X.Y.W.Z:54321/write
integrations:
  node_exporter:
    enabled: true
    set_collectors:
    - cpu
    - cpufreq
    - devstat
    - diskstats
    - filesystem
    - loadavg
    - meminfo
    - netclass
    - netdev

CPU profile attached: profile.zip

dvaguirre avatar Jul 18 '22 06:07 dvaguirre

Maybe it's something related to this (see the attached screenshot)? Any chance to disable its own telemetry?

dvaguirre avatar Jul 21 '22 15:07 dvaguirre

You can disable the telemetry: https://grafana.com/docs/agent/latest/configuration/flags/#report-use-of-feature-flags
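
If I remember right, the relevant flag is -disable-reporting (please double-check the linked docs against your agent version, since this is from memory); passed on the command line it looks roughly like:

grafana-agent -config.file=/etc/grafana-agent.yaml -disable-reporting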

My understanding is that the brief CPU spikes likely come from Prometheus remote_write, but we haven't had the time to investigate yet.

rfratto avatar Jul 26 '22 17:07 rfratto

Observing a similar situation on multiple Linux instances (idling most of the time) with grafana-agent installed from the official repo:

# dpkg -s grafana-agent
Package: grafana-agent
Status: install ok installed
Priority: extra
Section: default
Installed-Size: 284069
Maintainer: <@ccd16eb532f7>
Architecture: amd64
Version: 0.29.0-1
Conffiles:
 /etc/default/grafana-agent 27addbe66f569413656281612a2c4758
 /etc/grafana-agent.yaml f429c117067007ea23badc22873abaa2
Description: no description given
License: Apache 2.0
Vendor: Grafana Labs
Homepage: https://github.com/grafana/agent

Configuration is pretty simple for now:

integrations:
  node_exporter:
    enabled: true
    relabel_configs:
    - replacement: XXX
      target_label: instance
  prometheus_remote_write:
  - basic_auth:
      password: XXX
      username: XXX
    url: https://XXX/api/prom/push
logs:
  configs:
  - clients:
    - basic_auth:
        password: XXX
        username: XXX
      url: https://XXX/loki/api/v1/push
    name: integrations
    positions:
      filename: /tmp/positions.yaml
    scrape_configs:
    - job_name: integrations/node_exporter_direct_scrape
      static_configs:
      - labels:
          __path__: /var/log/{auth,dpkg,kern,mail}.log
          instance: XXX
          job: integrations/node_exporter
        targets:
        - localhost
    target_config:
      sync_period: 10s
metrics:
  configs:
  - name: integrations
    remote_write:
    - basic_auth:
        password: XXX
        username: XXX
      url: https://XXX/api/prom/push
  global:
    scrape_interval: 60s
  wal_directory: /tmp/grafana-agent-wal

syscall top from strace(1) for a few seconds:

# strace -cfp 124063
strace: Process 124063 attached with 11 threads
strace: Process 124063 detached
strace: Process 124067 detached
strace: Process 124068 detached
strace: Process 124069 detached
strace: Process 124070 detached
strace: Process 124071 detached
strace: Process 124121 detached
strace: Process 124122 detached
strace: Process 124127 detached
strace: Process 124138 detached
strace: Process 414683 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.47    9.175278         351     26122      7680 futex
 18.81    3.419738          94     36276        10 epoll_pwait
 18.48    3.360104        3353      1002       384 ppoll
  9.72    1.767663          42     41430           nanosleep
  0.92    0.168039          21      7685       637 read
  0.27    0.048309          30      1594        18 newfstatat
  0.22    0.039707          29      1332           getdents64
  0.20    0.035653          42       839        16 openat
  0.18    0.031978          33       950           tgkill
  0.15    0.027903          29       950       415 rt_sigreturn
  0.14    0.025047          26       950           getpid
  0.10    0.018978          29       639           write
  0.10    0.018183          18       965       679 epoll_ctl
  0.10    0.017771          21       823           close
  0.05    0.009116          25       352           sched_yield
  0.05    0.008370          18       448           fcntl
  0.02    0.003250         464         7           fsync
  0.01    0.002478          25        99           fstat
  0.01    0.001016          21        48           lseek
  0.00    0.000859          61        14           renameat
  0.00    0.000279          46         6           statfs
  0.00    0.000275          39         7           unlinkat
  0.00    0.000050          25         2           readlinkat
  0.00    0.000044          44         1           adjtimex
  0.00    0.000000           0         1           uname
------ ----------- ----------- --------- --------- ----------------
100.00   18.180088                122542      9839 total

pprof top obtained from "support bundle" data (CPU, mutex):

# go tool pprof /usr/bin/grafana-agent cpu.pprof 
File: grafana-agent
Build ID: c9755d8e39a6a429651a15fc20b517d7813a7eb8
Type: cpu
Time: Nov 24, 2022 at 6:41am (UTC)
Duration: 29.79s, Total samples = 380ms ( 1.28%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 270ms, 71.05% of 380ms total
Showing top 10 nodes out of 118
      flat  flat%   sum%        cum   cum%
     120ms 31.58% 31.58%      120ms 31.58%  runtime.futex
      30ms  7.89% 39.47%       30ms  7.89%  runtime.epollwait
      30ms  7.89% 47.37%       30ms  7.89%  runtime/internal/syscall.Syscall6
      20ms  5.26% 52.63%       40ms 10.53%  runtime.gentraceback
      20ms  5.26% 57.89%       50ms 13.16%  runtime.netpoll
      10ms  2.63% 60.53%       20ms  5.26%  fmt.(*pp).printArg
      10ms  2.63% 63.16%       10ms  2.63%  regexp.(*bitState).shouldVisit
      10ms  2.63% 65.79%       20ms  5.26%  runtime.(*itabTableType).find
      10ms  2.63% 68.42%       10ms  2.63%  runtime.(*randomEnum).next
      10ms  2.63% 71.05%       10ms  2.63%  runtime.casgstatus
(pprof)

# go tool pprof /usr/bin/grafana-agent mutex.pprof 
File: grafana-agent
Build ID: c9755d8e39a6a429651a15fc20b517d7813a7eb8
Type: delay
Time: Nov 24, 2022 at 6:41am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 329.56us, 100% of 329.56us total
      flat  flat%   sum%        cum   cum%
  329.56us   100%   100%   329.56us   100%  sync.(*Mutex).Unlock
         0     0%   100%   329.56us   100%  github.com/go-kit/log.(*context).Log
         0     0%   100%   329.56us   100%  github.com/go-kit/log.(*syncLogger).Log
         0     0%   100%   329.56us   100%  github.com/grafana/agent/pkg/server.(*HookLogger).Log
         0     0%   100%   329.56us   100%  github.com/grafana/agent/pkg/server.(*Logger).Log
         0     0%   100%   170.53us 51.75%  github.com/prometheus/node_exporter/collector.(*diskstatsCollector).Update
         0     0%   100%   329.56us   100%  github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1
         0     0%   100%   329.56us   100%  github.com/prometheus/node_exporter/collector.execute

Appreciate any pointers. Currently grafana-agent is the top CPU consumer among the running services, which is weird.

defanator avatar Nov 24 '22 08:11 defanator

At the moment, we're pretty sure that the CPU spikes are coming from Prometheus' remote_write, which doesn't seem to be optimized well for very low metric workloads. @cstyan theorized offline that there's a lot of mutex unlocking which could be responsible for the spikes, which does appear to be the case in the profiles you've listed above.

Assuming that's true and the mutexes in remote_write are responsible for the CPU spikes, there will need to be an upstream commit to Prometheus to optimize remote_write so it doesn't need to lock as much.

rfratto avatar Nov 24 '22 16:11 rfratto

@defanator is the CPU usage you're seeing spiky? Or is it constantly elevated?

If you temporarily remove the remote_write sections and reload/restart the agent, do you see CPU usage decrease?
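
A rough way to check, assuming a package install with the bundled systemd unit and pidstat from sysstat available (both assumptions on my part; adjust to your environment): comment out the remote_write blocks in /etc/grafana-agent.yaml, then:

sudo systemctl restart grafana-agent
# sample the agent's CPU every 5 seconds for one minute
pidstat -p "$(pidof grafana-agent)" 5 12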

rfratto avatar Nov 24 '22 16:11 rfratto

I just did some tests locally and found at least one of the causes for the high CPU utilization is Prometheus' WAL watcher. It's hard coded to read the WAL every 10ms, which is way too fast for low scrape loads.

I played around with lowering the frequency to once per second and my CPU utilization dropped from ~4% to ~0.1%.

@codesome Is there a reason why the WAL watcher couldn't use fsnotify for WAL segments? Alternatively I could pass a channel to the WAL watcher to inform it when new samples are written.

rfratto avatar Nov 24 '22 17:11 rfratto

@rfratto thanks for your responses.

  1. Removing both remote_write blocks from the configuration in question makes grafana-agent consume less CPU (the process almost disappears from top/htop). However, the strace top remains the same:
# strace -cfp 2245842
strace: Process 2245842 attached with 10 threads
strace: pid 2245845: entering, ptrace_syscall_info.op == 2
^Cstrace: Process 2245842 detached
strace: Process 2245843 detached
strace: Process 2245844 detached
strace: Process 2245845 detached
strace: Process 2245846 detached
strace: Process 2245847 detached
strace: Process 2245848 detached
strace: Process 2245849 detached
strace: Process 2246043 detached
strace: Process 2246254 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 45.65    0.380908         211      1800       536 futex
 23.62    0.197079       65693         3           epoll_wait
  9.79    0.081700          28      2907           epoll_pwait
  9.42    0.078621           7     10190           clock_gettime
  9.12    0.076087          51      1477           nanosleep
  1.34    0.011179          12       920         8 newfstatat
  0.25    0.002125         531         4           fsync
  0.19    0.001545          12       119           write
  0.12    0.001021           8       122         1 read
  0.10    0.000856          15        54           tgkill
  0.07    0.000608          10        60           sched_yield
  0.07    0.000606          11        54           rt_sigreturn
  0.07    0.000605          11        54           getpid
  0.05    0.000398          24        16           openat
  0.03    0.000262          32         8           renameat
  0.03    0.000214           5        36           fstat
  0.02    0.000187           5        36           lseek
  0.02    0.000129           8        16           close
  0.02    0.000128          16         8           getdents64
  0.01    0.000094           5        16        16 epoll_ctl
  0.01    0.000055          13         4           unlinkat
  0.00    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.834407                 17905       561 total
  2. With remote_write blocks enabled, I'm not seeing any noticeable spikes; it's just constantly using more CPU on an idle machine.

Also, below are strace tops for ~10 seconds of tracing all threads, first with remote_write disabled:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 60.63    0.128456         250       512       149 futex
 13.24    0.028044           9      2858           clock_gettime
 12.64    0.026778          34       783           epoll_pwait
 10.05    0.021304          46       454           nanosleep
  1.90    0.004031          16       240         2 newfstatat
  0.30    0.000636         636         1           fsync
  0.23    0.000497          16        31           write
  0.16    0.000336          22        15           tgkill
  0.16    0.000332          11        30           read
  0.15    0.000320          14        22           sched_yield
  0.12    0.000254          16        15           getpid
  0.11    0.000239          15        15           rt_sigreturn
  0.08    0.000162          40         4           openat
  0.07    0.000158          79         2           renameat
  0.04    0.000086           9         9           fstat
  0.03    0.000068          17         4           close
  0.03    0.000057           7         8           lseek
  0.02    0.000048          12         4         4 epoll_ctl
  0.02    0.000042          21         2           getdents64
  0.01    0.000030          30         1           unlinkat
  0.00    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.211878                  5011       155 total

remote_write enabled:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 48.31    0.989828         261      3790      1076 futex
 16.35    0.334982      334982         1           epoll_wait
 11.53    0.236193          41      5653           epoll_pwait
 10.91    0.223487          63      3544           nanosleep
 10.19    0.208711           9     21684           clock_gettime
  0.91    0.018647          14      1264        14 read
  0.33    0.006784          25       261        17 openat
  0.26    0.005348          13       388           fcntl
  0.25    0.005053          13       364       124 epoll_ctl
  0.19    0.003919          13       281           sched_yield
  0.19    0.003834          15       245           getdents64
  0.18    0.003662          14       261         8 newfstatat
  0.16    0.003290          13       244           close
  0.07    0.001361          24        56           tgkill
  0.04    0.000823          14        56           getpid
  0.03    0.000684          12        56           rt_sigreturn
  0.03    0.000677          15        44           write
  0.02    0.000505         505         1           fsync
  0.02    0.000443          21        21         1 statfs
  0.02    0.000440          12        35           fstat
  0.00    0.000093          46         2           renameat
  0.00    0.000056          56         1           restart_syscall
  0.00    0.000041          41         1           unlinkat
  0.00    0.000019           2         8           lseek
  0.00    0.000015          15         1           adjtimex
  0.00    0.000013          13         1           uname
------ ----------- ----------- --------- --------- ----------------
100.00    2.048908                 38263      1240 total

(I was a bit surprised by the difference in futex and clock_gettime invocation counts.)

defanator avatar Nov 25 '22 10:11 defanator

@codesome Is there a reason why the WAL watcher couldn't use fsnotify for WAL segments? Alternatively I could pass a channel to the WAL watcher to inform it when new samples are written.

That is something @cstyan or @csmarchbanks would know. (I might be commenting too late; I faintly remember you discussing this with them some time ago.)

codesome avatar Dec 14 '22 05:12 codesome

When originally implementing the WAL-based remote write changes, we found fsnotify very unreliable. I don't remember the details since this was more than 4 years ago; perhaps it was just the number of events sent for writes causing us to lose events for new segment creation, but the WAL-based approach was not going to work correctly with fsnotify.

cstyan avatar Jan 23 '23 16:01 cstyan

When originally implementing the WAL based remote write changes we found fsnotify very unreliable.

I found the same in testing out an fsnotify approach. See my comment on prometheus/prometheus#11625:

I played around with an fsnotify approach and couldn't get it to work as expected. At least on my Mac, fsnotify was only emitting events every ~60 seconds with a 5 second scrape interval. I can't figure out why, which isn't giving me much faith in the fsnotify-based approach.

rfratto avatar Jan 24 '23 11:01 rfratto

Same issue here. Can you provide a config option to change the frequency?

sanyatuning avatar Feb 02 '23 09:02 sanyatuning

Just wanted to confirm seeing the same issue in the flow version of the agent.

hanikesn avatar Jun 18 '23 19:06 hanikesn

A fix for this has been merged into Prometheus (prometheus/prometheus#11949). We'll be able to bring it downstream into the agent when Prometheus 2.45 is available.

rfratto avatar Jun 19 '23 21:06 rfratto

This change has been ported over in https://github.com/grafana/agent/pull/3942/, so closing this as completed 🙌

tpaschalis avatar Sep 27 '23 08:09 tpaschalis

I am using v0.36.2 in static mode, running grafana-agent as a sidecar to Airflow workers. I am observing a constant upward trend in CPU consumption over time. Yes, prometheus_remote_write is configured as well. The sidecar grafana-agent uses the config below:

    integrations:
      prometheus_remote_write:
      - basic_auth:
          username: <VICTORIAMETRIC-USERNAME>
          password: <VICTORIAMETRIC-PASSWORD>
        url: <VICTORIAMETRIC-URL>
      agent:
        enabled: true
        relabel_configs:
        - action: replace
          source_labels:
          - agent_hostname
          target_label: instance
    logs:
      global:
      configs:
      - name: integrations
        clients:
        - url: <LOKI-URL>
          tenant_id: <some-tenant>
          basic_auth:
            username: <LOKI-USERNAME>
            password: <LOKI-PASSWORD>
        positions:
          filename: /tmp/positions.yaml
        scrape_configs:
        - job_name: integrations/apache-airflow
          static_configs:
          - targets: [localhost]
            labels:
              job: integrations/apache-airflow
              instance: <INSTANCE_NAME>
              pod_name: <POD_NAME>
              __path__: /logs/dag_id=*/**/*.log
          pipeline_stages:
          - match:
              selector: '{job="integrations/apache-airflow",instance="<INSTANCE_NAME>"}'
              stages:
              - regex:
                  source: filename
                  expression: "/logs/dag_id=(?P<dag_id>\\S+?)/run_id=(?P<run_id>\\S+?)/task_id=(?P<task_id>\\S+?)(?:/map_index=(?P<map_index>\\d+))?(?:/attempt=(?P<attempt>\\d+))?\\.log"
              - labels:
                  dag_id:
                  task_id:
                  map_index:
                  run_id:
                  attempt:
          - multiline:
              # match on timestamp. Format should be like '[2023-05-09T11:58:53.205+0000]'
              firstline: '\\[\\d+-\\d+-\\d+T\\d+:\\d+:\\d+\\.\\d+\\+\\d+\\]'
    metrics:
      global:
        scrape_interval: 60s
      wal_directory: /tmp/grafana-agent-wal

The log files in Airflow are deleted every 9 days; however, no visible CPU drop is observed.

P.S. I am also confused here, as the milestone is marked v0.37.0 but the changelog says v0.36.0. If this fix is supposed to land in v0.37.0, then what I'm seeing is expected.

Upanshu11 avatar Feb 15 '24 11:02 Upanshu11

Hey there @Upanshu11.

We're no longer actively tracking closed issues; I just happened to see this. If you think there's still new information, could you please open a new issue and attach CPU and goroutine profiles by hitting the pprof endpoints of running agents?
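
For example (the host and port are placeholders; use whatever address your agent's HTTP server listens on):

curl -o cpu.pprof "http://<agent-host>:<http-port>/debug/pprof/profile?seconds=29"
curl -o goroutine.pprof "http://<agent-host>:<http-port>/debug/pprof/goroutine"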

tpaschalis avatar Feb 20 '24 08:02 tpaschalis

Hey @tpaschalis. I realised the error was in my code; the file rotation was a little buggy. I fixed it and the CPU usage came down.

Upanshu11 avatar Feb 20 '24 14:02 Upanshu11

Good to hear, thanks!

tpaschalis avatar Feb 20 '24 15:02 tpaschalis