Consumes a lot of CPU resources when idling
Hi, I want to use Grafana Agent on a low-CPU VM (a $5 DigitalOcean droplet) with another application running on the same machine. I'm worried about resources and don't understand why Grafana Agent consumes so much CPU without any load. Maybe I can fix my configuration?
config.yml:
prometheus:
  configs:
    - name: integrations
      remote_write:
        - basic_auth:
            password: bla
            username: 123
          url: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
      scrape_configs:
        - job_name: apollo-server-integration
          static_configs:
            - targets: ['backend:80']
  global:
    scrape_interval: 60s
and I run it with docker-compose. docker-compose.yml:
version: "3.9"
services:
  grafana_agent:
    container_name: grafana-agent
    image: grafana/agent:v0.21.1
    volumes:
      - ./tmp/agent:/etc/agent/dat
      - ./config.yml:/etc/agent/agent.yaml
    networks:
      - app
networks:
  app:
    external: true
    name: app
A little remark: right now I run Grafana Agent on my local machine, with a Core i5-8600 @ 3.1 GHz.
Inside the container (I installed htop), htop shows the agent at around 6% CPU while idle.
Hey! 6% does seem a bit high for the config you posted, and we'll need to do a little bit of digging to figure out why your deployment uses that much CPU.
Can you generate a 29-second CPU profile for us and upload it here? (29 seconds because >=30 is rejected by default.) You can do that with a wget of the agent's /debug/pprof/profile?seconds=29 endpoint. If you haven't changed the http_listen_port of the agent, it will be listening on port 80 by default.
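For example, from inside the container (docker exec), and assuming wget is available there and the default port 80 hasn't been changed, something like this should do it:

wget -O profile.pb.gz 'http://localhost:80/debug/pprof/profile?seconds=29'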
Also: have you noticed any errors or warnings in the agent's log messages that might point to something being misconfigured?
@rfratto thank you for the answer. I don't have time to check right now, but you can reproduce it yourself:
git clone https://github.com/Luchanso/grafana-agent-example
cd grafana-agent-example
docker-compose up
docker exec -it grafana-agent bash
Maybe I will check it later.
I'm also quite busy :) We can leave this ticket open until you have some time for us to work together on this.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
The high CPU usage is impacting us too. Please let me know if there are any details I can provide to help with it.
At the moment, we'd still benefit from a CPU profile to be able to track down what might be causing the high CPU usage.
I am uploading 2 CPU profiles. idle.pb.gz is from an idle instance of the agent based on this configuration, resulting in CPU usage similar to what is described above. load.pb.gz adds logs and traces being ingested, where the CPU usage is higher compared with other agents (DD, for instance) under a similar load.
Thanks! We'll take a first look at this tomorrow.
We looked into this and haven't yet found anything that would lead to unusual CPU usage.
I'm marking this as keepalive for now. I think this is going to need quite a bit of investigative work, and it seems like the minimum required CPU utilization might be unexpectedly high for low metric workloads.
Hey! I'm experiencing the same issue. You can find the profile attached.
And the agent config just below.
I wonder if the CPU usage could be due to the distance to the Grafana servers. My server runs in France...
I experience the same thing when I launch a Prometheus instance in agent mode with remote_write. However, when I don't enable remote_write, the CPU usage is OK.
Same here... n times more CPU than nxlog and vmtoolsd (about 100 MHz). RHEL 7.9, ESXi 6.7 build 19898906.
PS: running as an unprivileged user; using Cribl in front of Prometheus (sending to New Relic, Elastic, etc.).
server:
  log_level: error
metrics:
  wal_directory: /tmp/wal
  global:
    scrape_interval: 60s
    remote_write:
      - url: http://X.Y.W.Z:54321/write
integrations:
  node_exporter:
    enabled: true
    set_collectors:
      - cpu
      - cpufreq
      - devstat
      - diskstats
      - filesystem
      - loadavg
      - meminfo
      - netclass
      - netdev
CPU profile attached: profile.zip
Maybe it's something related to this? Any chance to disable its own telemetry?
You can disable the telemetry: https://grafana.com/docs/agent/latest/configuration/flags/#report-use-of-feature-flags
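For example, a minimal sketch of passing it on the command line; note that -disable-reporting is my reading of the flag described on that docs page, so double-check the exact name against your agent version:

# assumption: -disable-reporting is the flag documented on the page linked above
grafana-agent -config.file=/etc/grafana-agent.yaml -disable-reporting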
My understanding is that the brief CPU spikes likely come from Prometheus remote_write, but we haven't had the time to investigate yet.
I'm observing a similar situation on multiple Linux instances (idling most of the time) with grafana-agent installed from the official repo:
# dpkg -s grafana-agent
Package: grafana-agent
Status: install ok installed
Priority: extra
Section: default
Installed-Size: 284069
Maintainer: <@ccd16eb532f7>
Architecture: amd64
Version: 0.29.0-1
Conffiles:
/etc/default/grafana-agent 27addbe66f569413656281612a2c4758
/etc/grafana-agent.yaml f429c117067007ea23badc22873abaa2
Description: no description given
License: Apache 2.0
Vendor: Grafana Labs
Homepage: https://github.com/grafana/agent
Configuration is pretty simple for now:
integrations:
  node_exporter:
    enabled: true
    relabel_configs:
      - replacement: XXX
        target_label: instance
  prometheus_remote_write:
    - basic_auth:
        password: XXX
        username: XXX
      url: https://XXX/api/prom/push
logs:
  configs:
    - clients:
        - basic_auth:
            password: XXX
            username: XXX
          url: https://XXX/loki/api/v1/push
      name: integrations
      positions:
        filename: /tmp/positions.yaml
      scrape_configs:
        - job_name: integrations/node_exporter_direct_scrape
          static_configs:
            - labels:
                __path__: /var/log/{auth,dpkg,kern,mail}.log
                instance: XXX
                job: integrations/node_exporter
              targets:
                - localhost
      target_config:
        sync_period: 10s
metrics:
  configs:
    - name: integrations
      remote_write:
        - basic_auth:
            password: XXX
            username: XXX
          url: https://XXX/api/prom/push
  global:
    scrape_interval: 60s
  wal_directory: /tmp/grafana-agent-wal
syscall top from strace(1) for a few seconds:
# strace -cfp 124063
strace: Process 124063 attached with 11 threads
strace: Process 124063 detached
strace: Process 124067 detached
strace: Process 124068 detached
strace: Process 124069 detached
strace: Process 124070 detached
strace: Process 124071 detached
strace: Process 124121 detached
strace: Process 124122 detached
strace: Process 124127 detached
strace: Process 124138 detached
strace: Process 414683 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
50.47 9.175278 351 26122 7680 futex
18.81 3.419738 94 36276 10 epoll_pwait
18.48 3.360104 3353 1002 384 ppoll
9.72 1.767663 42 41430 nanosleep
0.92 0.168039 21 7685 637 read
0.27 0.048309 30 1594 18 newfstatat
0.22 0.039707 29 1332 getdents64
0.20 0.035653 42 839 16 openat
0.18 0.031978 33 950 tgkill
0.15 0.027903 29 950 415 rt_sigreturn
0.14 0.025047 26 950 getpid
0.10 0.018978 29 639 write
0.10 0.018183 18 965 679 epoll_ctl
0.10 0.017771 21 823 close
0.05 0.009116 25 352 sched_yield
0.05 0.008370 18 448 fcntl
0.02 0.003250 464 7 fsync
0.01 0.002478 25 99 fstat
0.01 0.001016 21 48 lseek
0.00 0.000859 61 14 renameat
0.00 0.000279 46 6 statfs
0.00 0.000275 39 7 unlinkat
0.00 0.000050 25 2 readlinkat
0.00 0.000044 44 1 adjtimex
0.00 0.000000 0 1 uname
------ ----------- ----------- --------- --------- ----------------
100.00 18.180088 122542 9839 total
pprof top obtained from "support bundle" data (CPU, mutex):
# go tool pprof /usr/bin/grafana-agent cpu.pprof
File: grafana-agent
Build ID: c9755d8e39a6a429651a15fc20b517d7813a7eb8
Type: cpu
Time: Nov 24, 2022 at 6:41am (UTC)
Duration: 29.79s, Total samples = 380ms ( 1.28%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 270ms, 71.05% of 380ms total
Showing top 10 nodes out of 118
flat flat% sum% cum cum%
120ms 31.58% 31.58% 120ms 31.58% runtime.futex
30ms 7.89% 39.47% 30ms 7.89% runtime.epollwait
30ms 7.89% 47.37% 30ms 7.89% runtime/internal/syscall.Syscall6
20ms 5.26% 52.63% 40ms 10.53% runtime.gentraceback
20ms 5.26% 57.89% 50ms 13.16% runtime.netpoll
10ms 2.63% 60.53% 20ms 5.26% fmt.(*pp).printArg
10ms 2.63% 63.16% 10ms 2.63% regexp.(*bitState).shouldVisit
10ms 2.63% 65.79% 20ms 5.26% runtime.(*itabTableType).find
10ms 2.63% 68.42% 10ms 2.63% runtime.(*randomEnum).next
10ms 2.63% 71.05% 10ms 2.63% runtime.casgstatus
(pprof)
# go tool pprof /usr/bin/grafana-agent mutex.pprof
File: grafana-agent
Build ID: c9755d8e39a6a429651a15fc20b517d7813a7eb8
Type: delay
Time: Nov 24, 2022 at 6:41am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 329.56us, 100% of 329.56us total
flat flat% sum% cum cum%
329.56us 100% 100% 329.56us 100% sync.(*Mutex).Unlock
0 0% 100% 329.56us 100% github.com/go-kit/log.(*context).Log
0 0% 100% 329.56us 100% github.com/go-kit/log.(*syncLogger).Log
0 0% 100% 329.56us 100% github.com/grafana/agent/pkg/server.(*HookLogger).Log
0 0% 100% 329.56us 100% github.com/grafana/agent/pkg/server.(*Logger).Log
0 0% 100% 170.53us 51.75% github.com/prometheus/node_exporter/collector.(*diskstatsCollector).Update
0 0% 100% 329.56us 100% github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1
0 0% 100% 329.56us 100% github.com/prometheus/node_exporter/collector.execute
Appreciate any pointers. Currently, grafana-agent is the top CPU consumer among the running services, which is weird.
At the moment, we're pretty sure that the CPU spikes are coming from Prometheus' remote_write, which doesn't seem to be optimized well for very low metric workloads. @cstyan theorized offline that there's a lot of mutex unlocking which could be responsible for the spikes, which does appear to be the case in the profiles you've listed above.
Assuming that's true and the mutexes in remote_write are responsible for the CPU spikes, there will need to be an upstream commit to Prometheus to optimize remote_write so it doesn't need to lock as much.
@defanator is the CPU usage you're seeing spiky? Or is it constantly elevated?
If you temporarily remove the remote_write sections and reload/restart the agent, do you see CPU usage decrease?
I just did some tests locally and found at least one of the causes for the high CPU utilization is Prometheus' WAL watcher. It's hard coded to read the WAL every 10ms, which is way too fast for low scrape loads.
I played around with lowering the frequency to once per second and my CPU utilization dropped from ~4% to ~0.1%.
@codesome Is there a reason why the WAL watcher couldn't use fsnotify for WAL segments? Alternatively I could pass a channel to the WAL watcher to inform it when new samples are written.
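To make the mechanism concrete, here is a minimal, simplified sketch of that kind of polling loop (not the actual Prometheus watcher code; readNewData is a placeholder for "read from the last WAL position to the end of the segment"). Lowering readTick from 10ms to 1s is the experiment described above:

package main

import (
	"log"
	"time"
)

// watchWAL polls for newly written WAL data on a fixed interval. With a
// 10ms tick, the loop wakes up ~100 times per second even when nothing has
// been written, which is where the idle CPU goes for low scrape loads.
func watchWAL(readTick time.Duration, readNewData func() (int, error), stop <-chan struct{}) {
	ticker := time.NewTicker(readTick)
	defer ticker.Stop()

	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			n, err := readNewData()
			if err != nil {
				log.Println("wal watcher: read error:", err)
				continue
			}
			if n == 0 {
				// Nothing new since the last tick; on an idle agent this is
				// the overwhelmingly common case.
				continue
			}
			// In the real watcher, new samples would be handed off to the
			// remote_write queue manager here.
		}
	}
}

func main() {
	stop := make(chan struct{})
	// Pretend nothing is ever written, as on an idle agent.
	go watchWAL(10*time.Millisecond, func() (int, error) { return 0, nil }, stop)
	time.Sleep(500 * time.Millisecond)
	close(stop)
}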
@rfratto thanks for your responses.
- Removing both remote_write blocks from the configuration in question makes grafana-agent consume less CPU (the process almost disappears from top/htop). However, the strace top remains the same:
# strace -cfp 2245842
strace: Process 2245842 attached with 10 threads
strace: pid 2245845: entering, ptrace_syscall_info.op == 2
^Cstrace: Process 2245842 detached
strace: Process 2245843 detached
strace: Process 2245844 detached
strace: Process 2245845 detached
strace: Process 2245846 detached
strace: Process 2245847 detached
strace: Process 2245848 detached
strace: Process 2245849 detached
strace: Process 2246043 detached
strace: Process 2246254 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
45.65 0.380908 211 1800 536 futex
23.62 0.197079 65693 3 epoll_wait
9.79 0.081700 28 2907 epoll_pwait
9.42 0.078621 7 10190 clock_gettime
9.12 0.076087 51 1477 nanosleep
1.34 0.011179 12 920 8 newfstatat
0.25 0.002125 531 4 fsync
0.19 0.001545 12 119 write
0.12 0.001021 8 122 1 read
0.10 0.000856 15 54 tgkill
0.07 0.000608 10 60 sched_yield
0.07 0.000606 11 54 rt_sigreturn
0.07 0.000605 11 54 getpid
0.05 0.000398 24 16 openat
0.03 0.000262 32 8 renameat
0.03 0.000214 5 36 fstat
0.02 0.000187 5 36 lseek
0.02 0.000129 8 16 close
0.02 0.000128 16 8 getdents64
0.01 0.000094 5 16 16 epoll_ctl
0.01 0.000055 13 4 unlinkat
0.00 0.000000 0 1 restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.834407 17905 561 total
- With remote_write blocks enabled, I'm not seeing any noticeable spikes; it's just constantly using more CPU on an idle machine.
Also, below are strace tops for ~10 seconds of tracing all threads. remote_write disabled:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
60.63 0.128456 250 512 149 futex
13.24 0.028044 9 2858 clock_gettime
12.64 0.026778 34 783 epoll_pwait
10.05 0.021304 46 454 nanosleep
1.90 0.004031 16 240 2 newfstatat
0.30 0.000636 636 1 fsync
0.23 0.000497 16 31 write
0.16 0.000336 22 15 tgkill
0.16 0.000332 11 30 read
0.15 0.000320 14 22 sched_yield
0.12 0.000254 16 15 getpid
0.11 0.000239 15 15 rt_sigreturn
0.08 0.000162 40 4 openat
0.07 0.000158 79 2 renameat
0.04 0.000086 9 9 fstat
0.03 0.000068 17 4 close
0.03 0.000057 7 8 lseek
0.02 0.000048 12 4 4 epoll_ctl
0.02 0.000042 21 2 getdents64
0.01 0.000030 30 1 unlinkat
0.00 0.000000 0 1 restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.211878 5011 155 total
remote_write enabled:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
48.31 0.989828 261 3790 1076 futex
16.35 0.334982 334982 1 epoll_wait
11.53 0.236193 41 5653 epoll_pwait
10.91 0.223487 63 3544 nanosleep
10.19 0.208711 9 21684 clock_gettime
0.91 0.018647 14 1264 14 read
0.33 0.006784 25 261 17 openat
0.26 0.005348 13 388 fcntl
0.25 0.005053 13 364 124 epoll_ctl
0.19 0.003919 13 281 sched_yield
0.19 0.003834 15 245 getdents64
0.18 0.003662 14 261 8 newfstatat
0.16 0.003290 13 244 close
0.07 0.001361 24 56 tgkill
0.04 0.000823 14 56 getpid
0.03 0.000684 12 56 rt_sigreturn
0.03 0.000677 15 44 write
0.02 0.000505 505 1 fsync
0.02 0.000443 21 21 1 statfs
0.02 0.000440 12 35 fstat
0.00 0.000093 46 2 renameat
0.00 0.000056 56 1 restart_syscall
0.00 0.000041 41 1 unlinkat
0.00 0.000019 2 8 lseek
0.00 0.000015 15 1 adjtimex
0.00 0.000013 13 1 uname
------ ----------- ----------- --------- --------- ----------------
100.00 2.048908 38263 1240 total
(I was a bit surprised by the difference in futex and clock_gettime invocation counts.)
@codesome Is there a reason why the WAL watcher couldn't use fsnotify for WAL segments? Alternatively I could pass a channel to the WAL watcher to inform it when new samples are written.
That is something @cstyan or @csmarchbanks would know. (I might be commenting on this too late; I faintly remember you discussing it with them some time ago.)
When originally implementing the WAL based remote write changes we found fsnotify very unreliable. I don't remember the details since this was more than 4 years ago; perhaps it was just due to the number of events sent for writes and we lost events for new segment creation, but the WAL-based approach was not going to work correctly using fsnotify.
When originally implementing the WAL based remote write changes we found fsnotify very unreliable.
I found the same in testing out an fsnotify approach. See my comment on prometheus/prometheus#11625:
I played around with an fsnotify approach and couldn't get it to work as expected. At least on my Mac, fsnotify was only emitting events every ~60 seconds with a 5 second scrape interval. I can't figure out why, which isn't giving me much faith in the fsnotify-based approach.
Same issue here. Can you provide a config option to change the frequency?
Just wanted to confirm I'm seeing the same issue in the Flow version of the agent.
A fix for this has been merged into Prometheus (prometheus/prometheus#11949). We'll be able to bring it downstream into the agent when Prometheus 2.45 is available.
This change has been ported over in https://github.com/grafana/agent/pull/3942/, so closing this as completed 🙌
I am using v0.36.2 in static mode, running grafana-agent as a sidecar to Airflow workers. I am observing a constant upward trend in CPU consumption over time. Yes, prometheus_remote_write is configured as well. The sidecar grafana-agent uses the config below:
integrations:
  prometheus_remote_write:
    - basic_auth:
        username: <VICTORIAMETRIC-USERNAME>
        password: <VICTORIAMETRIC-PASSWORD>
      url: <VICTORIAMETRIC-URL>
  agent:
    enabled: true
    relabel_configs:
      - action: replace
        source_labels:
          - agent_hostname
        target_label: instance
logs:
  global:
  configs:
    - name: integrations
      clients:
        - url: <LOKI-URL>
          tenant_id: <some-tenant>
          basic_auth:
            username: <LOKI-USERNAME>
            password: <LOKI-PASSWORD>
      positions:
        filename: /tmp/positions.yaml
      scrape_configs:
        - job_name: integrations/apache-airflow
          static_configs:
            - targets: [localhost]
              labels:
                job: integrations/apache-airflow
                instance: <INSTANCE_NAME>
                pod_name: <POD_NAME>
                __path__: /logs/dag_id=*/**/*.log
          pipeline_stages:
            - match:
                selector: '{job="integrations/apache-airflow",instance="<INSTANCE_NAME>"}'
                stages:
                  - regex:
                      source: filename
                      expression: "/logs/dag_id=(?P<dag_id>\\S+?)/run_id=(?P<run_id>\\S+?)/task_id=(?P<task_id>\\S+?)(?:/map_index=(?P<map_index>\\d+))?(?:/attempt=(?P<attempt>\\d+))?\\.log"
                  - labels:
                      dag_id:
                      task_id:
                      map_index:
                      run_id:
                      attempt:
                  - multiline:
                      # match on timestamp. Format should be like '[2023-05-09T11:58:53.205+0000]'
                      firstline: '\\[\\d+-\\d+-\\d+T\\d+:\\d+:\\d+\\.\\d+\\+\\d+\\]'
metrics:
  global:
    scrape_interval: 60s
  wal_directory: /tmp/grafana-agent-wal
The files in Airflow are deleted every 9 days. However, no visible CPU drop is observed.
P.S. I am also confused here, as the milestone is marked as v0.37.0 but the changelog says v0.36.0. If this is only supposed to land in v0.37.0, then what I'm seeing is expected.
Hey there @Upanshu11.
We're no longer actively tracking closed issues; I just happened to see this. If you think there's still new information, could you please open a new issue and attach CPU and goroutine profiles by hitting the pprof endpoints of running agents?
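For reference, a minimal sketch of collecting those two profiles with curl; localhost:12345 is a placeholder, so substitute whatever HTTP listen address your agent is configured with:

curl -o cpu.pprof 'http://localhost:12345/debug/pprof/profile?seconds=29'
curl -o goroutine.pprof 'http://localhost:12345/debug/pprof/goroutine'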
Hey @tpaschalis. I realised the error was in my code; the file rotation was a little buggy. I fixed it and the CPU usage came down.
Good to hear, thanks!