node_exporter uses excessive CPU on a ThunderX2 host
Host operating system: output of uname -a
Linux rhel-8-2-aarch64-gpu-1 4.18.0-193.14.2.el8_2.aarch64 #1 SMP Wed Jul 29 19:46:42 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
node_exporter version: output of node_exporter --version
1.0.1
node_exporter command line flags
--collector.systemd --collector.textfile --collector.textfile.directory=/var/lib/node_exporter --collector.filesystem --collector.filesystem.ignored-fs-types='^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fuse..*|fusectl|hugetlbfs|iso9660|mqueue|nfs|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|tmpfs)$' --collector.filesystem.ignored-mount-points='^/(dev|proc|run|sys)($|/)' --web.listen-address=0.0.0.0:9100
Are you running node_exporter in Docker?
No
What did you do that produced an error?
On a ThunderX2 machine, run node_exporter and top. Then run curl localhost:9100/metrics and watch for node_exporter in top.
The output of lscpu on this host is:
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 4
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: Cavium
Model: 2
Model name: ThunderX2 99xx
Stepping: 0x1
CPU max MHz: 2500.0000
CPU min MHz: 1000.0000
BogoMIPS: 400.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 32768K
NUMA node0 CPU(s): 0-127
NUMA node1 CPU(s): 128-255
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm
What did you expect to see?
%CPU for node_exporter is a few percent.
What did you see instead?
CPU usage goes up to more than 1200%.
The cpu (-freq) collector(s) parallelize a lot, which can cause a sudden burst of threads accessing related kernel functionality pretty much at the same time. Could you try running node_exporter without the cpu* collectors to see if the issue is in those?
Thanks for the response. I can confirm that the excessive CPU load is not seen if node_exporter is started with --no-collector.cpufreq. We are not actually using the cpufreq metrics, so disabling the collector is an appropriate solution for us.
A long time ago I recall suggesting not to do frequency querying for hyperthreaded cores. The frequency for those is always the same as for the physical core, so one could just as well copy the value from the related physical core. That would on some CPUs reduce the thread burden by 50%, and as such it would be a valid optimization. But I fear it is probably not quite a sufficient remedy.
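To illustrate the sibling-copy idea, here is a minimal sketch in Go, assuming the standard sysfs topology and cpufreq files; it is not the actual collector code, just the shape of the optimization: read scaling_cur_freq once per set of hyperthread siblings and reuse the value for the rest of the set.

```go
// Minimal sketch of the sibling-copy idea (illustrative, not collector code):
// read scaling_cur_freq only once per set of hyperthread siblings and reuse
// the value for the other CPUs in that set.
package main

import (
	"fmt"
	"os"
	"regexp"
	"strings"
)

func main() {
	cpuDir := regexp.MustCompile(`^cpu[0-9]+$`)
	freqByCPU := map[string]string{}
	firstOfSiblings := map[string]string{} // thread_siblings_list -> first CPU read

	entries, err := os.ReadDir("/sys/devices/system/cpu")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		name := e.Name()
		if !cpuDir.MatchString(name) {
			continue
		}
		base := "/sys/devices/system/cpu/" + name
		sib, err := os.ReadFile(base + "/topology/thread_siblings_list")
		if err != nil {
			continue
		}
		key := strings.TrimSpace(string(sib))
		if first, ok := firstOfSiblings[key]; ok {
			// A sibling of this CPU was already read; copy its value instead
			// of touching cpufreq sysfs again.
			freqByCPU[name] = freqByCPU[first]
			continue
		}
		freq, err := os.ReadFile(base + "/cpufreq/scaling_cur_freq")
		if err != nil {
			continue
		}
		freqByCPU[name] = strings.TrimSpace(string(freq))
		firstOfSiblings[key] = name
	}
	fmt.Printf("%d CPUs, %d actual cpufreq reads\n", len(freqByCPU), len(firstOfSiblings))
}
```

On a 4-way SMT machine like the ThunderX2 above this would cut the number of cpufreq reads to a quarter rather than a half, but as noted it only shrinks the burst rather than removing it.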
Perhaps limiting the thread count to something much lower than the physical core count is the way to go. Or the related kernel side functionality should be investigated to find out what exactly is so problematic when many threads try to read the frequency information at the same time.
Last but not least, I would like to emphasize that reading the frequency information with as many threads as possible is also problematic in the sense that on some hardware architectures it has a rather significant impact on the result, essentially an observer effect. Starting as many threads as there are cores just to read the frequency information "fast" will, on certain high-core-count architectures, immediately kick the cores out of any turbo frequencies they might have been using, rendering the information that is read rather useless. You'll never see those turbo frequencies reported by node_exporter, even if they are in fact being used. I could also imagine overall CPU frequencies going up (though not as far as turbo) during the period of reading the information, if spinlocks are getting congested, which could explain the perceived excessive CPU usage. For the latter I have no proof; the lack of turbo frequencies getting reported I know and have witnessed for sure on high core count machines.
@SuperQ please take note
The problem with limiting threads is it will cause the scrape to be slow and just defer the problem.
Remember, these are not OS threads, these are goroutines. If you want to limit the core time of the node_exporter, it's easier to use GOMAXPROCS or cgroup controls.
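As an aside, a minimal sketch of what capping GOMAXPROCS does, using only the standard library (nothing here is node_exporter code):

```go
// Minimal sketch: GOMAXPROCS caps how many OS threads execute Go code in
// parallel, no matter how many goroutines a collector spawns.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	prev := runtime.GOMAXPROCS(1) // same effect as GOMAXPROCS=1 in the environment
	fmt.Printf("GOMAXPROCS lowered from %d to 1\n", prev)

	var wg sync.WaitGroup
	for i := 0; i < 256; i++ { // 256 goroutines, but at most one runs Go code at a time
		wg.Add(1)
		go func() {
			defer wg.Done()
			// a collector would read its sysfs files here
		}()
	}
	wg.Wait()
}
```

One caveat: GOMAXPROCS bounds threads executing Go code, but goroutines blocked in syscalls (such as sysfs reads) can still occupy additional OS threads, which may be why it only reduces rather than eliminates the load reported later in this thread.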
I can verify that this can also be an issue for certain high core count Xeons running short-ish scrape intervals; the workaround is disabling the cpufreq collector.
I haven't seen any issues with low core count nodes.
Can y'all confirm this happens with both the cpu and the cpufreq collector?
Given that more people report this now, we need to do something about this. I suspect there is some kernel issue that we're triggering. We're saying:
// Execute the parsing of each CPU in parallel.
// This is done because the kernel intentionally delays access to each CPU by
// 50 milliseconds to avoid DDoSing possibly expensive functions.
- https://github.com/prometheus/procfs/blob/master/sysfs/system_cpu.go#L164

Have we confirmed that this is the case? Maybe the kernel delays it precisely because rapidly querying this endpoint results in the problems we're seeing.
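For context, a simplified sketch of the fan-out that comment describes, with one goroutine per CPU reading its own cpufreq sysfs file so any per-read kernel delay is paid concurrently rather than serially (illustrative only, not the procfs source):

```go
// Simplified sketch of per-CPU parallel cpufreq reads (not the procfs code).
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

func main() {
	paths, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")
	if err != nil {
		panic(err)
	}
	results := make([]string, len(paths))
	var wg sync.WaitGroup
	for i, p := range paths {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			b, err := os.ReadFile(p) // each read may block in the kernel
			if err == nil {
				results[i] = string(b)
			}
		}(i, p)
	}
	wg.Wait()
	fmt.Printf("read %d CPUs in parallel\n", len(paths))
}
```

On a 256-CPU host like the one in this report, that means up to 256 concurrent readers hitting the cpufreq code paths within a single scrape.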
IIRC, @rtreffer found the explicit 50ms delay in the kernel code.
Can y'all confirm this happens with both the cpu and the cpufreq collector?
For us, disabling the cpufreq collector was sufficient to bring the load down, and the nagging from the fellow maintaining one particular cluster ceased with that. We do still run the cpu collector.
Thanks, I'm guessing this would be solved with GOMAXPROCS=1 as well.
@uniemimu Can you try setting GOMAXPROCS=1, enable the cpufreq collector and see if you still see the issue?
For reference, the original issue about cpufreq taking too long is here: https://github.com/prometheus/node_exporter/issues/1086
@uniemimu Can you try setting GOMAXPROCS=1, enable the cpufreq collector and see if you still see the issue?
It is better, and doesn't get out of control anymore on a 96 core machine. However, on an idling machine you can still easily tell that cpufreq is running: the 1m load went from about 0.5 to 1.5. By comparison, with the default GOMAXPROCS the system becomes rather unstable and logging in over ssh is very sluggish, with the user quickly going for "sudo kill -9 $(pidof node_exporter)". Load averages are in the hundreds then.
Any chance we could get access to one of these systems to do some tracing? Unfortunately, I have no such hardware to test with.
Any chance we could get access to one of these systems to do some tracing? Unfortunately, I have no such hardware to test with.
From my side I can't unfortunately give access to these. But I did a perf-trace:
 42.43%  swapper
    99.72%  [kernel.kallsyms]
 18.80%  node_exporter
    82.58%  [kernel.kallsyms]
    17.23%  node_exporter
 14.47%  kubelet
    62.91%  kubelet
    36.19%  [kernel.kallsyms]
  6.66%  containerd-shim
    61.98%  [kernel.kallsyms]
    36.74%  containerd-shim-runc-v2
     1.27%  [vdso]
  3.37%  containerd
    68.11%  containerd
    31.02%  [kernel.kallsyms]
  1.20%  alertmanager
    55.03%  alertmanager
    43.07%  [kernel.kallsyms]
     1.21%  [vdso]
  1.06%  weaver
    58.93%  weaver
    37.85%  [kernel.kallsyms]
     2.67%  [unknown]
I only allowed it to run for a while with the default GOMAXPROCS; system load had climbed to around 10 when I started tracing, and it would have climbed into the hundreds within an hour. The issue is obviously on the kernel side; where exactly is another question. I was running the 5.8.0-41-generic kernel during the trace, but 5.9.0 is no better.
What would be useful is a Go pprof CPU profile.
- Start node_exporter (on any port you want)
- Run: go tool pprof -seconds 60 "http://${target}:${port}/debug/pprof/profile"
- Scrape the test node_exporter.
- Save the profile (i.e. ${HOME}/pprof/pprof.node_exporter.samples.cpu.001.pb.gz)
The pprof tool will gather data for as long as you specify with -seconds.
Meanwhile I took a look at a kernel.map which I happened to find, and the thing node_exporter is reaching for in the kernel points to osq_lock. That makes perfect sense to me; I was expecting it to be fighting over a (spin)lock down there for one reason or another.
@SuperQ With more than 4/5 of the node_exporter CPU cycles going to kernel per-CPU spinlock spinning on Xeon, I think strace is enough to debug the issue, and it can be done on any machine.
See if there are any syscalls that might require kernel-internal spinlocks (with just the cpufreq collector enabled), and reduce the number of those that scale up with the number of host CPUs.
Looking at the kernel osq_lock history: https://github.com/torvalds/linux/commits/master/kernel/locking/osq_lock.c There haven't been any recent performance changes, so kernel version doesn't matter as long as it's from 2018 or later.
Alright, I guess we need to trace which exact thing is causing the spinlock race, and put some kind of concurrency wrapper around it.
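One possible shape for such a wrapper, as a hedged sketch (the limit of 4 is made up for illustration, and this is not the procfs code): a counting semaphore so only a few goroutines touch the cpufreq sysfs files at once instead of one per CPU.

```go
// Sketch of a bounded-concurrency wrapper around the per-CPU sysfs reads.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

const maxInFlight = 4 // illustrative limit, far below the host CPU count

func main() {
	paths, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")
	if err != nil {
		panic(err)
	}
	sem := make(chan struct{}, maxInFlight) // counting semaphore

	var wg sync.WaitGroup
	for _, p := range paths {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			_, _ = os.ReadFile(p)    // at most maxInFlight reads race in the kernel
		}(p)
	}
	wg.Wait()
	fmt.Printf("read %d cpufreq entries with at most %d concurrent readers\n", len(paths), maxInFlight)
}
```

Whether a limit like this just defers the problem, as noted earlier in the thread, would need measuring; it trades spinlock contention for a longer scrape.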
So the core tension here seems to be that the kernel is intentionally trying to rate-limit how often userspace can read this, for good reason, versus node_exporter not wanting to block. But isn't explicitly using just one asynchronous goroutine here sufficient?
@cgwalters That's still a bit unclear. We don't think it can be explained in this case by intentional throttling alone, but nobody has had time yet to dive deeper into the issue. We tried with GOMAXPROCS=1, but using a single goroutine might be worth a try.
I have published node_exporter v1.5.0 which now defaults GOMAXPROCS to 1.
We have also converted a couple of the collectors from reading files to netlink. This should improve the performance greatly.
Although for one of these, you will need to use the flag --collector.netclass.netlink to enable it.
I've had enough issues on my large physical hosts that I'm thinking cpufreq should be disabled by default (do we have any sense of how widely used those metrics are?).
A sample:
48 CPUs (2 sockets, 12 cores per socket, 2 threads per core). node_exporter 1.3.1 with GOMAXPROCS=1 set via a systemd override (I can share how much worse it looked before setting that).
Those queries are
avg_over_time(node_load1{instance="$instance"}[$__interval])
and
rate(process_cpu_seconds_total{instance="$instance", job="node"}[$__rate_interval])
You can see when I disabled the cpufreq collector on the 26th. All the little load spikes went away (there are still some spikes from real work). A nice drop in the CPU utilization as well.
Installing 1.5.0 today yielded even better results with the CPU. I briefly re-enabled cpufreq on 1.5.0 - still an issue.
cpufreq is probably only an issue on these sorts of large hosts but it seems common enough on those that it should not be enabled by default.
This looks to be an issue on about 20% of my hosts - i.e., all of the large ones.
It would be useful to get a pprof CPU profile when the problem is happening.
Load average is not really a useful metric, as it doesn't really reflect any real world problem.
Load average is not really a useful metric, as it doesn't really reflect any real world problem.
Agreed, but it's what got my users complaining. The CPU utilization is real (not huge, but real).
Let me scare up a suitable test host and get that pprof profile.
pprof.node_exporter.samples.cpu.001.pb.gz
This is from the same machine that's shown in the graph. node_exporter 1.5.0, only --web.listen-address & --collector.textfile.directory specified.
It was just a single scrape - let me know if you want one with repeated scrapes (my poll interval is 10s).
Hmm, that pprof file is only showing 80ms of CPU time over 60s. Even if it's just one scrape over 10s, that's only 0.8% average CPU use. I don't see much we can improve on there. Most of the time is spent in syscalls, which we don't have a lot of control over, that's up to the kernel.
Could there be a cumulative issue that shows up after repeated scrapes? You can see from the graph that CPU utilization is cut by more than half when cpufreq was disabled in production, which seems significant.
It'll also be 2 scrapes in 10 seconds, since there are two pollers.
Could there be a cumulative issue that shows up after repeated scrapes
Not likely. This code is very simple and basically no different to other collectors.
The only issue that I can see, which is specific to cpufreq, is that there is something wrong with the kernel-side access to the data.