Bug in OpenBSD CPU stats - Metrics appear to be only ~1/10th of the actual values

Open paketb0te opened this issue 2 years ago • 4 comments

Host operating system: output of uname -a

OpenBSD foo 7.3 GENERIC.MP#4 i386

node_exporter version: output of node_exporter --version

foo# node_exporter --version                                                   
node_exporter, version 1.5.0 (branch: non-git, revision: non-git)
  build user:       openbsd_ports
  build date:       2023-03-24
  go version:       go1.20.1
  platform:         openbsd/386

node_exporter command line flags

--web.listen-address=10.0.2.15:9100 --collector.textfile.directory=/tmp/textfile_metrics/

node_exporter log output

n/a

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

Run node_exporter as a daemon on OpenBSD

What did you expect to see?

Correct CPU stats in Prometheus

What did you see instead?

Incorrect (from my understanding) CPU stats :)

We found that the expression sum by (cpu) (rate(node_cpu_seconds_total{instance="foo"}[1m])) returns values around 0.1 instead of 1 (I am aware that this does not necessarily sum to exactly 1, but it is usually pretty close).

So it looks like we are off by a factor of around 10 :thinking:

Looking into cpu_openbsd.go, I found that the metrics are calculated by using sysctl kern.cp_time / sysctl kern.cp_time2 to get the number of ticks spent in each mode (at least that is what I understood from OpenBSD's sysctl manpage HERE), and then dividing that number by the clock rate (the number of ticks per second). That seems correct to me. I am not sure about the difference between the "hard clock" and the "statistics clock" mentioned HERE, but they are not different enough to explain the observed factor of 10.

So, given a clockrate of 100 hz (100 ticks per second), I would assume that the metrics are each just 1/100th of the values returned by sysctl kern.cp_time.
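The conversion described above can be sketched in a few lines of Go. This is only an illustration of the arithmetic, not the code in cpu_openbsd.go; the helper name ticksToSeconds and the tick counts are hypothetical.

```go
package main

import "fmt"

// ticksToSeconds converts a tick counter into seconds, given the number of
// ticks per second. Hypothetical helper, not taken from cpu_openbsd.go.
func ticksToSeconds(ticks uint64, ticksPerSecond float64) float64 {
	return float64(ticks) / ticksPerSecond
}

func main() {
	const hz = 100.0 // assumed clock rate: 100 ticks per second

	// Made-up tick counts, only to show the expected scaling of 1/100.
	for _, ticks := range []uint64{250, 6000} {
		fmt.Printf("%d ticks at %.0f Hz = %.2f s\n", ticks, hz, ticksToSeconds(ticks, hz))
	}
}
```

With a divisor of 100, one CPU that accumulates 6000 ticks in a minute reports 60 seconds, so the per-CPU rate sums to about 1.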

BUT when directly comparing the values returned from sysctl kern.cp_time with those returned by the exporter, we see they are more like 1/1000th (judging by how the numbers below line up with the exporter's modes, sysctl kern.cp_time returns the values in the order: user, nice, system, spin, interrupt, idle, see HERE):

foo# sysctl kern.clockrate                                                     
kern.clockrate=tick = 10000, hz = 100, profhz = 1024, stathz = 128
foo# 
foo# sysctl kern.cp_time && curl -s http://10.0.2.15:9100/metrics | grep -i cpu_seconds                                  
kern.cp_time=2391,0,1987,60,117,976656
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 953.765625
node_cpu_seconds_total{cpu="0",mode="interrupt"} 0.1142578125
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="spin"} 0.05859375
node_cpu_seconds_total{cpu="0",mode="system"} 1.94140625
node_cpu_seconds_total{cpu="0",mode="user"} 2.3349609375

Dividing the values returned from sysctl kern.cp_time by the metric values gives us 1024 :thinking: So to me it appears that somehow we get the wrong value as the clock rate, but I have not been able to figure out where or how exactly that happens. Maybe the return values from unix.SysctlRaw("kern.clockrate") get mapped to the wrong fields of the clockinfo struct?
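As a rough check of that ratio, the following Go sketch divides the tick counts from the sysctl output by the exporter values and looks up which kern.clockrate field is closest. The pairing of ticks to modes assumes the order user, nice, system, spin, interrupt, idle, which is what the numbers suggest. The system mode is left out because its two readings appear to be one tick apart, and nice is left out because it is zero. All type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// sample pairs a kern.cp_time tick count with the exporter value for the
// same mode (pairing is an assumption based on how the numbers line up).
type sample struct {
	mode    string
	ticks   float64
	seconds float64
}

// impliedRate returns ticks/seconds, i.e. the divisor that would turn the
// tick count into the exported value.
func impliedRate(s sample) float64 { return s.ticks / s.seconds }

// closestField returns the name of the clock rate field nearest to rate.
func closestField(rate float64, fields map[string]float64) string {
	best, bestDiff := "", math.Inf(1)
	for name, v := range fields {
		if d := math.Abs(v - rate); d < bestDiff {
			best, bestDiff = name, d
		}
	}
	return best
}

func main() {
	// Frequencies from the kern.clockrate output in this report.
	clockrate := map[string]float64{"hz": 100, "stathz": 128, "profhz": 1024}

	samples := []sample{
		{"user", 2391, 2.3349609375},
		{"spin", 60, 0.05859375},
		{"interrupt", 117, 0.1142578125},
		{"idle", 976656, 953.765625},
	}
	for _, s := range samples {
		r := impliedRate(s)
		fmt.Printf("%-9s implied rate %.1f, closest field: %s\n", s.mode, r, closestField(r, clockrate))
	}
}
```

Every pair works out to 1024, the same number that kern.clockrate reports as profhz. If ticks actually accrue at about 100 per second but are divided by 1024, the per-CPU rate would come out near 100/1024 ≈ 0.098, which would be consistent with the ~0.1 seen in Prometheus. This does not show where the wrong divisor comes from.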

I hope I included enough information for troubleshooting by someone more knowledgeable in golang, please let me know if I can provide any further useful info or assist in any way.

Feb 19 '24 12:02 paketb0te

I just tested this against a "normal" install of OpenBSD (the i386 .iso from the official sources) instead of the self-compiled image where this behaviour was observed - and the bug was not present there!

So we'll investigate the build steps of our custom image.

Feb 21 '24 13:02 paketb0te

Hi, I see that you are using node_exporter on OpenBSD. Do you have any experience configuring TLS encryption for it, and if so, how is it done?

Jun 26 '24 10:06 manja-80

@manja-80 no, unfortunately not. Maybe THIS helps?

Jun 26 '24 15:06 paketb0te

Thanks for the reply. I saw it, but I'm still figuring out how to configure it properly :)

Jun 27 '24 07:06 manja-80

@paketb0te Did you find anything out? Otherwise let's close this for now.

Jul 14 '24 11:07 discordianfish

@discordianfish I haven't gotten around to diving deeper into the issue, so I'm closing it.

Jul 14 '24 18:07 paketb0te

I was on vacation, so I didn't have time to check or reply :)

Jul 16 '24 10:07 manja-80