illumos/solaris CPU usage is reported in ticks, not seconds

Open davepacheco opened this issue 5 years ago • 1 comment

Host operating system: output of uname -a

$ uname -a
SunOS lennier 5.11 omnios-r151034-0d278a0cc5 i86pc i386 i86pc

node_exporter version: output of node_exporter --version

$ ./node_exporter --version
node_exporter, version 1.0.1 (branch: master, revision: d8a1585f59ef1169837d08979ecc92dcea8aa58a)
  build user:       dap@lennier
  build date:       20200904-20:16:54
  go version:       go1.14.7

node_exporter command line flags

No command-line flags passed (node_exporter)

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

Viewed stat node_cpu_seconds_total.

What did you expect to see?

I expected to see the total number of seconds of idle time for this CPU since boot.

What did you see instead?

I saw the total number of idle ticks for this CPU since boot.


It's easier to look at all the data in one place:

# curl -s localhost:9100/metrics | grep cpu.*idle; kstat -p -m cpu -i 0 -n sys | grep cpu.*idle; kstat | grep nsec_per_tick
node_cpu_seconds_total{cpu="0",mode="idle"} 8.238178e+06
node_cpu_seconds_total{cpu="1",mode="idle"} 8.344892e+06
cpu:0:sys:cpu_nsec_idle 8238179276443
cpu:0:sys:cpu_ticks_idle        8238179
cpu:0:sys:idlethread    3961542
        nsec_per_tick                   1000000

What we see in this snippet is that:

  • node_exporter is reporting 8238178 for "node_cpu_seconds_total" with cpu="0" and mode="idle". This stat is documented to be measured in seconds.
  • According to the underlying kstats, the CPU has been idle for 8238179276443 nanoseconds, or 8238.179276443 seconds. The stat is off by a factor of 1,000.

Looking at the source, it's pretty clear why: https://github.com/prometheus/node_exporter/blob/d8a1585f59ef1169837d08979ecc92dcea8aa58a/collector/cpu_solaris.go#L63-L66

It's pulling the "cpu_ticks_idle" kstat, which is measured in ticks. Ticks are related to seconds by "nsec_per_tick". The above output shows that nsec_per_tick is 1,000,000 on this system, so one tick is one millisecond and there are 1,000 ticks per second. That explains why our output is off by a factor of 1,000.

As far as I can tell, this has always been wrong in this way. My guess is that users don't see this if they're always graphing a ratio of the CPU time metrics (e.g., idle / sum_of_all_of_them), because the wrong unit cancels out of the ratio. You do see it if you're trying to calculate idle percent as 100 times the per-second rate of node_cpu_seconds_total{mode="idle"}, which should work when the metric really is in seconds.
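A small Go sketch of why the ratio hides the problem: scaling every mode by the same constant leaves idle / sum unchanged. Only the idle tick count below comes from this issue. The kernel, user, and wait values are made up for illustration, and `idleFraction` and `scale` are hypothetical helpers.

```go
package main

import "fmt"

// idleFraction returns idle / (sum of all modes). The unit of the inputs
// cancels out, so ticks and seconds give the same answer.
func idleFraction(byMode map[string]float64) float64 {
	var total float64
	for _, v := range byMode {
		total += v
	}
	if total == 0 {
		return 0
	}
	return byMode["idle"] / total
}

// scale multiplies every value by factor, e.g. 1e-3 to turn
// millisecond ticks into seconds.
func scale(byMode map[string]float64, factor float64) map[string]float64 {
	out := make(map[string]float64, len(byMode))
	for k, v := range byMode {
		out[k] = v * factor
	}
	return out
}

func main() {
	// "idle" is cpu_ticks_idle from the issue; the other values are
	// invented placeholders, not measurements.
	ticks := map[string]float64{
		"idle":   8238179,
		"kernel": 40000,
		"user":   60000,
		"wait":   0,
	}
	seconds := scale(ticks, 1e-3)

	fmt.Printf("idle fraction from ticks:   %.6f\n", idleFraction(ticks))
	fmt.Printf("idle fraction from seconds: %.6f\n", idleFraction(seconds))
}
```

The two printed fractions agree up to floating-point rounding, so a dashboard built on such ratios would look correct even though the absolute values are in the wrong unit.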

The straightforward solution would be to use the cpu_nsec_{idle,kernel,user,wait} kstats instead of the cpu_ticks_{idle,kernel,user,wait} kstats. I don't know if we'd be worried about this being a breaking change.
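As a rough sketch of that approach (not the actual collector code, and not using the real kstat bindings), the conversion could look like the following. The map stands in for whatever the kstat lookup returns, `cpuSecondsFromNsec` is a hypothetical name, and the statistic names are the ones listed in this issue.

```go
package main

import "fmt"

// cpuSecondsFromNsec converts cpu_nsec_<mode> kstat values (nanoseconds)
// into seconds keyed by mode. The input map is a stand-in for a real
// kstat lookup; this is an assumption made for illustration.
func cpuSecondsFromNsec(kstats map[string]uint64) map[string]float64 {
	// Mode names as listed in the issue text.
	modes := []string{"idle", "kernel", "user", "wait"}

	out := make(map[string]float64, len(modes))
	for _, mode := range modes {
		nsec, ok := kstats["cpu_nsec_"+mode]
		if !ok {
			continue // skip statistics that are not present
		}
		out[mode] = float64(nsec) / 1e9
	}
	return out
}

func main() {
	// Only cpu_nsec_idle is known from the output quoted in this issue.
	kstats := map[string]uint64{
		"cpu_nsec_idle": 8238179276443,
	}
	for mode, secs := range cpuSecondsFromNsec(kstats) {
		fmt.Printf("node_cpu_seconds_total{cpu=\"0\",mode=%q} %.3f\n", mode, secs)
	}
}
```

Reading the nanosecond counters directly avoids depending on nsec_per_tick at all. The alternative of keeping the tick counters and multiplying by nsec_per_tick would give the same unit but at tick resolution.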

CC @dsnt02518 (because you seem to be doing related work in #1803), @jpds (maybe I've misunderstood something here?)

davepacheco, Sep 05 '20 00:09

I've opened up a PR purely based on David's research above (and a bit of mine), which should address this bug.

rexagod, Mar 19 '24 14:03