node_exporter icon indicating copy to clipboard operation
node_exporter copied to clipboard

timex status bitmap fields

Open robbat2 opened this issue 1 year ago • 12 comments

This is sort of followup to https://github.com/prometheus/node_exporter/issues/2602

node_timex_status is a bitfield from adjtimex. Unfortunately Prometheus does not have bitmap math operations, so it's hard to check for specific bits being set in the metric.

As a start, I'd like it to be easier to detect STA_UNSYNC 0x0040 being set.

I don't know if there is prior art in node_exporter about the best way to represent bitfields. Naively it might be exploding the single metric to unique metrics, one per flag.

Host operating system: output of uname -a

Linux (REDACTED) 5.19.0-43-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon May 22 13:39:36 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

What did you expect to see?

metrics for each flag

What did you see instead?

a single metric of a bitfield packed into an integer.

I would be happy to help implement this if a design is chosen.

robbat2 avatar Jun 23 '23 00:06 robbat2

The usual way to translate bitfields is to map each bit to a status string, and then have a series of bool values

Something like

node_timex_status{status="STA_UNSYNC"} 1

Since this is a bit of a breaking change, we would need to add it with a flag like --collector.timex.map-status-bits.

SuperQ avatar Jun 23 '23 07:06 SuperQ

Could we make it a non-breaking change by using a new name instead?

node_timex_status 8193
# HELP node_timex_status_flags Linux kernel timex.status bits, see adjtimex(2)
# TYPE node_timex_status_flags gauge
node_timex_status_flags{flag="STA_PLL"} 1
node_timex_status_flags{flag="STA_PPSFREQ"} 0
node_timex_status_flags{flag="STA_PPSTIME"} 0
node_timex_status_flags{flag="STA_FLL"} 0
node_timex_status_flags{flag="STA_INS"} 0
node_timex_status_flags{flag="STA_DEL"} 0
node_timex_status_flags{flag="STA_UNSYNC"} 0
node_timex_status_flags{flag="STA_FREQHOLD"} 0
node_timex_status_flags{flag="STA_PPSSIGNAL"} 0
node_timex_status_flags{flag="STA_PPSJITTER"} 0
node_timex_status_flags{flag="STA_PPSWANDER"} 0
node_timex_status_flags{flag="STA_PPSERROR"} 0
node_timex_status_flags{flag="STA_CLOCKERR"} 0
node_timex_status_flags{flag="STA_NANO"} 1
node_timex_status_flags{flag="STA_MODE"} 0
node_timex_status_flags{flag="STA_CLK"} 0

robbat2 avatar Jun 23 '23 21:06 robbat2

Sure, that works. Since this is a default enabled collector, perhaps we should still add a flag to either leave this off by default, or limit which status flags are exposed. This would limit the cardinality blowup.

SuperQ avatar Jun 24 '23 04:06 SuperQ

The flags haven't changed since 2008, and the field is full, it would have to be another variable in the timex struct.

I implemented the above as a test with recording rules and a bit of math, and it's working as expected, with the 16 extra metrics. We could trim the cardinality by only emitting the non-zero bits (drop the bool in my snippet below):

- name: node_timex_status_experiment
  rules:
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0001*2)) - (node_timex_status % (0x0001))) > bool 0
      labels:
        flag: "STA_PLL"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0002*2)) - (node_timex_status % (0x0002))) > bool 0
      labels:
        flag: "STA_PPSFREQ"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0004*2)) - (node_timex_status % (0x0004))) > bool 0
      labels:
        flag: "STA_PPSTIME"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0008*2)) - (node_timex_status % (0x0008))) > bool 0
      labels:
        flag: "STA_FLL"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0010*2)) - (node_timex_status % (0x0010))) > bool 0
      labels:
        flag: "STA_INS"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0020*2)) - (node_timex_status % (0x0020))) > bool 0
      labels:
        flag: "STA_DEL"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0040*2)) - (node_timex_status % (0x0040))) > bool 0
      labels:
        flag: "STA_UNSYNC"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0080*2)) - (node_timex_status % (0x0080))) > bool 0
      labels:
        flag: "STA_FREQHOLD"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0100*2)) - (node_timex_status % (0x0100))) > bool 0
      labels:
        flag: "STA_PPSSIGNAL"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0200*2)) - (node_timex_status % (0x0200))) > bool 0
      labels:
        flag: "STA_PPSJITTER"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0400*2)) - (node_timex_status % (0x0400))) > bool 0
      labels:
        flag: "STA_PPSWANDER"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x0800*2)) - (node_timex_status % (0x0800))) > bool 0
      labels:
        flag: "STA_PPSERROR"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x1000*2)) - (node_timex_status % (0x1000))) > bool 0
      labels:
        flag: "STA_CLOCKERR"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x2000*2)) - (node_timex_status % (0x2000))) > bool 0
      labels:
        flag: "STA_NANO"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x4000*2)) - (node_timex_status % (0x4000))) > bool 0
      labels:
        flag: "STA_MODE"
    - record: node_timex_status_flag
      expr: ((node_timex_status % (0x8000*2)) - (node_timex_status % (0x8000))) > bool 0
      labels:
        flag: "STA_CLK"

robbat2 avatar Jun 24 '23 16:06 robbat2

Should we have bitmap opérations in PromQL?

roidelapluie avatar Jun 24 '23 17:06 roidelapluie

@roidelapluie That would be neat..

discordianfish avatar Jun 25 '23 13:06 discordianfish

Should we have bitmap opérations in PromQL?

Care to elaborate @roidelapluie if that is really being discussed?

I am likely dropping a brick here ... but Victoria Metrics did add this (https://github.com/VictoriaMetrics/VictoriaMetrics/commit/db1e62495be0435e2bcd75e77fe4925630679175) a while back and there a quite a few metrics (likely originating from some hardware registers) that are bitwise encoded.

frittentheke avatar Mar 27 '24 08:03 frittentheke

@frittentheke Nobody has submitted an issue or PR to add it. But PromQL does not currently take hex value syntax, only decimal, as metric values must be float64. I think it would be a nice addition to be able to parse hex numbers, and do bitwise operations. It could open the door to making handling of enum values easier.

SuperQ avatar Mar 28 '24 07:03 SuperQ

The metric from node_exporter is float64, and that that's fine here: node_timex_status 8193 PromQL itself permits hex numbers, so it's just adding bitwise ops

robbat2 avatar Mar 28 '24 17:03 robbat2

PromQL itself permits hex numbers, so it's just adding bitwise ops

@robbat2 Are you certain of this? Maybe you thought of the string literals (https://prometheus.io/docs/prometheus/latest/querying/basics/#string-literals)?

frittentheke avatar Jul 21 '24 20:07 frittentheke

https://prometheus.io/docs/prometheus/latest/querying/basics/#float-literals really clearly shows hex numbers: 0[xX][0-9a-fA-F]+

robbat2 avatar Jul 21 '24 22:07 robbat2

https://prometheus.io/docs/prometheus/latest/querying/basics/#float-literals really clearly shows hex numbers: 0[xX][0-9a-fA-F]+

Yeah, somehow I did only search for "hex" and was done with that : 🙈

@ Did you see my issue about implementing bitwise operations for PromQL? https://github.com/prometheus/prometheus/issues/14493.

Would be really a cool addition.

frittentheke avatar Jul 22 '24 06:07 frittentheke