CPU collector keeps reporting metrics of offline CPUs

Open · raptorsun opened this issue on Feb 17, 2023 · 0 comments

Host operating system: output of uname -a

Linux debianvm 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.5.0 (branch: master, revision: 7f05e312acd828f0c4ddcd194e275c05ffa35992)
  build user:       hsun@debianvm
  build date:       20230217-16:48:57
  go version:       go1.20.1
  platform:         linux/amd64

node_exporter command line flags

None.

node_exporter log output

ts=2023-02-17T16:53:10.662Z caller=node_exporter.go:180 level=info msg="Starting node_exporter" version="(version=1.5.0, branch=master, revision=7f05e312acd828f0c4ddcd194e275c05ffa35992)"
ts=2023-02-17T16:53:10.663Z caller=node_exporter.go:181 level=info msg="Build context" build_context="(go=go1.20.1, platform=linux/amd64, user=hsun@debianvm, date=20230217-16:48:57)"
ts=2023-02-17T16:53:10.663Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
ts=2023-02-17T16:53:10.664Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2023-02-17T16:53:10.665Z caller=diskstats_common.go:111 level=info collector=diskstats msg="Parsed flag --collector.diskstats.device-exclude" flag=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
ts=2023-02-17T16:53:10.665Z caller=node_exporter.go:110 level=info msg="Enabled collectors"
ts=2023-02-17T16:53:10.665Z caller=node_exporter.go:117 level=info collector=arp
ts=2023-02-17T16:53:10.665Z caller=node_exporter.go:117 level=info collector=bcache
ts=2023-02-17T16:53:10.665Z caller=node_exporter.go:117 level=info collector=bonding
ts=2023-02-17T16:53:10.665Z caller=node_exporter.go:117 level=info collector=btrfs
ts=2023-02-17T16:53:10.665Z caller=node_exporter.go:117 level=info collector=conntrack
ts=2023-02-17T16:53:10.665Z caller=node_exporter.go:117 level=info collector=cpu
ts=2023-02-17T16:53:10.665Z caller=node_exporter.go:117 level=info collector=cpufreq
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=diskstats
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=dmi
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=edac
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=entropy
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=fibrechannel
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=filefd
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=filesystem
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=hwmon
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=infiniband
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=ipvs
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=loadavg
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=mdadm
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=meminfo
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=netclass
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=netdev
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=netstat
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=nfs
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=nfsd
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=nvme
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=os
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=powersupplyclass
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=pressure
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=rapl
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=schedstat
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=selinux
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=sockstat
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=softnet
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=stat
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=tapestats
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=textfile
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=thermal_zone
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=time
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=timex
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=udp_queues
ts=2023-02-17T16:53:10.666Z caller=node_exporter.go:117 level=info collector=uname
ts=2023-02-17T16:53:10.667Z caller=node_exporter.go:117 level=info collector=vmstat
ts=2023-02-17T16:53:10.667Z caller=node_exporter.go:117 level=info collector=xfs
ts=2023-02-17T16:53:10.667Z caller=node_exporter.go:117 level=info collector=zfs
ts=2023-02-17T16:53:10.667Z caller=tls_config.go:232 level=info msg="Listening on" address=[::]:9100
ts=2023-02-17T16:53:10.667Z caller=tls_config.go:235 level=info msg="TLS is disabled." http2=false address=[::]:9100

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

  1. Start Node Exporter.
  2. Take several CPUs offline (hyperthreading/SMT changes; one way to do this is sketched below).
  3. Metrics from the offline CPUs persist.
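
For reference, a CPU can be taken offline through the kernel hotplug interface at /sys/devices/system/cpu/cpuN/online (requires root; cpu0 usually cannot be offlined). A minimal Go sketch of that step, used here only to reproduce the issue:

package main

import (
	"fmt"
	"os"
)

func main() {
	// Writing "0" takes CPU 5 offline; writing "1" brings it back online.
	// Requires root privileges.
	const path = "/sys/devices/system/cpu/cpu5/online"
	if err := os.WriteFile(path, []byte("0"), 0644); err != nil {
		fmt.Fprintln(os.Stderr, "failed to offline cpu5:", err)
		os.Exit(1)
	}
	fmt.Println("cpu5 is now offline")
}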

What did you expect to see?

Metrics from offline CPUs should disappear.

What did you see instead?

Metrics from offline CPUs persist.
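
As shown in the /proc/stat dumps under "More Details" below, the cpu5 line vanishes from /proc/stat as soon as the CPU goes offline, so the stale values most likely come from state kept inside the CPU collector itself. As far as I can tell, the collector caches the last per-CPU stats it has seen (to smooth over counters that occasionally jump backwards in /proc/stat), and cache entries for CPUs that disappear are never dropped. A minimal sketch of that pattern, with purely illustrative names (this is not node_exporter's actual code):

// Illustrative only: a per-CPU cache that is updated but never pruned.
package main

import "fmt"

// cpuStat holds the per-mode counters parsed from one cpuN line of /proc/stat.
type cpuStat struct {
	User, Nice, System, Idle float64
}

// collector remembers the last stats seen for each CPU id.
type collector struct {
	cache map[int]cpuStat
}

// update merges a fresh /proc/stat snapshot into the cache. CPUs missing from
// the snapshot (e.g. taken offline) keep their old entry, so their series are
// still exported with frozen values on every scrape.
func (c *collector) update(snapshot map[int]cpuStat) {
	for id, stat := range snapshot {
		c.cache[id] = stat
	}
	// Nothing removes entries for CPUs absent from the snapshot.
}

func main() {
	c := &collector{cache: map[int]cpuStat{}}
	c.update(map[int]cpuStat{0: {Idle: 100}, 5: {Idle: 50}}) // both CPUs online
	c.update(map[int]cpuStat{0: {Idle: 200}})                // cpu5 went offline
	fmt.Println(c.cache)                                     // cpu5 still present
}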

More Details

node_cpu_seconds_total metrics before CPU 5 went offline.

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12295.95
node_cpu_seconds_total{cpu="0",mode="iowait"} 11.01
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 30.25
node_cpu_seconds_total{cpu="0",mode="softirq"} 26.99
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 256.7
node_cpu_seconds_total{cpu="0",mode="user"} 173.35
node_cpu_seconds_total{cpu="1",mode="idle"} 12099.1
node_cpu_seconds_total{cpu="1",mode="iowait"} 4.24
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 94.82
node_cpu_seconds_total{cpu="1",mode="softirq"} 28.62
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 280.99
node_cpu_seconds_total{cpu="1",mode="user"} 193.84
node_cpu_seconds_total{cpu="2",mode="idle"} 12168.34
node_cpu_seconds_total{cpu="2",mode="iowait"} 5.68
node_cpu_seconds_total{cpu="2",mode="irq"} 0
node_cpu_seconds_total{cpu="2",mode="nice"} 67
node_cpu_seconds_total{cpu="2",mode="softirq"} 25.54
node_cpu_seconds_total{cpu="2",mode="steal"} 0
node_cpu_seconds_total{cpu="2",mode="system"} 275.96
node_cpu_seconds_total{cpu="2",mode="user"} 193.49
node_cpu_seconds_total{cpu="3",mode="idle"} 12237.18
node_cpu_seconds_total{cpu="3",mode="iowait"} 2.33
node_cpu_seconds_total{cpu="3",mode="irq"} 0
node_cpu_seconds_total{cpu="3",mode="nice"} 40.56
node_cpu_seconds_total{cpu="3",mode="softirq"} 37.14
node_cpu_seconds_total{cpu="3",mode="steal"} 0
node_cpu_seconds_total{cpu="3",mode="system"} 246.59
node_cpu_seconds_total{cpu="3",mode="user"} 173.91
node_cpu_seconds_total{cpu="4",mode="idle"} 12184.39
node_cpu_seconds_total{cpu="4",mode="iowait"} 5.22
node_cpu_seconds_total{cpu="4",mode="irq"} 0
node_cpu_seconds_total{cpu="4",mode="nice"} 46.28
node_cpu_seconds_total{cpu="4",mode="softirq"} 26.59
node_cpu_seconds_total{cpu="4",mode="steal"} 0
node_cpu_seconds_total{cpu="4",mode="system"} 279.66
node_cpu_seconds_total{cpu="4",mode="user"} 188.91
node_cpu_seconds_total{cpu="5",mode="idle"} 1318.18
node_cpu_seconds_total{cpu="5",mode="iowait"} 0.26
node_cpu_seconds_total{cpu="5",mode="irq"} 0
node_cpu_seconds_total{cpu="5",mode="nice"} 0.01
node_cpu_seconds_total{cpu="5",mode="softirq"} 51.1
node_cpu_seconds_total{cpu="5",mode="steal"} 0
node_cpu_seconds_total{cpu="5",mode="system"} 235.27
node_cpu_seconds_total{cpu="5",mode="user"} 66.04

/proc/stat before CPU 5 went offline.

cpu  99263 27893 157903 6283804 2883 0 19624 0 0 0
cpu0 17396 3025 25736 1238523 1106 0 2704 0 0 0
cpu1 19434 9482 28148 1218751 425 0 2872 0 0 0
cpu2 19410 6700 27672 1225730 568 0 2557 0 0 0
cpu3 17441 4056 24731 1232606 234 0 3714 0 0 0
cpu4 18935 4628 28022 1227396 522 0 2662 0 0 0
cpu5 6644 1 23591 140796 26 0 5113 0 0 0
intr 6330554 36 4281 0 0 0 0 0 0 0 0 0 0 7776 0 0 12083 0 0 100661 24266 140115 37165 56 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 10356386
btime 1676640236
processes 29007
procs_running 1
procs_blocked 0
softirq 2398665 11 525797 2354 24836 44301 2 4406 816272 0 980686

node_cpu_seconds_total metrics after CPU 5 went offline. The cpu="5" series keep the same values every time the metrics endpoint is scraped.

node_cpu_seconds_total{cpu="0",mode="idle"} 12489.89
node_cpu_seconds_total{cpu="0",mode="iowait"} 11.1
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 30.25
node_cpu_seconds_total{cpu="0",mode="softirq"} 27.04
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 257.56
node_cpu_seconds_total{cpu="0",mode="user"} 174.16
node_cpu_seconds_total{cpu="1",mode="idle"} 12290.9
node_cpu_seconds_total{cpu="1",mode="iowait"} 4.3
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 94.82
node_cpu_seconds_total{cpu="1",mode="softirq"} 28.74
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 281.86
node_cpu_seconds_total{cpu="1",mode="user"} 194.54
node_cpu_seconds_total{cpu="2",mode="idle"} 12361.52
node_cpu_seconds_total{cpu="2",mode="iowait"} 5.68
node_cpu_seconds_total{cpu="2",mode="irq"} 0
node_cpu_seconds_total{cpu="2",mode="nice"} 67
node_cpu_seconds_total{cpu="2",mode="softirq"} 25.58
node_cpu_seconds_total{cpu="2",mode="steal"} 0
node_cpu_seconds_total{cpu="2",mode="system"} 276.98
node_cpu_seconds_total{cpu="2",mode="user"} 194.38
node_cpu_seconds_total{cpu="3",mode="idle"} 12430.5
node_cpu_seconds_total{cpu="3",mode="iowait"} 2.34
node_cpu_seconds_total{cpu="3",mode="irq"} 0
node_cpu_seconds_total{cpu="3",mode="nice"} 40.56
node_cpu_seconds_total{cpu="3",mode="softirq"} 37.14
node_cpu_seconds_total{cpu="3",mode="steal"} 0
node_cpu_seconds_total{cpu="3",mode="system"} 247.52
node_cpu_seconds_total{cpu="3",mode="user"} 174.7
node_cpu_seconds_total{cpu="4",mode="idle"} 12378.86
node_cpu_seconds_total{cpu="4",mode="iowait"} 5.22
node_cpu_seconds_total{cpu="4",mode="irq"} 0
node_cpu_seconds_total{cpu="4",mode="nice"} 46.28
node_cpu_seconds_total{cpu="4",mode="softirq"} 26.63
node_cpu_seconds_total{cpu="4",mode="steal"} 0
node_cpu_seconds_total{cpu="4",mode="system"} 280.47
node_cpu_seconds_total{cpu="4",mode="user"} 189.53
node_cpu_seconds_total{cpu="5",mode="idle"} 1318.18
node_cpu_seconds_total{cpu="5",mode="iowait"} 0.26
node_cpu_seconds_total{cpu="5",mode="irq"} 0
node_cpu_seconds_total{cpu="5",mode="nice"} 0.01
node_cpu_seconds_total{cpu="5",mode="softirq"} 51.1
node_cpu_seconds_total{cpu="5",mode="steal"} 0
node_cpu_seconds_total{cpu="5",mode="system"} 235.27
node_cpu_seconds_total{cpu="5",mode="user"} 66.04

/proc/stat after CPU 5 went offline. The cpu5 line has disappeared (see the note after this dump).

cpu  99348 27893 158013 6650156 3173 0 19627 0 0 0
cpu0 17409 3025 25754 1248489 1110 0 2704 0 0 0
cpu1 19446 9482 28180 1228625 430 0 2873 0 0 0
cpu2 19435 6700 27683 1235686 568 0 2558 0 0 0
cpu3 17461 4056 24749 1242583 234 0 3714 0 0 0
cpu4 18949 4628 28042 1237398 522 0 2662 0 0 0
intr 6357674 36 4385 0 0 0 0 0 0 0 0 0 0 7848 0 0 12183 0 0 100819 24381 143630 37230 56 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 10418558
btime 1676640236
processes 29024
procs_running 1
procs_blocked 0
softirq 2408562 12 528701 2375 24957 44417 3 4439 819905 0 983753
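
One possible fix direction (not necessarily what upstream will choose) is to prune cached entries for CPU ids that are no longer present in the latest /proc/stat read, so that series for offline CPUs stop being exported. A hedged sketch, reusing the illustrative collector type from the earlier snippet:

// Possible fix sketch: drop cached entries for CPUs missing from the new snapshot.
func (c *collector) updateAndPrune(snapshot map[int]cpuStat) {
	for id, stat := range snapshot {
		c.cache[id] = stat
	}
	for id := range c.cache {
		if _, online := snapshot[id]; !online {
			delete(c.cache, id) // CPU went offline; stop exporting its series
		}
	}
}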

raptorsun · Feb 17 '23 17:02