sensors plugin reports a higher temperature than lm-sensors

Open rlipscombe opened this issue 1 year ago • 6 comments

Relevant telegraf.conf

# Monitor sensors, requires lm-sensors package
[[inputs.sensors]]
  ## Remove numbers from field names.
  ## If true, a field name like 'temp1_input' will be changed to 'temp_input'.
  # remove_numbers = true

  ## Timeout is the maximum amount of time that the sensors command can run.
  # timeout = "5s"

Logs from Telegraf

$ telegraf --config /etc/telegraf/telegraf.conf --input-filter sensors --test --debug
2022-08-29T15:28:53Z I! Starting Telegraf 1.21.4+ds1-0ubuntu2
2022-08-29T15:28:53Z I! Loaded inputs: sensors
2022-08-29T15:28:53Z I! Loaded aggregators:
2022-08-29T15:28:53Z I! Loaded processors:
2022-08-29T15:28:53Z W! Outputs are not used in testing mode!
2022-08-29T15:28:53Z I! Tags enabled: host=roger-nuc
2022-08-29T15:28:53Z D! [agent] Initializing plugins
2022-08-29T15:28:53Z D! [agent] Starting service inputs
2022-08-29T15:28:53Z D! [agent] Stopping service inputs
2022-08-29T15:28:53Z D! [agent] Input channel closed
> sensors,chip=coretemp-isa-0000,feature=package_id_0,host=roger-nuc temp_crit=100,temp_crit_alarm=0,temp_input=49,temp_max=100 1661786934000000000
2022-08-29T15:28:53Z D! [agent] Stopped Successfully
> sensors,chip=coretemp-isa-0000,feature=core_0,host=roger-nuc temp_crit=100,temp_crit_alarm=0,temp_input=41,temp_max=100 1661786934000000000
> sensors,chip=coretemp-isa-0000,feature=core_1,host=roger-nuc temp_crit=100,temp_crit_alarm=0,temp_input=49,temp_max=100 1661786934000000000
> sensors,chip=pch_skylake-virtual-0,feature=temp1,host=roger-nuc temp_input=38.5 1661786934000000000
> sensors,chip=acpitz-acpi-0,feature=temp1,host=roger-nuc temp_input=-263.2 1661786934000000000
> sensors,chip=nvme-pci-3c00,feature=composite,host=roger-nuc temp_alarm=0,temp_crit=84.85,temp_input=44.85,temp_max=79.85,temp_min=-5.15 1661786934000000000

System info

Telegraf 1.21.4+ds1-0ubuntu2, Ubuntu 22.04

Docker

No response

Steps to reproduce

  1. Default installation with apt install telegraf from default repositories.
  2. Enable the sensors plugin in /etc/telegraf/telegraf.conf
  3. Run telegraf --config /etc/telegraf/telegraf.conf --input-filter sensors --test --debug, note that the sensors,chip=coretemp-isa-0000,feature=package_id_0 has temp_input=49.
  4. Immediately run sensors. Note that the output shows 38C, which is 11 degrees cooler. This only affects the CPU temps; the other values agree.
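
Steps 3 and 4 can be scripted so that the two readings are taken back to back. The sketch below is a hypothetical helper, not part of Telegraf: the names `temps_from_line_protocol` and `run_both` are invented here, and the parsing assumes the `--test` output has the simple line-protocol shape shown in the logs above (no escaped commas or spaces in tags).

```python
import re
import subprocess

# Matches metric lines such as:
# > sensors,chip=coretemp-isa-0000,feature=core_0,host=h temp_input=41,temp_max=100 1661786934000000000
LINE_RE = re.compile(r"^> sensors,(?P<tags>\S+) (?P<fields>\S+) \d+$")


def temps_from_line_protocol(text):
    """Return {(chip, feature): temp_input} from `telegraf --test` output."""
    temps = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip log lines such as "... D! [agent] ..."
        tags = dict(pair.split("=", 1) for pair in match["tags"].split(","))
        fields = dict(pair.split("=", 1) for pair in match["fields"].split(","))
        if "temp_input" in fields:
            key = (tags.get("chip"), tags.get("feature"))
            temps[key] = float(fields["temp_input"])
    return temps


def run_both(config="/etc/telegraf/telegraf.conf"):
    """Run sensors, then telegraf --test, one immediately after the other."""
    sensors_out = subprocess.run(
        ["sensors"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ).stdout
    telegraf_out = subprocess.run(
        ["telegraf", "--config", config, "--input-filter", "sensors", "--test"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    ).stdout
    return sensors_out, temps_from_line_protocol(telegraf_out)
```

Calling `run_both()` on the affected machine returns the raw sensors text alongside the temperatures Telegraf reported, which makes the gap easy to record on each attempt.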

Expected behavior

lm-sensors and the plugin should agree on the CPU temperature.

Actual behavior

telegraf reports a CPU temperature approximately 10 degrees warmer than lm-sensors.

Additional info

No response

rlipscombe Aug 29 '22 15:08

Note that this apparently only happens with --test; if I enable file output and then tail -f /tmp/metrics.out, the values are reported correctly.

rlipscombe Aug 29 '22 15:08

Hi @rlipscombe. Telegraf's sensors input plugin is very simple. It only runs the sensors program and scrapes its output. The plugin does no math on the values it scrapes. It's very unlikely that telegraf modified the temperature that it got from sensors.

It may be that you are running into a bug or quirk of the sensors program itself or the hardware it reads the temperature from. I just tried to reproduce what you saw, but I ran sensors from the CLI first. For one temperature sensor it returned 50C, then when I ran telegraf it was 46C. Then I ran sensors and got 46C. Any combination of order of the two programs from then on returned 46C. I wonder if sensors itself returns inaccurate values in some cases the first time it's called.

I also replaced sensors with a script that returns the same format as sensors but always returns the same values. Whether I ran it directly or telegraf ran it and scraped it, the values were always the same.
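
A stand-in of that kind might look like the sketch below. It is not the script used in the test described above; the readings are copied from the `sensors -A -u` output posted later in this thread, and it assumes the substitution is done by placing the script ahead of the real `sensors` on the PATH that telegraf sees.

```python
#!/usr/bin/env python3
"""Fake `sensors` that always prints the same readings."""
import sys

# Fixed text in the same layout as `sensors -A -u`.
FIXED_OUTPUT = """\
coretemp-isa-0000
Package id 0:
  temp1_input: 35.000
  temp1_max: 100.000
  temp1_crit: 100.000
  temp1_crit_alarm: 0.000
Core 0:
  temp2_input: 34.000
  temp2_max: 100.000
  temp2_crit: 100.000
  temp2_crit_alarm: 0.000

nvme-pci-3c00
Composite:
  temp1_input: 44.850
  temp1_max: 79.850
  temp1_min: -5.150
  temp1_crit: 84.850
  temp1_alarm: 0.000
"""


def main(argv):
    # Arguments such as -A and -u are accepted and ignored;
    # the output never changes between runs.
    sys.stdout.write(FIXED_OUTPUT)
    return 0


if __name__ == "__main__":
    main(sys.argv[1:])
```

Because the script's output is constant, any difference between a direct run and a telegraf run would have to come from the scraping side.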

reimda Sep 12 '22 15:09

If you can share a reproducible case where the plugin is broken I'd be happy to check it out. Otherwise I don't see any change that telegraf needs to make and we should close this issue.

reimda Sep 12 '22 15:09

I'm as puzzled as you, because I already assumed it simply scraped sensors, and yet the repro is exactly as detailed in the original report.

I did it again just now: I can run sensors multiple times, and it reports ~35C, then I immediately run telegraf --test and it reports ~46C.

Somewhat weirdly: if I run them both in split-pane tmux using watch, then the numbers agree. But the moment I run them one after the other, the broken behaviour's back. I don't believe that the CPU temp can change by 10C in under a second, so 😕.

rlipscombe Sep 13 '22 07:09

fwiw, here's the output of sensors -A -u (per https://github.com/influxdata/telegraf/blob/v1.24.0/plugins/inputs/sensors/sensors.go#L81):

$ sensors -A -u
coretemp-isa-0000
Package id 0:
  temp1_input: 35.000
  temp1_max: 100.000
  temp1_crit: 100.000
  temp1_crit_alarm: 0.000
Core 0:
  temp2_input: 34.000
  temp2_max: 100.000
  temp2_crit: 100.000
  temp2_crit_alarm: 0.000
Core 1:
  temp3_input: 34.000
  temp3_max: 100.000
  temp3_crit: 100.000
  temp3_crit_alarm: 0.000

pch_skylake-virtual-0
temp1:
  temp1_input: 35.500

acpitz-acpi-0
temp1:
  temp1_input: -263.200

iwlwifi_1-virtual-0
temp1:
ERROR: Can't get value of subfeature temp1_input: Can't read

nvme-pci-3c00
Composite:
  temp1_input: 44.850
  temp1_max: 79.850
  temp1_min: -5.150
  temp1_crit: 84.850
  temp1_alarm: 0.000

(that ERROR goes to stderr)
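
For reference, scraping output in this format takes only a few lines. The sketch below is an illustration in Python, not the plugin's Go code: `parse_sensors` is a hypothetical name, and the feature and field naming (for example `Package id 0` becoming `package_id_0`, `temp1_input` becoming `temp_input`) simply mimics what the telegraf log output in the original report shows.

```python
import re

# An indented "name: number" line, e.g. "  temp1_input: 35.000"
FIELD_RE = re.compile(r"^\s+(\w+):\s+(-?\d+(?:\.\d+)?)$")


def parse_sensors(text, remove_numbers=True):
    """Parse `sensors -A -u` text into {(chip, feature): {field: value}}."""
    readings = {}
    chip = feature = None
    for line in text.splitlines():
        if not line.strip():
            chip = feature = None  # a blank line ends the current chip block
            continue
        field = FIELD_RE.match(line)
        if field and chip and feature:
            name = field.group(1)
            if remove_numbers:
                # temp1_input -> temp_input
                name = re.sub(r"\d+", "", name, count=1)
            readings.setdefault((chip, feature), {})[name] = float(field.group(2))
        elif line.endswith(":") and chip:
            # "Package id 0:" -> "package_id_0"
            feature = line[:-1].strip().lower().replace(" ", "_")
        elif chip is None:
            chip = line.strip()  # first line of a block is the chip name
        # anything else (e.g. an ERROR line mixed in from stderr) is ignored
    return readings
```

Feeding it the text above yields `35.0` for `("coretemp-isa-0000", "package_id_0")`, with no arithmetic applied to the value along the way.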

rlipscombe Sep 13 '22 08:09

@rlipscombe If you're able to reproduce it every time, it must not be the same thing as the four-degree difference I saw one time. Maybe it's something unique to your platform. If so, I'm not going to be able to reproduce it.

I made a PR for you to help us understand what is going on. The PR changes telegraf so it saves the output of sensors to a file before scraping it.

The PR build is available here https://github.com/influxdata/telegraf/pull/11808#issuecomment-1247047007 (the build will be automatically deleted after about a month)

Would you download and uncompress this build on your nuc, then run ./telegraf-1.25.0/usr/bin/telegraf --test --config /etc/telegraf/telegraf.conf --input-filter sensors 2>&1 | tee telegraf.txt? That will save two files in the current directory, telegraf.txt and telegraf-sensors.txt. Right afterward, please run sensors -A -u 2>&1 | tee sensors.txt. That will save one file, sensors.txt.

We should be able to use those files to see whether telegraf is scraping the output incorrectly or if the output is really different when telegraf runs sensors compared to when you run it from the shell. Please attach all three files in a comment on this issue so I can look at them.
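
One way to compare the two captures is a plain unified diff. This is a minimal sketch using Python's standard `difflib`; `diff_captures` is a name made up for this example, and the file names are the ones mentioned above.

```python
import difflib


def diff_captures(telegraf_capture, shell_capture):
    """Return unified-diff lines between two captured sensors outputs."""
    return list(
        difflib.unified_diff(
            telegraf_capture.splitlines(),
            shell_capture.splitlines(),
            fromfile="telegraf-sensors.txt",
            tofile="sensors.txt",
            lineterm="",
        )
    )
```

With both files in the current directory, `print("\n".join(diff_captures(open("telegraf-sensors.txt").read(), open("sensors.txt").read())))` lists every reading that differs. Lines that appear in only one file because stderr was captured there may also show up in the diff.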

reimda Sep 14 '22 17:09

telegraf-sensors.txt sensors.txt telegraf.txt

Looks like it's parsing the output entirely correctly. The problem seems to be that the temperature does apparently jump by 12C when running telegraf. That's really weird and suggests that (maybe) there's a bug in the chipset/firmware on this PC. Or maybe spinning up a Go app raises the CPU freq, which screws up the temperature readings. (shrug)

I'll keep digging, but it looks like it's not a bug in telegraf. Thanks for taking the time to look at it. Closing.

rlipscombe Sep 23 '22 16:09