node_exporter
Handle thermal_zone errors gracefully
Host operating system:
Linux 5.10.104-tegra #18 SMP PREEMPT aarch64 aarch64 aarch64 GNU/Linux
node_exporter version:
1.7.0
node_exporter command line flags:
--path.rootfs=/host
node_exporter log output
...
caller=collector.go:169 level=error msg="collector failed" name=thermal_zone duration_seconds=0.01870677 err="read /sys/class/thermal/thermal_zone10/temp: invalid argument"
caller=collector.go:169 level=error msg="collector failed" name=thermal_zone duration_seconds=0.001411717 err="read /sys/class/thermal/thermal_zone10/temp: invalid argument"
...
Are you running node_exporter in Docker?
Yes
What did you do that produced an error?
Running node_exporter in a Docker container on a custom embedded device.
What did you expect to see?
Disabled thermal zones to be either ignored or optionally filtered out.
What did you see instead?
The entire thermal_zone collector fails for all thermal_zones.
When a thermal zone is disabled (which can be determined via /sys/class/thermal/thermal_zone10/mode), it would be nice for node_exporter to handle it gracefully, whether natively or via a flag, or to allow specific files or devices to be filtered out manually instead of having to disable the entire class of devices.
My temporary workaround has been to use the Pushgateway with a curl container in my docker compose file, like so:
  pushgateway:
    image: prom/pushgateway
    container_name: pushgateway
    restart: unless-stopped
    networks:
      - metrics
  curl_thermals:
    image: curlimages/curl
    container_name: curl_thermals
    command: '/bin/sh /pushgateway-thermal-zones.sh'
    pid: host
    restart: unless-stopped
    volumes:
      - /:/host:ro,rslave
      - ./pushgateway-thermal-zones.sh:/pushgateway-thermal-zones.sh:ro,rslave
    networks:
      - metrics
With this pushgateway-thermal-zones.sh script:
while true
do
    # Metric metadata, followed by one sample line per enabled zone
    output="# TYPE thermal_zone gauge\n# HELP thermal_zone Thermal zone temperatures in Celsius\n"
    # Loop through each thermal zone directory in /host/sys/class/thermal
    for zone in /host/sys/class/thermal/thermal_zone*; do
        # Check if the thermal zone is enabled by reading the mode file
        mode=$(cat "${zone}/mode" 2>/dev/null)
        if [ "${mode}" = "enabled" ]; then
            zone_number=$(basename "${zone}" | sed 's/thermal_zone//')
            zone_type=$(cat "${zone}/type")
            # Skip the zone if its temperature cannot be read
            zone_temp=$(cat "${zone}/temp" 2>/dev/null) || continue
            # sysfs reports millidegrees Celsius; convert to degrees
            zone_temp_scaled=$(echo "scale=2; ${zone_temp} / 1000.0" | bc)
            # Append the details to the output variable
            output="${output}thermal_zone{zone=\"${zone_number}\", type=\"${zone_type}\"} ${zone_temp_scaled}\n"
        fi
    done
    # printf '%b' expands the \n escapes without relying on echo -e
    printf '%b' "${output}" | curl -s --data-binary @- http://pushgateway:9091/metrics/job/thermal_zones/
    sleep 3
done
Seems like the error is coming from here: https://github.com/prometheus/procfs/blob/69fc8f61debb3bd7efca3a9a1c295d4012022830/sysfs/class_thermal.go#L73 / https://github.com/prometheus/procfs/blob/69fc8f61debb3bd7efca3a9a1c295d4012022830/sysfs/class_thermal.go#L52. Maybe there should be a check there for whether the error is os.ErrInvalid, and if so either return an empty ClassThermalZoneStats{} or ignore it. Another option could be to check the mode for 'disabled' first in parseClassThermalZone() and return early.
Not sure how to achieve this directly from node_exporter.
@Kylea650 Checking mode for disabled sounds like a good option. If anyone wants to submit a PR to sysfs, feel free to ping me there.
@discordianfish Happy to raise a new issue mentioning this one and PR over in sysfs this week. Cheers!
Is this issue still open?