node_exporter
node_textfile_scrape_error not reported when failing to read metrics due to inconsistent HELP texts
Host operating system:
Linux tau-gfa-uat 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version:
node_exporter, version 1.3.1 (branch: HEAD, revision: a2321e7b940ddcff26873612bccdf7cd4c42b6b6)
  build user:       root@243aafa5525c
  build date:       20211205-11:09:49
  go version:       go1.17.3
  platform:         linux/amd64
node_exporter command line flags
/usr/local/bin/node_exporter \
--collector.disable-defaults \
--collector.cpu \
--collector.cpufreq \
--collector.diskstats \
--collector.edac \
--collector.filefd \
--collector.filesystem \
--collector.filesystem.fs-types-exclude \
'^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|fuse.*|hugetlbfs|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$' \
--collector.hwmon \
--collector.loadavg \
--collector.mdadm \
--collector.meminfo \
--collector.netdev \
--collector.netstat \
--collector.pressure \
--collector.processes \
--collector.schedstat \
--collector.sockstat \
--collector.stat \
--collector.textfile \
--collector.textfile.directory \
/var/lib/node_exporter_textfile \
--collector.time \
--collector.timex \
--collector.vmstat \
--web.disable-exporter-metrics \
--web.listen-address \
:9100
Are you running node_exporter in Docker?
No
What did you do that produced an error?
For technical reasons, different parts of our system write separate textfile-collector files for the same metric, but with different labels. Due to code divergence, the files ended up with inconsistent HELP texts.
For example:
# cat tau_infrastructure_performing_maintenance_task_foo.prom
# HELP tau_infrastructure_performing_maintenance_task The server is performing some long-running maintenance task
# TYPE tau_infrastructure_performing_maintenance_task gauge
tau_infrastructure_performing_maintenance_task{main_task="deployment", sub_task="deployment_ansible", start_or_stop="start"} 1645624007.0
and
# cat tau_infrastructure_performing_maintenance_task_bar.prom
# HELP tau_infrastructure_performing_maintenance_task At what timestamp a given task started or stopped, the last time it was run.
# TYPE tau_infrastructure_performing_maintenance_task gauge
tau_infrastructure_performing_maintenance_task{main_task="nightly",sub_task="main",start_or_stop="start"} 1647280801.98446
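Note that each file is valid exposition format on its own. Below is a minimal sketch (file contents inlined; using expfmt here is an assumption about how the textfile collector reads files) showing that both files parse cleanly in isolation. The HELP conflict can only surface later, when client_golang merges the metric families at gather time.

```go
// A minimal sketch, with the two file contents inlined as string constants.
// Each file parses cleanly on its own, so there is no per-file error for the
// textfile collector to report.
package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/common/expfmt"
)

const fooProm = `# HELP tau_infrastructure_performing_maintenance_task The server is performing some long-running maintenance task
# TYPE tau_infrastructure_performing_maintenance_task gauge
tau_infrastructure_performing_maintenance_task{main_task="deployment",sub_task="deployment_ansible",start_or_stop="start"} 1645624007.0
`

const barProm = `# HELP tau_infrastructure_performing_maintenance_task At what timestamp a given task started or stopped, the last time it was run.
# TYPE tau_infrastructure_performing_maintenance_task gauge
tau_infrastructure_performing_maintenance_task{main_task="nightly",sub_task="main",start_or_stop="start"} 1647280801.98446
`

func main() {
	var parser expfmt.TextParser
	for name, content := range map[string]string{"foo.prom": fooProm, "bar.prom": barProm} {
		// TextToMetricFamilies only fails on malformed input; it cannot know
		// that another file declares a different HELP text for the same metric.
		if _, err := parser.TextToMetricFamilies(strings.NewReader(content)); err != nil {
			fmt.Printf("%s: parse error: %v\n", name, err)
			continue
		}
		fmt.Printf("%s: parses without error\n", name)
	}
}
```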
What did you expect to see?
I would expect either:
- To get all metrics, with node_exporter arbitrarily choosing one of the two HELP texts where there is a difference, or,
- If some metrics are dropped, for node_textfile_scrape_error to be non-zero for this job.
What did you see instead?
node_exporter logs an error about the situation and not all of the metrics are scraped by Prometheus, yet node_textfile_scrape_error remains zero despite an issue preventing metrics from being exported.
Mar 15 10:43:25 REDACTED node_exporter[733184]: ts=2022-03-15T08:43:25.153Z caller=stdlib.go:105 level=error msg="error gathering metrics: 4 error(s) occurred:\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"start\" > label:<name:\"sub_task\" value:\"main\" > gauge:<value:1.64728080198446e+09 > has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\"\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"stop\" > label:<name:\"sub_task\" value:\"main\" > gauge:<value:1.64728240041946e+09 > has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\"\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"start\" > label:<name:\"sub_task\" value:\"reporting\" > gauge:<value:1.64728080229161e+09 > has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\"\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"stop\" > label:<name:\"sub_task\" value:\"reporting\" > gauge:<value:1.64728239993993e+09 > has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\""
The impact of this bug is that a large part of our monitoring was "flying blind" without us knowing, even though we had taken care to monitor node_textfile_scrape_error precisely so it would warn us about such situations.
We have a similar issue. In our case, a bug in one of our textfile collector scripts introduces duplicate entries on some hosts. The node_exporter log is full of messages like this:
level=error ts=2022-03-21T10:55:22.451Z caller=stdlib.go:105 msg="error gathering metrics: 5 error(s) occurred:\n* [from Gatherer #2] collected metric \"smartmon_device_smart_available\" { label:<name:\"device\" value:\"/dev/bus/0\" > label:<name:\"disk\" value:\"0\" > gauge:<value:1 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"smartmon_device_smart_enabled\" { label:<name:\"device\" value:\"/dev/bus/0\" > label:<name:\"disk\" value:\"0\" > gauge:<value:1 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"smartmon_device_smart_healthy\" { label:<name:\"device\" value:\"/dev/bus/0\" > label:<name:\"disk\" value:\"0\" > gauge:<value:1 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"smartmon_smartctl_run\" { label:<name:\"device\" value:\"/dev/bus/0\" > label:<name:\"disk\" value:\"0\" > gauge:<value:1.647860106e+09 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"smartmon_device_active\" { label:<name:\"device\" value:\"/dev/bus/0\" > label:<name:\"disk\" value:\"0\" > gauge:<value:1 > } was collected before with the same name and label values"
Despite the log clearly stating that there have been errors while gathering metrics, node_textfile_scrape_error remains at 0.
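For illustration, here is a rough, self-contained sketch of the same class of error using client_golang directly. This is not how node_exporter is wired internally (the metric name is just taken from the log above); it only shows that the duplicate check happens at gather time, outside any individual collector, so nothing collector-side records a failure.

```go
// A rough sketch of the duplicate-series error as client_golang reports it.
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// newRegistryWithSeries returns a registry exposing one fixed series,
// mimicking one copy of the duplicated textfile output.
func newRegistryWithSeries() *prometheus.Registry {
	reg := prometheus.NewRegistry()
	g := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "smartmon_device_smart_healthy", // name taken from the log above
		Help: "SMART healthy status",
	}, []string{"device", "disk"})
	g.WithLabelValues("/dev/bus/0", "0").Set(1)
	reg.MustRegister(g)
	return reg
}

func main() {
	// Merging two gatherers that expose the same series makes Gather()
	// return the "was collected before with the same name and label values"
	// error, while neither registry knows anything went wrong.
	merged := prometheus.Gatherers{newRegistryWithSeries(), newRegistryWithSeries()}
	if _, err := merged.Gather(); err != nil {
		fmt.Println("gather error:", err)
	}
}
```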
Yes, we should increase node_textfile_scrape_error in this case.
This looks like a tricky one to solve. For the most part, scraping and metric validation -- including the error message above -- are handled inside the client_golang module, which has no awareness of anything happening in node_exporter. node_exporter itself can only detect errors such as file reading failures, not whether the gathered data consisted of valid metrics.
@rgroothuijsen Good point. We'd either need to wrap the registry in our own abstraction that does these checks, or add an error counter to client_golang for such cases. The latter would probably require some discussion on the prometheus-dev mailing list.
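For reference, a rough sketch of the first option; the wrapper type and the example_gather_errors_total metric name are made up for illustration and are not node_exporter code. The idea is to wrap the registry in a Gatherer that counts gather errors, so inconsistencies detected by client_golang would still surface as a metric, similar in spirit to node_textfile_scrape_error.

```go
// A rough sketch of the "wrap the registry" idea.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	dto "github.com/prometheus/client_model/go"
)

// errorCountingGatherer wraps another Gatherer and counts how often
// Gather() returned an error (e.g. inconsistent HELP texts or duplicate
// series detected by client_golang).
type errorCountingGatherer struct {
	inner  prometheus.Gatherer
	errors prometheus.Counter
}

func (g *errorCountingGatherer) Gather() ([]*dto.MetricFamily, error) {
	mfs, err := g.inner.Gather()
	if err != nil {
		g.errors.Inc()
	}
	return mfs, err
}

func main() {
	reg := prometheus.NewRegistry()

	gatherErrors := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_gather_errors_total", // hypothetical metric name
		Help: "Number of scrapes during which gathering returned an error.",
	})
	reg.MustRegister(gatherErrors)

	wrapped := &errorCountingGatherer{inner: reg, errors: gatherErrors}

	// ContinueOnError keeps serving whatever did gather, so the incremented
	// error counter becomes visible on the scrape after a failure.
	http.Handle("/metrics", promhttp.HandlerFor(wrapped, promhttp.HandlerOpts{
		ErrorHandling: promhttp.ContinueOnError,
	}))
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```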