
node_textfile_scrape_error not reported when failing to read metrics due to inconsistent HELP texts

Open pieter-lautus opened this issue 3 years ago • 5 comments

Host operating system:

Linux tau-gfa-uat 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version:

node_exporter, version 1.3.1 (branch: HEAD, revision: a2321e7b940ddcff26873612bccdf7cd4c42b6b6)
  build user:       root@243aafa5525c
  build date:       20211205-11:09:49
  go version:       go1.17.3
  platform:         linux/amd64

node_exporter command line flags:

/usr/local/bin/node_exporter \
  --collector.disable-defaults \
  --collector.cpu \
  --collector.cpufreq \
  --collector.diskstats \
  --collector.edac \
  --collector.filefd \
  --collector.filesystem \
  --collector.filesystem.fs-types-exclude \
  '^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|fuse.*|hugetlbfs|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$' \
  --collector.hwmon \
  --collector.loadavg \
  --collector.mdadm \
  --collector.meminfo \
  --collector.netdev \
  --collector.netstat \
  --collector.pressure \
  --collector.processes \
  --collector.schedstat \
  --collector.sockstat \
  --collector.stat \
  --collector.textfile \
  --collector.textfile.directory \
  /var/lib/node_exporter_textfile \
  --collector.time \
  --collector.timex \
  --collector.vmstat \
  --web.disable-exporter-metrics \
  --web.listen-address \
  :9100

Are you running node_exporter in Docker?

No

What did you do that produced an error?

For technical reasons, different parts of our system create different files for the same metric but with different labels. Due to code divergence, the files ended up with inconsistent HELP texts.

For example:

# cat tau_infrastructure_performing_maintenance_task_foo.prom 
# HELP tau_infrastructure_performing_maintenance_task The server is performing some long-running maintenance task
# TYPE tau_infrastructure_performing_maintenance_task gauge
tau_infrastructure_performing_maintenance_task{main_task="deployment", sub_task="deployment_ansible", start_or_stop="start"} 1645624007.0

and

# cat tau_infrastructure_performing_maintenance_task_bar.prom 
# HELP tau_infrastructure_performing_maintenance_task At what timestamp a given task started or stopped, the last time it was run.
# TYPE tau_infrastructure_performing_maintenance_task gauge
tau_infrastructure_performing_maintenance_task{main_task="nightly",sub_task="main",start_or_stop="start"} 1647280801.98446

What did you expect to see?

I would expect either:

  1. To get all metrics, with node_exporter arbitrarily choosing one of the two HELP texts where there is a difference, or,
  2. if some metrics are dropped, for node_textfile_scrape_error to return non-zero for this job

What did you see instead?

The node_exporter logs an error about the situation and not all of the metrics are scraped by Prometheus, yet node_textfile_scrape_error is nevertheless zero despite there being an issue preventing metrics from being exported.

Mar 15 10:43:25 REDACTED node_exporter[733184]: ts=2022-03-15T08:43:25.153Z caller=stdlib.go:105 level=error msg="error gathering metrics: 4 error(s) occurred:\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"start\" > label:<name:\"sub_task\" value:\"main\" > gauge:<value:1.64728080198446e+09 >  has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\"\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"stop\" > label:<name:\"sub_task\" value:\"main\" > gauge:<value:1.64728240041946e+09 >  has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\"\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"start\" > label:<name:\"sub_task\" value:\"reporting\" > gauge:<value:1.64728080229161e+09 >  has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\"\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"stop\" > label:<name:\"sub_task\" value:\"reporting\" > gauge:<value:1.64728239993993e+09 >  has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\""
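
For reference, the behaviour can be reproduced without node_exporter at all, directly against client_golang. The following is a minimal sketch with made-up metric and label names (not our actual code), using an unchecked collector that emits the same metric name with two different HELP texts:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// inconsistentCollector emits two series for the same metric name but with
// different HELP strings, mimicking two textfile .prom files that diverged.
// Describe sends nothing, so the registry can only validate at Gather time.
type inconsistentCollector struct{}

func (inconsistentCollector) Describe(chan<- *prometheus.Desc) {}

func (inconsistentCollector) Collect(ch chan<- prometheus.Metric) {
	d1 := prometheus.NewDesc("demo_maintenance_task", "help text A", nil, prometheus.Labels{"sub_task": "foo"})
	d2 := prometheus.NewDesc("demo_maintenance_task", "help text B", nil, prometheus.Labels{"sub_task": "bar"})
	ch <- prometheus.MustNewConstMetric(d1, prometheus.GaugeValue, 1)
	ch <- prometheus.MustNewConstMetric(d2, prometheus.GaugeValue, 2)
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(inconsistentCollector{})

	mfs, err := reg.Gather()
	// mfs contains only the series built from the first HELP text seen; the
	// mismatching series is dropped and reported via the returned error.
	fmt.Println("families gathered:", len(mfs))
	fmt.Println("gather error:", err)
}

This matches what we observe on the real exporter: a partial scrape plus a log line, and nothing that surfaces as a metric.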

pieter-lautus · Mar 15 '22 08:03

The impact of this bug is that a large part of our monitoring was "flying blind" without us knowing, despite having taken care to monitor node_textfile_scrape_error to warn us about such situations.

pieter-lautus · Mar 15 '22 08:03

We have a similar issue. In our case, a bug in one of our textfile collector scripts introduces duplicate entries on some hosts. The node_exporter log is full of messages like this:

level=error ts=2022-03-21T10:55:22.451Z caller=stdlib.go:105 msg="error gathering metrics: 5 error(s) occurred:\n* [from Gatherer #2] collected metric \"smartmon_device_smart_available\" { label:<name:\"device\" value:\"/dev/bus/0\" > label:<name:\"disk\" value:\"0\" > gauge:<value:1 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"smartmon_device_smart_enabled\" { label:<name:\"device\" value:\"/dev/bus/0\" > label:<name:\"disk\" value:\"0\" > gauge:<value:1 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"smartmon_device_smart_healthy\" { label:<name:\"device\" value:\"/dev/bus/0\" > label:<name:\"disk\" value:\"0\" > gauge:<value:1 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"smartmon_smartctl_run\" { label:<name:\"device\" value:\"/dev/bus/0\" > label:<name:\"disk\" value:\"0\" > gauge:<value:1.647860106e+09 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"smartmon_device_active\" { label:<name:\"device\" value:\"/dev/bus/0\" > label:<name:\"disk\" value:\"0\" > gauge:<value:1 > } was collected before with the same name and label values"

Despite the log clearly stating that there have been errors while gathering metrics, node_textfile_scrape_error remains at 0.

equinox0815 · Mar 21 '22 13:03

Yes, we should increase node_textfile_scrape_error in this case.

discordianfish · Mar 22 '22 11:03

This looks like a tricky one to solve. For the most part, scraping and metric validation -- including the error message above -- are abstracted into the client_golang module, which has no awareness of anything happening in node_exporter. node_exporter itself can only detect errors such as file reading failures, not whether the data that was gathered consisted of valid metrics.
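
To illustrate, here is a simplified sketch of the wiring (not node_exporter's actual handler code): errors returned by the registry's Gather only ever reach the HTTP handler's ErrorLog, and ContinueOnError serves whatever passed validation, so nothing flows back to the collectors or into node_textfile_scrape_error.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()
	// ...collectors would be registered here...

	// Errors returned by reg.Gather() are passed to ErrorLog and the handler
	// keeps serving whatever passed validation; the collectors never see them.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		ErrorLog:      log.Default(),
		ErrorHandling: promhttp.ContinueOnError,
	}))
	log.Fatal(http.ListenAndServe(":9100", nil))
}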

rgroothuijsen · May 09 '22 00:05

@rgroothuijsen Good point. We'd either need to wrap the registry in our own abstraction that does these checks, or add an error counter to client_golang for such cases. The latter would require some discussion on the prometheus-dev mailing list, I guess.
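
A rough sketch of the first option, wrapping the Gatherer and exposing the error count through a side registry (gather_errors_total is an illustrative name, not an existing metric):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	dto "github.com/prometheus/client_model/go"
)

// errorCountingGatherer wraps another Gatherer, bumps a counter whenever
// Gather returns an error, and merges that counter into the scrape output,
// so HELP mismatches and duplicate series show up as a metric rather than
// only as a log line.
type errorCountingGatherer struct {
	inner  prometheus.Gatherer
	side   *prometheus.Registry
	errors prometheus.Counter
}

func newErrorCountingGatherer(inner prometheus.Gatherer) *errorCountingGatherer {
	g := &errorCountingGatherer{
		inner: inner,
		side:  prometheus.NewRegistry(),
		errors: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "gather_errors_total", // illustrative name only
			Help: "Number of scrapes during which Gather returned an error.",
		}),
	}
	g.side.MustRegister(g.errors)
	return g
}

func (g *errorCountingGatherer) Gather() ([]*dto.MetricFamily, error) {
	mfs, err := g.inner.Gather()
	if err != nil {
		g.errors.Inc()
	}
	// Merge in the side registry holding only the error counter.
	extra, _ := g.side.Gather()
	return append(mfs, extra...), err
}

func main() {
	g := newErrorCountingGatherer(prometheus.DefaultGatherer)
	http.Handle("/metrics", promhttp.HandlerFor(g, promhttp.HandlerOpts{
		// Keep serving the partial scrape, as the exporter does today.
		ErrorHandling: promhttp.ContinueOnError,
	}))
	log.Fatal(http.ListenAndServe(":9100", nil))
}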

discordianfish · Jun 07 '22 10:06