node_exporter
textfile: the same metric occurring with different labels in two different prom files triggers an error due to an inconsistent help message in processFile
Host operating system: output of uname -a
Linux hostname 5.4.65 #1 SMP Tue Sep 15 21:22:36 UTC 2020 x86_64 AMD EPYC 7551P 32-Core Processor AuthenticAMD GNU/Linux
node_exporter version: output of node_exporter --version
# node_exporter --version
node_exporter, version 1.0.1 (branch: HEAD, revision: 3715be6ae899f2a9b9dbfd9c39f3e09a7bd4559f)
build user: root@1f76dbbcfa55
build date: 20200616-12:44:12
go version: go1.14.4
node_exporter command line flags
node_exporter -- --collector.textfile.directory=/var/lib/node_exporter/ --collector.bonding --collector.buddyinfo --collector.ntp --no-collector.infiniband --no-collector.nfs --no-collector.nfsd --no-collector.wifi --no-collector.zfs --web.listen-address=127.0.0.1:9100
Are you running node_exporter in Docker?
No.
What did you do that produced an error?
Write two different prom files to the textfile collector directory, containing metrics with the same names but different labels. For example:
$ cat /var/lib/node_exporter/app-instance-1.prom
# TYPE app_metric_name counter
app_metric_name{instance="1"} 0
$ cat /var/lib/node_exporter/app-instance-2.prom
# TYPE app_metric_name counter
app_metric_name{instance="2"} 0
What did you expect to see?
I expected metrics with the same name but different labels to be collected from both prom files.
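For the example above, a scrape of the textfile collector would then expose both series under a single family, roughly like this (the exact default HELP text is whatever the collector fills in):
# HELP app_metric_name Metric read from ...
# TYPE app_metric_name counter
app_metric_name{instance="1"} 0
app_metric_name{instance="2"} 0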
What did you see instead?
Metrics were successfully collected from the first prom file, but the metrics from the second prom file were rejected by the prometheus.checkMetricConsistency function at prometheus/registry.go#L914. This happens because the textfile collector fills in a default HELP string that embeds the source file's path, so the same metric family ends up with two different HELP strings.
The inconsistency can be avoided by replacing the full path of the prom file with the parent directory name, like this:
diff --git a/collector/textfile.go b/collector/textfile.go
index 50c1807..f464d44 100644
--- a/collector/textfile.go
+++ b/collector/textfile.go
@@ -254,7 +254,7 @@ func (c *textFileCollector) processFile(name string, ch chan<- prometheus.Metric
 	for _, mf := range families {
 		if mf.Help == nil {
-			help := fmt.Sprintf("Metric read from %s", path)
+			help := fmt.Sprintf("Metric read from %s", c.path)
 			mf.Help = &help
 		}
 	}
The rejection produces the following error, which was not logged by default:
* [from Gatherer #2] collected metric app_metric_name label:<name:"instance" value:"2" > counter:<value:0 > has help "Metric read from /var/lib/node_exporter/app-instance-2.prom" but should have "Metric read from /var/lib/node_exporter/app-instance-1.prom"
I was able to log it with the following patch:
--- node_exporter-1.0.1/node_exporter.go
+++ node_exporter-1.0.1/node_exporter.go
@@ -31,6 +31,7 @@
 	"github.com/prometheus/node_exporter/collector"
 	"github.com/prometheus/node_exporter/https"
 	kingpin "gopkg.in/alecthomas/kingpin.v2"
+	stdlog "log"
 )
 
 // handler wraps an unfiltered http.Handler but uses a filtered handler,
@@ -121,6 +122,7 @@
 		handler := promhttp.HandlerFor(
 			prometheus.Gatherers{h.exporterMetricsRegistry, r},
 			promhttp.HandlerOpts{
+				ErrorLog:            stdlog.New(os.Stderr, "", stdlog.LstdFlags),
 				ErrorHandling:       promhttp.ContinueOnError,
 				MaxRequestsInFlight: h.maxRequests,
 				Registry:            h.exporterMetricsRegistry,
Looks like this is the same root cause as https://github.com/prometheus/node_exporter/issues/1600.
The inconsistency can be avoided by replacing the full path of the prom file with the parent directory name, like this:
Might be a good option. @SuperQ wdyt? And maybe @juliusv has an opinion on this.
IIRC, @SuperQ previously said that the use case of having the same metric in two files was explicitly not supported.
The directory name wouldn't be much use, as it'd be the same across all metrics, so we'd no longer be defaulting to a value that helps you figure out which metric came from which file.
Makes sense, but I'd still like to support the use case of having two separate processes report the same metric via the textfile exporter. If they had to write the same metrics file, they would need to coordinate/merge their state somehow, which seems annoying. Any alternative suggestions?
If someone wants to contribute code for it, the most optimal solution to this problem would seem to be to group identically named metrics from different files into a single output group with just one HELP comment saying something like Metric read from <file-1> and <file-2>.
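A rough sketch of what that grouping could look like (a hypothetical helper, not the actual textfile collector code; it assumes the prometheus/common expfmt parser and the client_model types, and the function name is made up for illustration):

// Hypothetical sketch of the grouping idea above -- not the actual
// textfile collector implementation.
package textfile

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// mergeTextfiles parses every *.prom file in dir, merges metric families
// that share a name, and fills in a single default HELP string naming all
// of the files a family was read from, so the value is identical no matter
// which file a sample came from.
func mergeTextfiles(dir string) (map[string]*dto.MetricFamily, error) {
	var parser expfmt.TextParser
	merged := map[string]*dto.MetricFamily{}
	sources := map[string][]string{} // family name -> files it came from

	paths, err := filepath.Glob(filepath.Join(dir, "*.prom"))
	if err != nil {
		return nil, err
	}
	for _, path := range paths {
		f, err := os.Open(path)
		if err != nil {
			return nil, err
		}
		families, err := parser.TextToMetricFamilies(f)
		f.Close()
		if err != nil {
			return nil, err
		}
		for name, mf := range families {
			sources[name] = append(sources[name], path)
			if existing, ok := merged[name]; ok {
				// Family already seen in an earlier file: append its
				// samples so both end up in one output group.
				existing.Metric = append(existing.Metric, mf.Metric...)
			} else {
				merged[name] = mf
			}
		}
	}
	// Default the HELP only for families that have no explicit one.
	for name, mf := range merged {
		if mf.Help == nil {
			help := fmt.Sprintf("Metric read from %s", strings.Join(sources[name], " and "))
			mf.Help = &help
		}
	}
	return merged, nil
}

A real implementation would also have to check that the TYPE is consistent across files before merging; the sketch skips that.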
Any alternative suggestions?
If both files contain the same explicit HELP then this logic doesn't apply, and it'll all work.
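In other words, today's workaround is for every writer to include its own explicit HELP line and keep it identical across files; for the example above that would look something like this (the HELP text itself is just an example):
$ cat /var/lib/node_exporter/app-instance-1.prom
# HELP app_metric_name Events counted by the app.
# TYPE app_metric_name counter
app_metric_name{instance="1"} 0
$ cat /var/lib/node_exporter/app-instance-2.prom
# HELP app_metric_name Events counted by the app.
# TYPE app_metric_name counter
app_metric_name{instance="2"} 0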
This is true, but it would be even better to make things work automatically even if the user forgot to provide their own HELP strings. Especially since this is not an error that you notice on startup, for many people it will be a silent error that they don't notice unless they're super on top of their monitoring.
That's partially due to the ContinueOnError, which is a dangerous setting as it hides failures like this in a non-deterministic way.
Agreed, the Node Exporter sets ContinueOnError here: https://github.com/prometheus/node_exporter/blob/202ecf9c9d1d1960cc9cac24838d13e9cff5edca/node_exporter.go#L124
That seems dangerous, but I'm not deep enough into the Node Exporter to say what kinds of things it would break if we set that to e.g. HTTPErrorOnError instead.
Basically, if there's a problem the whole scrape would fail with HTTPErrorOnError, rather than effectively failing silently in a non-deterministic way like it currently does.
Yes, that much is clear. I was just wondering if that would be too problematic to change it now. I was wondering why ContinueOnError was initially introduced into the Node Exporter, and it was in https://github.com/prometheus/node_exporter/commit/dace41e3d4c3dfc3a52fa73fcba322082dcce22a, explicitly for the textfile collector.
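For illustration, here is a minimal standalone promhttp handler using the stricter setting being discussed (an assumed setup, not node_exporter's actual wiring):

package main

import (
	stdlog "log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()
	handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		// Log gather errors instead of dropping them silently.
		ErrorLog: stdlog.New(os.Stderr, "", stdlog.LstdFlags),
		// HTTPErrorOnError fails the whole scrape with an HTTP error when
		// any collector reports a problem, unlike ContinueOnError, which
		// serves whatever could be gathered and hides the failure.
		ErrorHandling: promhttp.HTTPErrorOnError,
	})
	http.Handle("/metrics", handler)
	stdlog.Fatal(http.ListenAndServe("127.0.0.1:9100", nil))
}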
same problem
Yeah, this is still an open issue. Feel free to implement Julius' suggestion:
If someone wants to contribute code for it, the most optimal solution to this problem would seem to be to group identically named metrics from different files into a single output group with just one HELP comment saying something like Metric read from <file-1> and <file-2>.
Ouch, I've also hit this, and would benefit from such metrics merging.