node_exporter icon indicating copy to clipboard operation
node_exporter copied to clipboard

textfile: the same metric occurring with different labels in two different prom files trigger error due to inconsistent help message in processFile

Open zmedico opened this issue 4 years ago • 14 comments

Host operating system: output of uname -a

Linux hostname 5.4.65 #1 SMP Tue Sep 15 21:22:36 UTC 2020 x86_64 AMD EPYC 7551P 32-Core Processor AuthenticAMD GNU/Linux

node_exporter version: output of node_exporter --version

# node_exporter --version
node_exporter, version 1.0.1 (branch: HEAD, revision: 3715be6ae899f2a9b9dbfd9c39f3e09a7bd4559f)
  build user:       root@1f76dbbcfa55
  build date:       20200616-12:44:12
  go version:       go1.14.4

node_exporter command line flags

node_exporter -- --collector.textfile.directory=/var/lib/node_exporter/ --collector.bonding --collector.buddyinfo --collector.ntp --no-collector.infiniband --no-collector.nfs --no-collector.nfsd --no-collector.wifi --no-collector.zfs --web.listen-address=127.0.0.1:9100

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

Write two different prom files to the textfile collector directory, containing metrics with the same names but different labels. For example:

$ cat /var/lib/node_exporter/app-instance-1.prom
# TYPE app_metric_name counter
app_metric_name{instance="1"} 0
$ cat /var/lib/node_exporter/app-instance-2.prom
# TYPE app_metric_name counter
app_metric_name{instance="2"} 0

What did you expect to see?

Expected to collect metrics with same names but different labels from both prom files.

What did you see instead?

Metrics were successfully collected from the first prom file, and metrics from the second prom file were rejected by the prometheus.checkMetricConsistency function at prometheus/registry.go#L914

The inconsistency can be avoided by replacing the full path of the prom file with the parent directory name, like this:

diff --git a/collector/textfile.go b/collector/textfile.go
index 50c1807..f464d44 100644
--- a/collector/textfile.go
+++ b/collector/textfile.go
@@ -254,7 +254,7 @@ func (c *textFileCollector) processFile(name string, ch chan<- prometheus.Metric
 
 	for _, mf := range families {
 		if mf.Help == nil {
-			help := fmt.Sprintf("Metric read from %s", path)
+			help := fmt.Sprintf("Metric read from %s", c.path)
 			mf.Help = &help
 		}
 	}

The error was not logged, but I was able to log it with the following patch: * [from Gatherer #2] collected metric app_metric_name label:<name:"instance" value:"2" > counter:<value:0 > has help "Metric read from /var/lib/node_exporter/app-instance-2.prom" but should have "Metric read from /var/lib/node_exporter/app-instance-1.prom"

--- node_exporter-1.0.1/node_exporter.go
+++ node_exporter-1.0.1/node_exporter.go
@@ -31,6 +31,7 @@
 	"github.com/prometheus/node_exporter/collector"
 	"github.com/prometheus/node_exporter/https"
 	kingpin "gopkg.in/alecthomas/kingpin.v2"
+	stdlog "log"
 )
 
 // handler wraps an unfiltered http.Handler but uses a filtered handler,
@@ -121,6 +122,7 @@
 	handler := promhttp.HandlerFor(
 		prometheus.Gatherers{h.exporterMetricsRegistry, r},
 		promhttp.HandlerOpts{
+			ErrorLog:            stdlog.New(os.Stderr, "", stdlog.LstdFlags),
 			ErrorHandling:       promhttp.ContinueOnError,
 			MaxRequestsInFlight: h.maxRequests,
 			Registry:            h.exporterMetricsRegistry,

zmedico avatar Nov 14 '20 00:11 zmedico

Looks like this is the same root cause as https://github.com/prometheus/node_exporter/issues/1600.

zmedico avatar Nov 14 '20 04:11 zmedico

The inconsistency can be avoided by replacing the full path of the prom file with the parent directory name, like this:

Might be a good option. @SuperQ wdyt? And maybe @juliusv has an opinion on this.

discordianfish avatar Nov 24 '20 10:11 discordianfish

IIRC, @SuperQ previously said that the use case of having the same metric in two files was explicitly not supported.

The directory name wouldn't be much use, as it'd be the same across all metrics so we'd no longer be defaulting in a value to help you figure out which metric came from which file.

brian-brazil avatar Nov 24 '20 10:11 brian-brazil

Makes sense, but I still like to support the use case of having two separate processes report the same metric via the textfile exporter. If they'd need to write the same metrics file, they need to coordinate/merge their state somehow which seems annoying.. Any alternative suggestions?

discordianfish avatar Nov 24 '20 10:11 discordianfish

If someone wants to contribute code for it, the most optimal solution to this problem would seem to be to group identically named metrics from different files into a single output group with just one HELP comment saying something like Metric read from <file-1> and <file-2>.

juliusv avatar Nov 24 '20 10:11 juliusv

Any alternative suggestions?

If both files contain the same explicit HELP then this logic doesn't apply, and it'll all work.

brian-brazil avatar Nov 24 '20 11:11 brian-brazil

This is true, but it would be even better to make things work automatically even if the user forgot to provide their own HELP strings. Especially since this is not an error that you notice on startup, so for many people this will be a silent error that they don't notice unless they're super ontop of their monitoring.

juliusv avatar Nov 24 '20 13:11 juliusv

That's partially due to the ContinueOnError, which is a dangerous setting as it hides failures like this in a non-deterministic way.

brian-brazil avatar Nov 24 '20 14:11 brian-brazil

Agreed, the Node Exporter sets ContinueOnError here: https://github.com/prometheus/node_exporter/blob/202ecf9c9d1d1960cc9cac24838d13e9cff5edca/node_exporter.go#L124

That seems dangerous, but I'm not deep enough into the Node Exporter to say what kinds of things it would break if we set that to e.g. HTTPErrorOnError instead.

juliusv avatar Nov 24 '20 14:11 juliusv

Basically if there's a problem the whole scrape would fail with HTTPErrorOnError, rather than effectively silently failing in a non-deterministic way like it current does.

brian-brazil avatar Nov 24 '20 15:11 brian-brazil

Yes, that much is clear. I was just wondering if that would be too problematic to change it now. I was wondering why ContinueOnError was initially introduced into the Node Exporter, and it was in https://github.com/prometheus/node_exporter/commit/dace41e3d4c3dfc3a52fa73fcba322082dcce22a, explicitly for the textfile collector.

juliusv avatar Nov 24 '20 15:11 juliusv

same problem

lymweb avatar May 31 '21 09:05 lymweb

Yeah this is still an open issue. Feel free to implement Julius suggestion:

If someone wants to contribute code for it, the most optimal solution to this problem would seem to be to group identically named metrics from different files into a single output group with just one HELP comment saying something like Metric read from and .

discordianfish avatar Jun 08 '21 08:06 discordianfish

Ouch, I also have hit that, and would benefit from such metrics merging.

wkhere avatar Jan 27 '22 01:01 wkhere