
Incompatibility with collectd Plugin:ConnTrack

vbatoufflet opened this issue on Jan 26, 2018 · 11 comments

Hi,

When exporting data from a collectd instance having the ConnTrack plugin activated, I get this error:

# curl -s -D /dev/stderr localhost:9103/metrics
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 26 Jan 2018 15:03:48 GMT
Content-Length: 292

An error has occurred during metrics collection:

collected metric collectd_conntrack label:<name:"conntrack" value:"max" > label:<name:"instance" value:"gw1.example.net" > gauge:<value:256000 >  has label dimensions inconsistent with previously collected metrics in the same metric family

It seems that there is a mismatch between the metrics received over time, causing the exporter to stop functioning properly.

Any idea of something obvious I missed here?

Kind regards, Vincent

vbatoufflet avatar Jan 26 '18 17:01 vbatoufflet

I have exactly the same issue when collecting data from LEDE routers. Unfortunately I did not have the time to investigate this further, but at first glance it seems that the problem is located in client_golang and not in collectd_exporter itself.

There is also a similar issue reported on telegraf: https://github.com/influxdata/telegraf/issues/2822

In the meantime the only workaround is to use the previous release. If using docker, pull prom/collectd-exporter:0.3.1 which works.

fmoessbauer avatar Jan 27 '18 10:01 fmoessbauer

The reason is a change in the library. This has been discussed in various places, and the conclusion was that it is correct not to accept inconsistent label dimensions. What is happening is that, for some reason, metrics like

metric_name{label1="value1"}
metric_name{label1="value1",label2="value2"}

are being generated. This is illegal, but this was not enforced in the past. The correct way is to pad all metrics to have the same set of label names, using empty values where there is nothing to put in a particular label. The above should be

metric_name{label1="value1"}
metric_name{label1="value1",label2="value2"}

How to do this best is somewhat dependent on the source of the metrics, so the client library cannot do this for you automatically (so it doesn't try).
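
For illustration, here is a minimal, hypothetical client_golang sketch (not the exporter's actual code) of a custom collector that pads the missing label with an empty value so both samples share the same label dimensions:

package main

import "github.com/prometheus/client_golang/prometheus"

// Declare both label names up front so every sample in the family carries the
// same label dimensions.
var desc = prometheus.NewDesc(
    "metric_name",
    "Example metric family with consistent label dimensions.",
    []string{"label1", "label2"},
    nil,
)

type paddedCollector struct{}

func (paddedCollector) Describe(ch chan<- *prometheus.Desc) { ch <- desc }

func (paddedCollector) Collect(ch chan<- prometheus.Metric) {
    // Sample that only has label1: pad label2 with an empty value.
    ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 1, "value1", "")
    // Sample that has both labels.
    ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 2, "value1", "value2")
}

func main() {
    prometheus.MustRegister(paddedCollector{})
}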

matthiasr avatar Jan 27 '18 17:01 matthiasr

@beorn7 is there an FAQ entry or something for this?

matthiasr avatar Jan 27 '18 17:01 matthiasr

What client_golang implements here is in the spec: https://prometheus.io/docs/instrumenting/exposition_formats/#protocol-buffer-format-details

One might argue only advanced use cases even run into that problem (if you hand-craft your metrics, like in this case where you have to mirror them from other metrics).

I'll have an ingenious(?) plan to provide an elegant way to deal with this use case, which will be the fix of both https://github.com/prometheus/client_golang/issues/355 and https://github.com/prometheus/client_golang/issues/47 . It's just a few SAT days away.

In different news, what @matthiasr meant by "the above should be" should actually be

metric_name{label1="value1",label2=""}
metric_name{label1="value1",label2="value2"}

beorn7 avatar Jan 29 '18 12:01 beorn7

Thanks for your replies.

Indeed, it seems that collectd's conntrack plugin sends the conntrack metric sometimes with its max value and other times with the current one.

I'll try to filter out this metric for now to prevent failures on the instance hosting this exporter.

Is it possible to discard the faulty metrics so that the whole scrape doesn't fail when Prometheus requests data from the exporter?

vbatoufflet avatar Jan 29 '18 13:01 vbatoufflet

You can try your luck with ContinueOnError; see https://godoc.org/github.com/prometheus/client_golang/prometheus/promhttp#HandlerErrorHandling . But I'm not sure off the top of my head whether that does what you want.
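
For reference, a minimal sketch of what that wiring could look like (assuming the default gatherer and port; the exporter's actual handler setup may differ):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // ContinueOnError serves whatever metrics could be gathered instead of
    // failing the whole scrape with a 500; gathering errors can additionally
    // be logged by setting HandlerOpts.ErrorLog.
    http.Handle("/metrics", promhttp.HandlerFor(
        prometheus.DefaultGatherer,
        promhttp.HandlerOpts{ErrorHandling: promhttp.ContinueOnError},
    ))
    log.Fatal(http.ListenAndServe(":9103", nil))
}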

beorn7 avatar Jan 29 '18 14:01 beorn7

Following your input, I was able to make the exporter work despite such metric issues.

I made a PR (see #57) for that, hoping it'll match the requirements here.

vbatoufflet avatar Jan 29 '18 16:01 vbatoufflet

I've been hit hard by this problem today. It took me a while to understand why the health check was restarting a pod that had previously been working. The problem is that the exporter works, then it receives badly formatted data, and from that point on the HTTP endpoint returns a 500 status code with errors. I can understand the rationale for failing hard instead of being resilient in the generic case, but this is an application that collects data: it should be able to handle some bad input without breaking, and signal the issue to the application maintainer.

A better approach, in my opinion, would be to log the badly formatted (and discarded) metrics.

Let me explain: this application receives thousands of metrics from tens of servers (in my case at least), and if someone somewhere at some point sends something wrong, the whole application returns error 500 without printing any metrics. As collateral damage, this triggered the pod livenessProbe (which I have now replaced with a plain TCP socket check). That is very strange behaviour for an application that is alive but in a state incompatible with the default Kubernetes httpGet liveness probe, which treats error 500 as a failure of the whole application.

(If your intention is not to restart the whole application when such an error occurs, which does not depend on this application but on external input data, then this application is incompatible with an httpGet livenessProbe, which is counter-intuitive behaviour for an HTTP endpoint.)

The bad metrics could at least be output as comments; instead, the endpoint returns no data to the scraper, even though thousands of metrics are fine.

In my opinion, your approach is like a database crashing when it receives a malformed input query.

My suggestion is to log an ERROR in the application logs, which could perhaps be downgraded to a WARN with a command-line flag (something like --wrong-input-warn).

thanks, Francesco

fvigotti avatar Sep 06 '18 15:09 fvigotti

@fvigotti: I totally agree with your statement. Ill-formed input from a single client should definitely not result in a total crash. Otherwise this exporter is simply not suited for large-scale deployments.

Conceptually speaking, the incoming data has to be validated at the ingress API, and only valid metrics should be passed on to later processing steps. The database comparison illustrates this perfectly.

fmoessbauer avatar Sep 10 '18 16:09 fmoessbauer

While I agree that there shouldn't be a way to DoS the collectd exporter like this, and that this needs to be fixed in both collectd and collectd_exporter, I'll document a workaround to add to your collectd.conf for those in trouble for now.

LoadPlugin match_regex
LoadPlugin target_set

<Chain "PreCache">
  <Rule "CONNTRACK-FIX-TYPE">
    <Match "regex">
      Plugin                    "^conntrack$"
      TypeInstance              "^$"
    </Match>
    <Target "set">
      TypeInstance              "current"
    </Target>
  </Rule>

  <Target "write">
  </Target>
</Chain>

hasso avatar Sep 12 '18 07:09 hasso

Note https://github.com/prometheus/client_golang/pull/417 . In short: after much back and forth, it was decided that inconsistent label dimensions are actually fine. (The library will still only expose valid metric data; it's just that "missing labels" are now considered valid.)

beorn7 avatar Sep 13 '18 23:09 beorn7