failed to parse "system_uptime_format"

Open quinten-lp opened this issue 11 months ago • 14 comments

Hi, here is an issue I ran into after updating to the latest version of Telegraf.

Relevant telegraf.conf

Here is the Telegraf configuration file:

[[outputs.prometheus_client]]
    listen = "0.0.0.0:9126"
[[outputs.http]]
    url = "https://<url>/api/v1/write"
    data_format = "prometheusremotewrite"
    tls_ca = "/etc/monitoring/ca.crt"
    tls_cert = "/etc/monitoring/cert.crt"
    tls_key = "/etc/monitoring/cert.key"
    timeout = "30s"
    [outputs.http.headers]
    Content-Type = "application/x-protobuf"
    Content-Encoding = "snappy"
    X-Prometheus-Remote-Write-Version = "0.1.0"

[[inputs.cpu]]
    percpu = true
[[inputs.disk]]
    ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
[[inputs.diskio]]
[[inputs.mem]]
[[inputs.system]]
[[inputs.swap]]
[[inputs.netstat]]
[[inputs.nstat]]
[[inputs.kernel]]
[[inputs.processes]]
[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]

Logs from Telegraf

Here are the logs:

2024-12-30T14:27:39Z E! [serializers.prometheusremotewrite::http] some series were dropped, 555 series left to send; last recorded error: failed to parse "system_uptime_format": bad sample value " 3:51"
2024-12-30T14:27:49Z E! [serializers.prometheusremotewrite::http] some series were dropped, 612 series left to send; last recorded error: failed to parse "system_uptime_format": bad sample value " 3:51"
2024-12-30T14:27:59Z E! [serializers.prometheusremotewrite::http] some series were dropped, 612 series left to send; last recorded error: failed to parse "system_uptime_format": bad sample value " 3:51"

System info

Telegraf 1.33.0, Debian 11 and Debian 12

Docker

No response

Steps to reproduce

  1. Install telegraf 1.33.0
  2. Run telegraf service (or telegraf --debug command line)
  3. Check the logs ...

Expected behavior

Telegraf runs without error logs.

Actual behavior

Installing and running Telegraf works, and sending metrics to Prometheus works, but I get a lot of errors in my logs like this:

[serializers.prometheusremotewrite::http] some series were dropped, 612 series left to send; last recorded error: failed to parse "system_uptime_format": bad sample value " 3:52"

Additional info

No errors with version 1.32.1.

quinten-lp avatar Dec 30 '24 14:12 quinten-lp

I suspect it has to do with the space at the beginning of the sample value, but it is very annoying.

jfinkhaeuser avatar Jan 05 '25 15:01 jfinkhaeuser

I can confirm that this issue exists in 1.33.0, but not in 1.32.3. I just downgraded to 1.32.3-1 of the influxdata apt packages, and do not have the issue there.

jfinkhaeuser avatar Jan 05 '25 15:01 jfinkhaeuser

This issue is also observed with the HAProxy input plugin:

E! [serializers.prometheusremotewrite::http] some series were dropped, 4842 series left to send; last recorded error: failed to parse "haproxy_status": bad sample value "UP"

m-wack avatar Jan 15 '25 12:01 m-wack

We are also seeing this with more "normal" floating values, like here:

2025-02-06T00:00:03Z E! [serializers.prometheusremotewrite::http] some series were dropped, 1 series left to send; last recorded error: failed to parse "{PLACEHOLDER}": bad sample value "1.4545500000000000"

m-wack avatar Feb 06 '25 08:02 m-wack

Could someone please add an [[outputs.file]] output and provide some metrics causing the issue? This way I can reproduce and debug the issue on my side... We updated the prometheus package between those versions, and it seems it had some surprises for us. ;-)

srebhan avatar Feb 26 '25 20:02 srebhan

Hi, here is an example:

system,host=instance.example.org uptime=11388i 1740671800000000000
system,host=instance.example.org uptime_format=" 3:09" 1740671800000000000
system,host=instance.example.org uptime=11398i 1740671810000000000
system,host=instance.example.org uptime_format=" 3:09" 1740671810000000000
system,host=instance.example.org uptime=11408i 1740671820000000000
system,host=instance.example.org uptime_format=" 3:10" 1740671820000000000
system,host=instance.example.org uptime=11418i 1740671830000000000
system,host=instance.example.org uptime_format=" 3:10" 1740671830000000000
system,host=instance.example.org uptime=11428i 1740671840000000000
system,host=instance.example.org uptime_format=" 3:10" 1740671840000000000
system,host=instance.example.org uptime=11438i 1740671850000000000
system,host=instance.example.org uptime_format=" 3:10" 1740671850000000000
system,host=instance.example.org uptime=11448i 1740671860000000000
system,host=instance.example.org uptime_format=" 3:10" 1740671860000000000
system,host=instance.example.org uptime=11458i 1740671870000000000
system,host=instance.example.org uptime_format=" 3:10" 1740671870000000000
system,host=instance.example.org uptime=11468i 1740671880000000000
system,host=instance.example.org uptime_format=" 3:11" 1740671880000000000
system,host=instance.example.org uptime=11478i 1740671890000000000
system,host=instance.example.org uptime_format=" 3:11" 1740671890000000000
system,host=instance.example.org uptime=11488i 1740671900000000000
system,host=instance.example.org uptime_format=" 3:11" 1740671900000000000
system,host=instance.example.org uptime=11498i 1740671910000000000
system,host=instance.example.org uptime_format=" 3:11" 1740671910000000000
system,host=instance.example.org uptime=11508i 1740671920000000000
system,host=instance.example.org uptime_format=" 3:11" 1740671920000000000
system,host=instance.example.org uptime=11518i 1740671930000000000
system,host=instance.example.org uptime_format=" 3:11" 1740671930000000000
system,host=instance.example.org uptime=11528i 1740671940000000000
system,host=instance.example.org uptime_format=" 3:12" 1740671940000000000
system,host=instance.example.org uptime=11538i 1740671950000000000
system,host=instance.example.org uptime_format=" 3:12" 1740671950000000000
system,host=instance.example.org uptime=11548i 1740671960000000000
system,host=instance.example.org uptime_format=" 3:12" 1740671960000000000
system,host=instance.example.org uptime=11558i 1740671970000000000
system,host=instance.example.org uptime_format=" 3:12" 1740671970000000000
system,host=instance.example.org uptime=11568i 1740671980000000000
system,host=instance.example.org uptime_format=" 3:12" 1740671980000000000
system,host=instance.example.org uptime=11578i 1740671990000000000
system,host=instance.example.org uptime_format=" 3:12" 1740671990000000000
system,host=instance.example.org uptime=11588i 1740672000000000000
system,host=instance.example.org uptime_format=" 3:13" 1740672000000000000
system,host=instance.example.org uptime=11598i 1740672010000000000
system,host=instance.example.org uptime_format=" 3:13" 1740672010000000000
system,host=instance.example.org uptime=11608i 1740672020000000000
system,host=instance.example.org uptime_format=" 3:13" 1740672020000000000
system,host=instance.example.org uptime=11618i 1740672030000000000
system,host=instance.example.org uptime_format=" 3:13" 1740672030000000000
system,host=instance.example.org uptime=11628i 1740672040000000000
system,host=instance.example.org uptime_format=" 3:13" 1740672040000000000
system,host=instance.example.org uptime=11638i 1740672050000000000
system,host=instance.example.org uptime_format=" 3:13" 1740672050000000000
system,host=instance.example.org uptime=11648i 1740672060000000000
system,host=instance.example.org uptime_format=" 3:14" 1740672060000000000

My output config was:

[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["stdout", "/tmp/metrics.out"]
  data_format = "influx"

My telegraf version is Telegraf 1.32.3 (git: HEAD@2fd5bf4f)

quinten-lp avatar Feb 27 '25 16:02 quinten-lp

Can you test again with

[[inputs.system]]
  fieldexclude = ["uptime_format"]

That should do the trick, as Prometheus can't handle string fields.
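
If the inputs themselves cannot be edited, the same filter can in principle be placed on the output side instead, since metric filters such as fieldexclude are generic plugin options in Telegraf. A minimal sketch, assuming the [[outputs.http]] block from the original report; the field list is illustrative and should be adapted to whatever the error messages name:

```toml
[[outputs.http]]
  url = "https://<url>/api/v1/write"
  data_format = "prometheusremotewrite"
  ## Assumption: drop the string field only for this output. fieldexclude
  ## matches field names (globs allowed) regardless of the measurement.
  ## Keep this line above any [outputs.http.headers] sub-table.
  fieldexclude = ["uptime_format"]
```

With the filter on the output, other outputs (for example a file output using the influx format) would still receive the string field.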

Hipska avatar Mar 14 '25 16:03 Hipska

The issue is that this is happening with other outputs as well, not just uptime_format, and that it is entirely resolved by downgrading to 1.32, so there seems to have been a regression somewhere in 1.33.

m-wack avatar Mar 24 '25 07:03 m-wack

The reason for this error message is this pull request, which was introduced in v1.33.0: https://github.com/influxdata/telegraf/pull/15893

It would be nice to have an option to ignore them.

jryberg avatar Apr 17 '25 12:04 jryberg

@Hipska, since this is affecting many different inputs, shouldn't this change be reverted or made optional? The output from Telegraf is spamming our environments, and I cannot control any of the inputs that are generating the errors, so I cannot mitigate this issue.

The title of this issue could be changed to "Many bundled inputs are not fully compatible with prometheusremotewrite".

jryberg avatar Apr 23 '25 08:04 jryberg

Another option is changing loglevel for your prometheus output.

The inputs cannot know to which outputs their metrics will be sent, so processing in between should make them compatible.
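
One way to do that processing in between is to map known string states to numbers before they reach the serializer. A minimal sketch with the enum processor, assuming the offending field is the status field of the haproxy measurement (inferred from the haproxy_status error earlier in this thread) and that only UP and DOWN matter. The option names are written as I remember them from the processor's README and may be spelled differently (e.g. field instead of fields) depending on the release:

```toml
[[processors.enum]]
  ## Assumption: only touch metrics from the haproxy input.
  namepass = ["haproxy"]

  [[processors.enum.mapping]]
    ## Hypothetical field name, taken from the "haproxy_status" error.
    fields = ["status"]
    ## Value used for any status not listed below.
    default = -1

    [processors.enum.mapping.value_mappings]
      UP = 1
      DOWN = 0
```

The mapped field is numeric, so a serializer that cannot represent strings has nothing to drop for it.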

Hipska avatar Apr 23 '25 10:04 Hipska

Another option is changing loglevel for your prometheus output.

The inputs cannot know to which outputs their metrics will be sent, so processing in between should make them compatible.

It's logging it as errors: https://github.com/influxdata/telegraf/blob/bcea4c278e0b066ba418a58e9e38a89b99105e34/plugins/serializers/prometheusremotewrite/prometheusremotewrite.go#L202

I cannot set it to any lower severity.

jryberg avatar Apr 23 '25 11:04 jryberg

The only other way is to make sure no string fields exist in your metrics, or to add them as labels.
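
Adding a string field as a label could look like the sketch below, which uses the converter processor to turn a field into a tag. The measurement and field names are assumptions taken from the haproxy_status error reported above. This only helps for metrics that still carry other numeric fields; a metric whose only field is the string (as with uptime_format in the sample output) would be left without any field to send.

```toml
[[processors.converter]]
  ## Assumption: only touch metrics from the haproxy input.
  namepass = ["haproxy"]

  [processors.converter.fields]
    ## Hypothetical field name; it becomes a tag, i.e. a Prometheus label,
    ## on the remaining numeric fields of the same metric.
    tag = ["status"]
```

Fields with many distinct values are a poor fit for this approach, since every new value creates a new series on the Prometheus side.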

I agree those logs should not be at Error level, Warning might have been better.

Hipska avatar Apr 23 '25 12:04 Hipska

The only other way is to make sure no string fields exist in your metrics, or to add them as labels.

I agree those logs should not be at Error level, Warning might have been better.

I will make a pull request to have the log level changed to warning and then I can change log_level to "error" to hide them for now.
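
A sketch of that setting, assuming a Telegraf release that supports the per-plugin log_level option and in which these messages are logged at warning level. Whether the serializer's messages follow the level set on the output is also an assumption here:

```toml
[[outputs.http]]
  url = "https://<url>/api/v1/write"
  data_format = "prometheusremotewrite"
  ## Assumption: only messages at error level are logged for this plugin,
  ## which hides the "some series were dropped" warnings.
  log_level = "error"
```

This hides the messages but does not change which fields are dropped.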

However, it's not only custom metrics that are affected. The same issue applies to Telegraf input plugins such as redis, so this is a very broad issue that affects many different inputs.

jryberg avatar Apr 23 '25 12:04 jryberg

It's logging it as errors:

@jfinkhaeuser since v1.34.4 the log-level is warning (see PR #16865) so you don't need a PR. ;-)

srebhan avatar Jul 02 '25 12:07 srebhan

However, it's not only custom metrics that are affected. The same issue applies to Telegraf input plugins such as redis, so this is a very broad issue that affects many different inputs.

Well, this affects many inputs with Prometheus serialization, and ONLY serializations that cannot deal with strings. This is a serialization format limitation! We will certainly not remove string fields just because some random output cannot deal with them! The warning is there to notify users that some fields are missing. We can accept a PR to output a log line only once per field, but the warning is there for a reason, i.e. to allow users to find their "missing" fields and get a clue about what's going on...

srebhan avatar Jul 02 '25 12:07 srebhan

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem. If not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!

telegraf-tiger[bot] avatar Jul 16 '25 18:07 telegraf-tiger[bot]