blackbox_exporter icon indicating copy to clipboard operation
blackbox_exporter copied to clipboard

Log spam

Open Jakob3xD opened this issue 10 months ago • 11 comments

Hello, we updated our blackbox exporter and are now getting a lot of log messages due to the current implementation of slog introduced with https://github.com/prometheus/blackbox_exporter/pull/1311 as it removes the default none logging option.

We have tens of thousands of probes running each minute. Some of them are expected to fail but will cause the logger to print an error message. We are now producing over 14k of logs per second.

I can see two possible solutions for this where I prefer the first one.

1. Adjust the log level for some message

Currently, a probe failure is logged with level ERROR. IMO this should be level DEBUG (or WARN). My interpretation of level ERROR is that blackbox exporter itself is having an issue. Like failed to read config file, failed to unmarshal config file, failed binding to port. Things that are real issues. A failing probe can have different reasons, but does not affect the blackbox exporter.

2. Add back the none logging option

Add the none option by writing logs to io.Discard like in NewNopLogger(). Instead, it would also be possible to add a flag to define where to write the logs. For example stderr, stdout, /dev/null.

Jakob3xD avatar Mar 04 '25 10:03 Jakob3xD

I just noticed that too, after discovering a shocking 160 million log lines generated (and ingested into Loki) in a mere week.

Somewhat surprised that there is no way to revert this behavior so have opted in for pinning the previous version of the chart for now.

I understand the motivation behind the logging library change but that should be a breaking change (or a massive warning in the release notes) if you ask me.

mateuszdrab avatar Mar 15 '25 12:03 mateuszdrab

CC @tjhop care to look into this one?

SuperQ avatar Mar 15 '25 13:03 SuperQ

@Jakob3xD I agree with your interpretation of where error should be used. I'm hesitant to re-introduce the none option again since it was "non-standard" from the accepted level values/promlog package usage in other prometheus projects. I say that in reference to the --log.level flag mostly, since the --log.prober flag is unique to this exporter.

By defaulting the flag to an empty string and improving how things are handled when unset, I believe I was able to restore the previously expected behavior. As mentioned in the linked PR:

TL;DR: Leave --log.prober flag unset and it won't do scrape-level logging

Opinions?

tjhop avatar Mar 18 '25 04:03 tjhop

I agree, I don't think we want to re-introduce the none mode.

Leave --log.prober flag unset and it won't do scrape-level logging

I think this isn't working as intended. The default value for --log.prober is info. But even when I change --log.prober=debug, I'm still seeing it log probes as info level.

Also, the default value for --log.prober should be Debug().

SuperQ avatar Mar 18 '25 07:03 SuperQ

I think this isn't working as intended. The default value for --log.prober is info. But even when I change --log.prober=debug, I'm still seeing it log probes as info level.

Am I misunderstanding what the purpose of the --log.prober flag is supposed to do? If I checkout v0.25.0 to look at behavior for the last release and run with --log.level=debug --log.prober=debug, I still see the probe logs output at info level as well:

ts=2025-03-25T03:42:24.214Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Beginning probe" probe=http timeout_seconds=4.5
ts=2025-03-25T03:42:24.214Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Resolving target address" target=prometheus.io ip_protocol=ip4
ts=2025-03-25T03:42:24.215Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Resolved target address" target=prometheus.io ip=104.21.60.220
ts=2025-03-25T03:42:24.215Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Making HTTP request" url=https://104.21.60.220 host=prometheus.io
ts=2025-03-25T03:42:24.275Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Received HTTP response" status_code=200
ts=2025-03-25T03:42:24.275Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Response timings for roundtrip" roundtrip=0 start=2025-03-24T23:42:24.215637413-04:00 dnsDone=2025-03-24T23:42:24.215637413-04:00 connectDone=2025-03-24T23:42:24.227121089-04:00 gotConn=2025-03-24T23:42:24.24403057-04:00 responseStart=2025-03-24T23:42:24.275497629-04:00 tlsStart=2025-03-24T23:42:24.227146156-04:00 tlsDone=2025-03-24T23:42:24.243946663-04:00 end=2025-03-24T23:42:24.275811957-04:00
ts=2025-03-25T03:42:24.276Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Probe succeeded" duration_seconds=0.061779807

tjhop avatar Mar 25 '25 03:03 tjhop

Also, the default value for --log.prober should be Debug().

I can do that. Should the behavior of --log.prober be dependent on --log.level, then?

tjhop avatar Mar 25 '25 03:03 tjhop

I think this isn't working as intended. The default value for --log.prober is info. But even when I change --log.prober=debug, I'm still seeing it log probes as info level.

Also, the default value for --log.prober should be Debug().

isn't debug mode generally a 'catch all'? If I set this flag to 'warn' it stops the info logging for sure. this is unique to v0.26.0 release. I would think default should not be info.

grimz-ly avatar Apr 02 '25 23:04 grimz-ly

@SuperQ can you clarify for us what the expected behaviour of --log.prober is. As far as I understand, it is similar to --log.level as it defines what log level should be printed for probe logs.

From your comment, I would assume that you expect the --log.probe to set to which level the probe logs should be sent. Meaning --log.probe set to debug results in all probe logs on debug level which then should not be printed if the --log.level is set to info?

If this is correct, then this would IMO not be a good solution for the problem, as we lose the use of levels for probe logs. Specially because those levels are requested by other issues, see #1401.

From what I have read so far, I see two solutions.

  1. Adjust the log levels of probe logs to not be higher as WARN if they are not real issues and keep the --log.prober argument so when setting it to level error we don't see those probe logs.
  2. Similar to 1 but drop the --log.prober argument and only use the --log.level argument.

Option 1 allows a finer filtering of logs as the exporter logs and probe logs are separated where option 2 removes the confusing extra argument.

Jakob3xD avatar Apr 24 '25 07:04 Jakob3xD

Yea, I don't know what the original intention of --log.prober was supposed to look like.

I feel like it was just hacked in based on how the old logging library worked.

SuperQ avatar Apr 24 '25 14:04 SuperQ

After some more thought and discussion with @SuperQ on slack, I updated #1381 based on our current best understanding of how the --log.prober flag was intended to behave. It now sets the value that probe scrape logs will be emitted at. Probe logs default to debug level, so they will be not be seen unless A) the log level is increased or B) the probe log level is decreased.

More details in the PR, feedback welcome 👍

tjhop avatar May 08 '25 04:05 tjhop

me and @tjhop discussed the --log.level and --log.prober a bit more in the CNCF slack in #prometheus-exporters-dev.

I am summarising the discussion and looking for input/feedback from others in this thread.

Right now, --log.level and --log.prober flags are coupled, and if you want probe logs at debug, you also have to set log.level=debug

This is a bit unintuitive and has been a source for confusion around the logs. @tjhop hashed out the details and made a nice railroad style diagram to describe the current behavour more clearly in this comment

we are proposing that we should decouple these and have two independent loggers.

  • --log.level would control just the application logs.
  • --log.prober would control just the probe logs.

old behavior (current):

  • main logger writes to stderr
  • probe logger is wrapper logger that writes to main logger
  • probe logger writes at stderr at static level

new behavior (proposed):

  • main logger writes to stderr
  • probe logger is second logger that writes to stderr
  • probe logger is used as a standard leveled logger to write to various levels

It would a breaking change because it's a change in in behavior, but I want to make a case for this breaking change because it makes these flags simple to understand for us and our end users, and it also gives our users more flexibility.

because It would be a breaking change in behavior, we’d want to call it out clearly in the release notes and other places so users who are upgrading know about this change, and users who want the old behavior can adjust he config get the old behavior.

cc @SuperQ @mem @roidelapluie

electron0zero avatar Jul 18 '25 13:07 electron0zero

@electron0zero we have more requests for greater flexibility/control over the probe logs in other threads as well, such as here: https://github.com/prometheus/blackbox_exporter/issues/1401#issuecomment-3153121472

cc: @winadm87

tjhop avatar Aug 11 '25 18:08 tjhop

yeah, I feel like we should decouple these two flags, unless someone strongly objects to it. I think we have given enough time for feedback.

I would be happy to find some spare cycles to to review the PR.

electron0zero avatar Aug 12 '25 12:08 electron0zero

closing this because https://github.com/prometheus/blackbox_exporter/pull/1461 is merged.

next release will include that change, and it should give more control over probe logs and program logs.

electron0zero avatar Oct 29 '25 12:10 electron0zero

@tjhop Thank you for the work and effort. 🥳

Jakob3xD avatar Oct 29 '25 15:10 Jakob3xD