Log spam
Hello,
we updated our blackbox exporter and are now getting a lot of log messages due to the current implementation of slog introduced with https://github.com/prometheus/blackbox_exporter/pull/1311 as it removes the default none logging option.
We have tens of thousands of probes running each minute. Some of them are expected to fail but will cause the logger to print an error message. We are now producing over 14k of logs per second.
I can see two possible solutions for this where I prefer the first one.
1. Adjust the log level for some message
Currently, a probe failure is logged with level ERROR. IMO this should be level DEBUG (or WARN). My interpretation of level ERROR is that blackbox exporter itself is having an issue. Like failed to read config file, failed to unmarshal config file, failed binding to port. Things that are real issues. A failing probe can have different reasons, but does not affect the blackbox exporter.
2. Add back the none logging option
Add the none option by writing logs to io.Discard like in NewNopLogger().
Instead, it would also be possible to add a flag to define where to write the logs. For example stderr, stdout, /dev/null.
I just noticed that too, after discovering a shocking 160 million log lines generated (and ingested into Loki) in a mere week.
Somewhat surprised that there is no way to revert this behavior so have opted in for pinning the previous version of the chart for now.
I understand the motivation behind the logging library change but that should be a breaking change (or a massive warning in the release notes) if you ask me.
CC @tjhop care to look into this one?
@Jakob3xD I agree with your interpretation of where error should be used. I'm hesitant to re-introduce the none option again since it was "non-standard" from the accepted level values/promlog package usage in other prometheus projects. I say that in reference to the --log.level flag mostly, since the --log.prober flag is unique to this exporter.
By defaulting the flag to an empty string and improving how things are handled when unset, I believe I was able to restore the previously expected behavior. As mentioned in the linked PR:
TL;DR: Leave --log.prober flag unset and it won't do scrape-level logging
Opinions?
I agree, I don't think we want to re-introduce the none mode.
Leave --log.prober flag unset and it won't do scrape-level logging
I think this isn't working as intended. The default value for --log.prober is info. But even when I change --log.prober=debug, I'm still seeing it log probes as info level.
Also, the default value for --log.prober should be Debug().
I think this isn't working as intended. The default value for --log.prober is info. But even when I change --log.prober=debug, I'm still seeing it log probes as info level.
Am I misunderstanding what the purpose of the --log.prober flag is supposed to do? If I checkout v0.25.0 to look at behavior for the last release and run with --log.level=debug --log.prober=debug, I still see the probe logs output at info level as well:
ts=2025-03-25T03:42:24.214Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Beginning probe" probe=http timeout_seconds=4.5
ts=2025-03-25T03:42:24.214Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Resolving target address" target=prometheus.io ip_protocol=ip4
ts=2025-03-25T03:42:24.215Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Resolved target address" target=prometheus.io ip=104.21.60.220
ts=2025-03-25T03:42:24.215Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Making HTTP request" url=https://104.21.60.220 host=prometheus.io
ts=2025-03-25T03:42:24.275Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Received HTTP response" status_code=200
ts=2025-03-25T03:42:24.275Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Response timings for roundtrip" roundtrip=0 start=2025-03-24T23:42:24.215637413-04:00 dnsDone=2025-03-24T23:42:24.215637413-04:00 connectDone=2025-03-24T23:42:24.227121089-04:00 gotConn=2025-03-24T23:42:24.24403057-04:00 responseStart=2025-03-24T23:42:24.275497629-04:00 tlsStart=2025-03-24T23:42:24.227146156-04:00 tlsDone=2025-03-24T23:42:24.243946663-04:00 end=2025-03-24T23:42:24.275811957-04:00
ts=2025-03-25T03:42:24.276Z caller=level.go:71 module=http_2xx target=https://prometheus.io level=info msg="Probe succeeded" duration_seconds=0.061779807
Also, the default value for --log.prober should be Debug().
I can do that. Should the behavior of --log.prober be dependent on --log.level, then?
I think this isn't working as intended. The default value for
--log.proberisinfo. But even when I change--log.prober=debug, I'm still seeing it log probes asinfolevel.Also, the default value for
--log.probershould beDebug().
isn't debug mode generally a 'catch all'? If I set this flag to 'warn' it stops the info logging for sure. this is unique to v0.26.0 release. I would think default should not be info.
@SuperQ can you clarify for us what the expected behaviour of --log.prober is. As far as I understand, it is similar to --log.level as it defines what log level should be printed for probe logs.
From your comment, I would assume that you expect the --log.probe to set to which level the probe logs should be sent. Meaning --log.probe set to debug results in all probe logs on debug level which then should not be printed if the --log.level is set to info?
If this is correct, then this would IMO not be a good solution for the problem, as we lose the use of levels for probe logs. Specially because those levels are requested by other issues, see #1401.
From what I have read so far, I see two solutions.
- Adjust the log levels of probe logs to not be higher as WARN if they are not real issues and keep the
--log.proberargument so when setting it to level error we don't see those probe logs. - Similar to 1 but drop the
--log.proberargument and only use the--log.levelargument.
Option 1 allows a finer filtering of logs as the exporter logs and probe logs are separated where option 2 removes the confusing extra argument.
Yea, I don't know what the original intention of --log.prober was supposed to look like.
I feel like it was just hacked in based on how the old logging library worked.
After some more thought and discussion with @SuperQ on slack, I updated #1381 based on our current best understanding of how the --log.prober flag was intended to behave. It now sets the value that probe scrape logs will be emitted at. Probe logs default to debug level, so they will be not be seen unless A) the log level is increased or B) the probe log level is decreased.
More details in the PR, feedback welcome 👍
me and @tjhop discussed the --log.level and --log.prober a bit more in the CNCF slack in #prometheus-exporters-dev.
I am summarising the discussion and looking for input/feedback from others in this thread.
Right now, --log.level and --log.prober flags are coupled, and if you want probe logs at debug, you also have to set log.level=debug
This is a bit unintuitive and has been a source for confusion around the logs. @tjhop hashed out the details and made a nice railroad style diagram to describe the current behavour more clearly in this comment
we are proposing that we should decouple these and have two independent loggers.
-
--log.levelwould control just the application logs. -
--log.proberwould control just the probe logs.
old behavior (current):
- main logger writes to stderr
- probe logger is wrapper logger that writes to main logger
- probe logger writes at stderr at static level
new behavior (proposed):
- main logger writes to stderr
- probe logger is second logger that writes to stderr
- probe logger is used as a standard leveled logger to write to various levels
It would a breaking change because it's a change in in behavior, but I want to make a case for this breaking change because it makes these flags simple to understand for us and our end users, and it also gives our users more flexibility.
because It would be a breaking change in behavior, we’d want to call it out clearly in the release notes and other places so users who are upgrading know about this change, and users who want the old behavior can adjust he config get the old behavior.
cc @SuperQ @mem @roidelapluie
@electron0zero we have more requests for greater flexibility/control over the probe logs in other threads as well, such as here: https://github.com/prometheus/blackbox_exporter/issues/1401#issuecomment-3153121472
cc: @winadm87
yeah, I feel like we should decouple these two flags, unless someone strongly objects to it. I think we have given enough time for feedback.
I would be happy to find some spare cycles to to review the PR.
closing this because https://github.com/prometheus/blackbox_exporter/pull/1461 is merged.
next release will include that change, and it should give more control over probe logs and program logs.
@tjhop Thank you for the work and effort. 🥳