
`[receiver/scraperhelper]`: Allow Scraper errors to be logged at `DEBUG` level

Open · braydonk opened this issue 3 years ago

**Is your feature request related to a problem? Please describe.**

Scrapers do not currently have any control over the level their errors are logged at. Sometimes, for example in the hostmetrics process scraper, a scraper encounters errors that ideally should not be logged at `ERROR` level. So far the solution has been to completely suppress one particularly problematic error, but nearly all errors can occur in situations where it would be preferable to keep them logged at `DEBUG` level rather than being limited to muting them entirely.

**Describe the solution you'd like**

(Note: names are illustrative for this issue and not necessarily final.) Add a new method to the `Scraper` interface, `Scraper.LogErrorsAtDebug()`, which returns a bool indicating whether the scraper's errors should be logged at `DEBUG` level. The `baseScraper` gets a new bool field `logErrorsAtDebug`. A new `ScraperOption` function, `WithDebugErrorLogs`, is exported from `scraperhelper` and returns a `ScraperOption` that sets `baseScraper.logErrorsAtDebug` to true. In `scrapercontroller.go`, `scrapeMetricsAndReport()` gains a check on `Scraper.LogErrorsAtDebug()` to decide whether zap should log a received scraper error at `DEBUG` or `ERROR` level; a rough sketch follows the draft link below.

See a first draft of the proposed solution here: https://github.com/braydonk/opentelemetry-collector/tree/ScraperErrorDebug/receiver/scraperhelper
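
For illustration, here is a minimal, self-contained Go sketch of the shape described above. The names follow the proposal; the real `scraperhelper` types and signatures differ, so treat this as a sketch rather than the actual API:

```go
package scrapersketch

// Sketch only: names mirror the proposal above, not the real scraperhelper API.

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// baseScraper stands in for scraperhelper's internal scraper wrapper.
type baseScraper struct {
	logErrorsAtDebug bool
}

// LogErrorsAtDebug reports whether this scraper's errors should be logged at DEBUG.
func (b *baseScraper) LogErrorsAtDebug() bool { return b.logErrorsAtDebug }

// ScraperOption mirrors the functional-option pattern used by scraperhelper.
type ScraperOption func(*baseScraper)

// WithDebugErrorLogs is the proposed option that downgrades error logging to DEBUG.
func WithDebugErrorLogs() ScraperOption {
	return func(b *baseScraper) { b.logErrorsAtDebug = true }
}

// logScrapeError shows the check scrapeMetricsAndReport would perform when a
// scraper returns an error.
func logScrapeError(logger *zap.Logger, s *baseScraper, err error) {
	level := zapcore.ErrorLevel
	if s.LogErrorsAtDebug() {
		level = zapcore.DebugLevel
	}
	if ce := logger.Check(level, "Error scraping metrics"); ce != nil {
		ce.Write(zap.Error(err))
	}
}
```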

**Describe alternatives you've considered**

  • Change the scrapercontroller to log every error at `DEBUG`
    • This would change the behavior of every existing scraper, so it is a fair bit more sweeping than the proposal above, which is opt-in. There is a good chance it is the right direction overall, though.
  • Provide a way to configure a particular log level
    • While a case could maybe be made for `WARN`, it seems unlikely that errors would ever need to be logged at anything other than `ERROR` or `DEBUG`.
  • Specifically fix the problem that inspired this request by making the hostmetrics process scraper mute all errors, similar to the first PR submitted to mitigate this problem
    • This wasn't the most desirable option even originally, since it completely mutes the error: whenever it is encountered, nothing is logged. That has the potential to hide real issues the user should be seeing, and extending it to the other errors in this scraper would perpetuate and exacerbate that technical debt.

**Additional context**

Here are two issues outlining the problem with the hostmetrics process scraper: #2758, #3004.

Someone suggested a fix to an upstream library (https://github.com/open-telemetry/opentelemetry-collector/issues/3004#issuecomment-953529116) to mitigate the original issues seen with the process scraper. That is still something I plan to look into, but it won't address the root of this issue: we're seeing many scenarios where processes that seem fine to scrape have some other problem (e.g. not having a "command") and generate similar `ERROR` spam every collection interval. The goal of this feature request is to provide a mechanism that all scrapers can use, rather than continuing to band-aid the particular scraper at issue.

braydonk avatar Mar 22 '22 13:03 braydonk

Thanks for creating the ticket! I was looking into this and have some other ideas about how to approach the problem. I'll assign it to myself unless you or someone else is willing to help.

dmitryax avatar Mar 23 '22 06:03 dmitryax

I would love to help, since I'm very invested in the outcome. It would be great to discuss what your ideas were and compare them to what I'd considered.

braydonk avatar Mar 23 '22 11:03 braydonk

As discussed in the SIG meeting on Mar 23, it may be helpful to provide a solution for all components, not just the scrapers.

codeboten avatar Mar 23 '22 16:03 codeboten

In https://github.com/lightstep/opentelemetry-prometheus-sidecar, I dealt with repeating errors in two ways:

  1. Log them once per callsite per minute. This uses reflection to determine the source location of each log; here's a library: https://github.com/lightstep/opentelemetry-prometheus-sidecar/tree/main/telemetry/doevery
  2. Emit metrics about the number of failures of each kind.

(1) allows operators to see at a glance that something bad is happening without overwhelming the server logs; (2) gives operators ongoing monitoring. Neither costs as much as logging every error.
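
As a rough illustration of combining (1) and (2), here is a sketch using runtime caller information and a per-callsite counter. It is not the `doevery` API, and the names and intervals are made up for the example:

```go
package ratelimitlog

// Sketch of the two mitigations described above: (1) log a given callsite at
// most once per interval, (2) count every failure so it can be exported as a metric.

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

type limiter struct {
	mu       sync.Mutex
	interval time.Duration
	lastSeen map[string]time.Time // keyed by file:line of the caller
	failures map[string]int       // per-callsite failure counts for later export
}

func newLimiter(interval time.Duration) *limiter {
	return &limiter{
		interval: interval,
		lastSeen: make(map[string]time.Time),
		failures: make(map[string]int),
	}
}

// LogError always counts the failure but only prints it if this callsite
// has not logged within the configured interval.
func (l *limiter) LogError(err error) {
	_, file, line, _ := runtime.Caller(1) // identify the calling source location
	key := fmt.Sprintf("%s:%d", file, line)

	l.mu.Lock()
	defer l.mu.Unlock()
	l.failures[key]++
	if last, ok := l.lastSeen[key]; ok && time.Since(last) < l.interval {
		return // suppressed; the counter still records it
	}
	l.lastSeen[key] = time.Now()
	fmt.Printf("%s: %v (seen %d times)\n", key, err, l.failures[key])
}
```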

jmacd avatar Mar 23 '22 16:03 jmacd

Hello! I would like to revive this discussion. We are still interested in finding a solution to this problem, even if it's not what I suggested in the issue. @dmitryax did you happen to come up with anything for this by chance?

From the suggestions above, I like the idea of emitting metrics instead if that's preferable. I also experimented with a zap log core that would guard against logs that were already emitted within the last (x amount of time), but that also feels like a band-aid solution.
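
For context, that kind of zap core guard could look roughly like the following sketch: a wrapper around an existing `zapcore.Core` that drops entries whose message was already logged within a window. This is purely illustrative and not code from the collector:

```go
package dedupcore

import (
	"sync"
	"time"

	"go.uber.org/zap/zapcore"
)

// dedupState is shared across cores derived via With so suppression is global.
type dedupState struct {
	mu   sync.Mutex
	seen map[string]time.Time
}

// dedupCore wraps another core and suppresses repeats of the same message
// that occur within the configured window.
type dedupCore struct {
	zapcore.Core
	window time.Duration
	state  *dedupState
}

func NewDedupCore(inner zapcore.Core, window time.Duration) zapcore.Core {
	return &dedupCore{
		Core:   inner,
		window: window,
		state:  &dedupState{seen: make(map[string]time.Time)},
	}
}

func (c *dedupCore) With(fields []zapcore.Field) zapcore.Core {
	return &dedupCore{Core: c.Core.With(fields), window: c.window, state: c.state}
}

// Check forwards to the wrapped core only if this message has not been
// logged within the window; otherwise the entry is silently dropped.
func (c *dedupCore) Check(ent zapcore.Entry, ce *zapcore.CheckedEntry) *zapcore.CheckedEntry {
	c.state.mu.Lock()
	last, ok := c.state.seen[ent.Message]
	if ok && time.Since(last) < c.window {
		c.state.mu.Unlock()
		return ce
	}
	c.state.seen[ent.Message] = time.Now()
	c.state.mu.Unlock()
	return c.Core.Check(ent, ce)
}
```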

braydonk avatar Sep 07 '22 12:09 braydonk

Discussed this in the SIG meeting today. We think that it would be useful to have something like what @jmacd described. This probably requires a short design doc. If anybody is willing to work on this the maintainers will be happy to review the design.

tigrannajaryan avatar Sep 21 '22 16:09 tigrannajaryan

I agree with the notion that this should be a concern for the collector as a whole. From that perspective, what we are talking about here is configuration of the collector's own telemetry.

The collector config already contains a dedicated section for configuring its own logging behavior. I suggest that we try to solve this problem by adding additional configuration options to the existing config section, and implement corresponding changes to the construction of the collector's logger.

As mentioned in the meeting today, zapcore has some sampling capabilities, so it seems likely that some settings could be exposed in the config to make use of these features.
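
For reference, here is a small sketch of how zapcore's built-in sampler can wrap a logger core. The specific tick and counts, and how they might surface in the collector's telemetry config, are assumptions for illustration only:

```go
package main

// Sketch of the zapcore sampling mentioned above: per tick, keep the first N
// entries with a given message and only every Mth entry after that. Whether
// and how these knobs get exposed in the collector config is the open question.

import (
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func newSampledLogger(base *zap.Logger) *zap.Logger {
	return base.WithOptions(zap.WrapCore(func(core zapcore.Core) zapcore.Core {
		// Per second, keep the first 5 entries for each message, then 1 in every 100.
		return zapcore.NewSamplerWithOptions(core, time.Second, 5, 100)
	}))
}

func main() {
	logger := newSampledLogger(zap.NewExample())
	for i := 0; i < 1000; i++ {
		logger.Error("error reading process name") // most repeats are dropped by the sampler
	}
}
```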

djaglowski avatar Sep 21 '22 16:09 djaglowski

Was this fixed by #6404? @braydonk anything left to do here?

mx-psi avatar Aug 11 '23 15:08 mx-psi

It alleviates the log spam problem, but I don't think it resolves the core problem, which is more specifically about scraper-level control.

In the hostmetricsreceiver, for example, I originally added a configuration option, `mute_process_name_error`, which hacks the scraper to essentially swallow the error. This config isn't ideal; it would be better for the scraper to have control over the importance of the errors it returns. In the particular case of the process name error, these are errors we expect to happen for certain processes without names (kernel processes, for example), which meant we were logging errors that were expected and had no possible user action. That's what I was really hoping to resolve here.

I think the original spirit of the issue was lost in the weeds of the log-spamming discussion. To be fair, though, I no longer agree with my original suggestions either. Perhaps I can open a new issue that states this more clearly, with a better suggestion?

braydonk avatar Aug 15 '23 13:08 braydonk

@braydonk, thanks for the update. I agree a new issue is appropriate so we can focus on what specifically is being proposed at this point.

djaglowski avatar Aug 16 '23 13:08 djaglowski

Thanks @djaglowski, I will write that up today. I'll close this one for now.

braydonk avatar Aug 16 '23 14:08 braydonk