statsd_exporter icon indicating copy to clipboard operation
statsd_exporter copied to clipboard

Safe logging of problems in incoming events [was: Malformed name tag - Log level should be higher]

Open Cruuncher opened this issue 1 year ago • 2 comments

The following log is currently set at debug level: https://github.com/prometheus/statsd_exporter/blob/58769c7b4d128e7ceae1cf8d893260ed5b5afa4d/pkg/line/line.go#L124

This log is very useful for debugging missing tags in metrics. However to get it we have to set statsd-exporter at debug level logging, which also brings in the "incoming line" log which logs way too much causing issues for us.

Can we increase this log to at least info level? Debug doesn't make sense to me for such important info

Cruuncher avatar Apr 01 '24 21:04 Cruuncher

I can open the PR if the change is approved

Cruuncher avatar Apr 01 '24 21:04 Cruuncher

The reason we set it at debug level is because it might emit a log line for more or less every incoming event. As a general rule, we've kept everything O(incoming) out of info-level logs, to avoid logging volume (= cost) surprises in normal operation.

Some people run this exporter with a lot of events coming in, logging a significant fraction of these for even a short period across the whole fleet may be wildly expensive. On the other hand, you're right that we are not giving them a lot of options to find out what's wrong without turning on the firehose on at least a subset.

If you're up for it, I would love to have rate limited logging of problematic event lines. What I roughly imagine is that we identify the non-varying parts of the log line, and only emit one line per some time period for each of these. We'll probably want some settings to tune that, since what's acceptable may vary a lot, but as a default I would say one log line per metric cardinality per minute would be a good starting point. Ideally we'd have some indication how many lines we've rate limited away when we do finally flush them.

In general, this would allow us to elevate many of the "something is wrong with this event" logs to Warning, so they're shown and stand out by default.

On this log line in particular:

  • how useful is it in isolation? How much do you need to know about the full incoming event to make sense of it?
  • In this function, there's two uses of the "malformed name tag" message, making it indistinguishable which error path you hit. When touching this, please give them distinct messages.

matthiasr avatar Aug 18 '24 13:08 matthiasr