opentelemetry-collector-contrib
[exporter/loki] Consider 400 response to be a permanent err
Is your feature request related to a problem? Please describe. Currently, if Loki returns a 400 status code, the request is retried even though it's never going to succeed. In extreme cases, all the sending queue workers can be kept busy doing retries (which block the worker), preventing valid log messages from being processed.
In my case, this is being triggered by a client sending log messages that are too far in the past. This could be because it was a VM that was resumed from a suspended state or because it started reading a log file from the beginning.
Describe the solution you'd like If Loki returns a 400 status code, it should be considered a permanent error and not retried.
It might be appropriate to do this for all 4xx responses, with the exception of 429 (which I think should be retried).
Describe alternatives you've considered It might be possible to filter out old log messages earlier in the pipeline.
It's worth noting that Loki will send a 400 response even if only one log message in a large batch is invalid, so the collector treats every message in the batch as rejected when in fact only one was. As a result, the reported number of dropped log messages is higher than it really is.
I'm happy to submit a PR along the following lines:
    if resp.StatusCode >= http.StatusBadRequest && resp.StatusCode < http.StatusTooManyRequests {
        return consumererror.NewPermanent(err)
    }
    return consumererror.NewLogs(err, ld)
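Note that the range check above only covers 400 through 428. For the broader variant mentioned earlier (all 4xx except 429), here is a rough, standalone sketch of the classification. It uses only the standard library so it can be run on its own: `permanentError`, `isPermanent` and `classifyStatus` are hypothetical names made up for illustration, with `permanentError` standing in for what `consumererror.NewPermanent` would produce. It is not the exporter's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// permanentError is a hypothetical stand-in for the error produced by
// consumererror.NewPermanent: it marks a failure that must not be retried.
type permanentError struct{ err error }

func (p permanentError) Error() string { return "permanent error: " + p.err.Error() }
func (p permanentError) Unwrap() error { return p.err }

// isPermanent reports whether err (or anything it wraps) is a permanentError.
func isPermanent(err error) bool {
	var p permanentError
	return errors.As(err, &p)
}

// classifyStatus is a hypothetical helper that maps an HTTP status code
// returned by Loki to nil, a retryable error, or a permanent error.
func classifyStatus(code int) error {
	if code >= 200 && code < 300 {
		return nil
	}
	err := fmt.Errorf("loki returned HTTP %d", code)
	// Every 4xx except 429 means the request itself is bad, so resending
	// the same payload cannot succeed.
	if code >= 400 && code < 500 && code != http.StatusTooManyRequests {
		return permanentError{err: err}
	}
	// 429, 5xx and anything else are left retryable.
	return err
}

func main() {
	for _, code := range []int{204, 400, 404, 429, 503} {
		err := classifyStatus(code)
		fmt.Printf("%d: err=%v permanent=%t\n", code, err, isPermanent(err))
	}
}
```

In the exporter itself, the permanent branch would presumably return `consumererror.NewPermanent(err)` and the retryable branch `consumererror.NewLogs(err, ld)`, as in the snippet above.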
Pinging code owners: @gramidt @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.
@jpkrohling - What are your thoughts?
Sounds reasonable to me!