Default behavior of consumer errors in the pipeline
As of today in the collector main receiver OTLP receiver, errors are by default treated as retryable unless otherwise specified by the source by using the consumererror.NewPermanent. This requires components to accordingly mark every error permanent or not.
Based on the OTLP protocol though, only a very small subset of cases must be retryable, see https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#failures-1.
In contrib right now, looking at some of the most used processor attributes, resource, resourcedetector, and transform:
- They are all returning errors in some scenarios like failing some filtering conditions, etc.
- None of these components are returning permanent errors, even though I am highly confident that these errors are deterministic and a retry will not change the behavior.
Proposal: Since, I've seen this problem in practice, I would like to propose that we change the behavior of the consumer pipeline errors and treat all errors as permanent by default unless specified otherwise.
I agree. I remember we discussed this and decided to make the retriable errors opt-in as part of the effort to revisit error propagation that needs to be revamped: https://github.com/open-telemetry/opentelemetry-collector/issues/7047, https://github.com/open-telemetry/opentelemetry-collector/pull/11085
Slightly different but related to this issue: Currently, if two exporters in the same pipeline return an error for the same call, the fanoutconsumer will return a multierr.multiError, which will be considered ~retryable no matter the type of the component errors~ permanent if any component error is permanent. Is this the expected behavior?
Considering this, the new Error wrapper (current and hopefully final PR: #13042), and potentially a new "downstream" wrapper to implement the pipeline instrumentation RFC (this part specifically), I think we may want to lay down precisely how all these error wrappers are supposed to interact.
Current uses of consumererror.NewPermanent: https://github.com/search?q=repo%3Aopen-telemetry%2Fopentelemetry-collector+OR+repo%3Aopen-telemetry%2Fopentelemetry-collector-contrib+consumererror.NewPermanent+language%3AGo&type=code&p=1
Proposed approach for doing this:
- Make it so that
IsPermanentlooks something like this, withdefaultIsPermanentbeingfalse:
func IsPermanent(err error) bool {
if err == nil {
return false
}
if errors.As(err, &permanent{}) {
return true
}
var consumerErr Error
if errors.As(err, &consumerErr) {
return consumerErr.IsRetryable()
}
return defaultIsPermanent
}
- Make sure all components/codepaths that use
NewPermanentalso useNewRetryableErrorfor errors that are not permanent. - Introduce a
consumererror.DefaultToPermanentErrorfeature gate that handles the value ofdefaultIsPermanent. - Once
consumererror.DefaultToPermanentErroris in beta stage, deprecateNewPermanentandIsPermanent. - Once
consumererror.DefaultToPermanentErroris in stable stage, removeNewPermanentand replace usages ofIsPermanentwith checking theIsRetryablemethod.
Thoughts about this approach?
It looks like there is already a NewRetryableError function which uses the new consumererror.Error struct. Could we use that instead? The main task would be to update IsPermanent to take it into account.
Ah! Yes, I hadn't realized that. Okay, updated my comment to use NewRetryableError and IsRetryable instead
@mx-psi the approach above sounds good https://github.com/open-telemetry/opentelemetry-collector/issues/12393#issuecomment-3318609673