NServiceBus
OpenTelemetry: In tail-sampling scenarios, failures that are fixed in subsequent retries may be harmful
Describe the feature.
Is your feature related to a problem? Please describe.
Every failure that occurs causes the span to be tagged as failed. Suppose the customer uses a tail-sampling strategy that keeps all failed traces. In that case, they may want to filter out traces whose failures were resolved by subsequent retries, and sample only those that failed consistently, to the point that the message was moved to the error queue.
Describe the requested feature
Make the failure-tracking behavior configurable, e.g. mark only messages that are moved to the error queue as failed, and instead add specific tags for failures (e.g. exception message, number of retries, etc.) when the recoverability policy still has retries left. This way, users can still identify traces that included retries, but only failed messages (those that went to the error queue) show up as failed traces.
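The requested behavior can be illustrated with a minimal, language-neutral sketch. It is written in plain Python rather than against the NServiceBus or OpenTelemetry APIs, and everything in it is an assumption made for illustration: the `FailureTracking` setting, the `Span` stand-in, the `record_failure` and `keep_trace` helpers, and the `retry.*` attribute names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureTracking(Enum):
    """Hypothetical setting; not an existing NServiceBus option."""
    EVERY_ATTEMPT = "every_attempt"        # behavior described in the problem statement
    ERROR_QUEUE_ONLY = "error_queue_only"  # behavior requested in this issue


@dataclass
class Span:
    """Minimal stand-in for a tracing span (not the OpenTelemetry API)."""
    status: str = "unset"
    attributes: dict = field(default_factory=dict)


def record_failure(span, exc, attempt, retries_left, mode):
    """Record one failed processing attempt on the span according to the mode."""
    # Always attach the details, so traces that involved retries stay discoverable.
    # The attribute names are hypothetical.
    span.attributes["retry.attempt"] = attempt
    span.attributes["retry.exception_message"] = str(exc)
    # Assumption: no retries left means the message goes to the error queue.
    moved_to_error_queue = retries_left == 0
    span.attributes["retry.moved_to_error_queue"] = moved_to_error_queue
    # Only flag the span as failed when the mode or the outcome calls for it.
    if mode is FailureTracking.EVERY_ATTEMPT or moved_to_error_queue:
        span.status = "error"
    return span


def keep_trace(spans):
    """Tail-sampling rule from the scenario: keep traces that contain a failed span."""
    return any(s.status == "error" for s in spans)
```

Under these assumptions, an attempt that fails while retries remain leaves the span status unset in `ERROR_QUEUE_ONLY` mode, so `keep_trace` drops the trace unless a later attempt exhausts the retries. The retry attributes are still recorded and can be queried separately.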
Describe alternatives you've considered
Additional Context
No response
Would using span links rather than child spans solve this?
It would. Any delayed retries should use span links instead of child spans. Users can control how long they wait for a single trace to complete (for tail sampling), and that value is usually around 5 seconds.
Discussed this with the OpenTelemetry community, who confirmed that this is a false assumption. Not needed for now.