NServiceBus

OpenTelemetry: In tail-sampling scenarios, failures that are fixed in subsequent retries may be harmful

Open lailabougria opened this issue 2 years ago • 2 comments

Describe the feature.

Is your feature related to a problem? Please describe.

Every failure that occurs causes the span to be tagged as failed. Suppose the customer is using a tail-sampling strategy that keeps all failed traces. In that case, they may want to filter out traces whose failures were resolved by subsequent retries, and sample only those that failed consistently, to the point that the message was moved to the error queue.

Describe the requested feature

Make the failure-tracking behavior configurable, e.g. only mark messages that are moved to the error queue as failed, and instead add specific tags to failures (e.g. exception message, number of retries, etc.) when there are retries left according to the recoverability policy. This way, users can still identify traces that included retries, but only failed messages (those that went to the error queue) will show up as failed traces.
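To make the request concrete, here is a minimal sketch of the proposed behavior. It is plain Python with a stand-in `Span` class, not the NServiceBus or OpenTelemetry API. The function `record_failure`, the flag `only_error_queue_fails`, and the attribute names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Minimal stand-in for a tracing span (NOT the OpenTelemetry API)."""
    name: str
    status: str = "UNSET"
    attributes: dict = field(default_factory=dict)


def record_failure(span, exc, attempt, max_attempts, only_error_queue_fails=True):
    """Record a processing failure on the span.

    attempt is 1-based. max_attempts is the total number of attempts a
    (hypothetical) recoverability policy allows before the message is
    moved to the error queue.
    """
    retries_left = attempt < max_attempts

    # Always tag the failure so traces that included retries stay identifiable.
    # These attribute names are assumptions, not an existing convention.
    span.attributes["retry.attempt"] = attempt
    span.attributes["retry.last_exception"] = str(exc)

    if retries_left and only_error_queue_fails:
        # Retries remain: leave the status untouched so a tail sampler
        # that keeps failed traces does not pick this one up yet.
        return

    span.status = "ERROR"
    if not retries_left:
        span.attributes["moved_to_error_queue"] = True
```

With the flag enabled, a message that fails on attempt 1 of 3 keeps an unset status but carries the retry tags. Only the failure on the final attempt marks the span as failed.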

Describe alternatives you've considered

Additional Context

No response

lailabougria avatar May 02 '23 13:05 lailabougria

Would using span links rather than child spans solve this?

andreasohlund avatar Apr 22 '24 10:04 andreasohlund

It would. For any delayed retries, we should use span links instead of child spans. Users can control how long they wait for a single trace to complete (for tail sampling), and that value is usually around 5 seconds.
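A rough illustration of the difference, again in plain Python with a stand-in span type rather than the real OpenTelemetry API. The helper names `start_root`, `start_child`, and `start_linked` are hypothetical. A child span stays in the original trace, while a linked span starts a new trace that only references the original attempt.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceSpan:
    """Minimal stand-in for a span with trace identity (NOT the OpenTelemetry API)."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None
    links: list = field(default_factory=list)  # (trace_id, span_id) tuples


def start_root(name):
    """Start a span in a brand-new trace."""
    return TraceSpan(name, trace_id=secrets.token_hex(16))


def start_child(parent, name):
    """Child span: shares the parent's trace, so it extends that trace."""
    return TraceSpan(name, trace_id=parent.trace_id, parent_id=parent.span_id)


def start_linked(origin, name):
    """Linked span: new trace that points back at the original attempt."""
    return TraceSpan(
        name,
        trace_id=secrets.token_hex(16),
        links=[(origin.trace_id, origin.span_id)],
    )


first_attempt = start_root("process message")
immediate_retry = start_child(first_attempt, "immediate retry")
delayed_retry = start_linked(first_attempt, "delayed retry")
```

In this sketch the delayed retry no longer keeps the original trace open, so the original trace can complete within the tail-sampling wait window while the link still lets users navigate between attempts.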

lailabougria avatar Apr 22 '24 12:04 lailabougria

Discussed with the OpenTelemetry community and validated that this is a false assumption. Not needed for now.

SzymonPobiega avatar Jun 17 '24 08:06 SzymonPobiega