"context canceled" is Added as a Span Event on `cortex.ingester/QueryStream` Trace
Describe the bug
When a QueryStream operation ends because its context was canceled, the error is added to the trace as a span event.
To Reproduce
Steps to reproduce the behavior:
- Start Cortex with tracing enabled.
- Run a query that the distributor sends to more than one ingester.
Expected behavior
If an ingester's context is canceled during the query (which I understand is normal operation of Cortex?), the operation should result in an OK span status with no attached span event.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: Helm
Additional Context
The span in question: https://github.com/cortexproject/cortex/blob/ab3ca0a967a60ebc46b5e9b3141b9bd26c893e00/pkg/ingester/ingester.go#L1744-L1745
Similar issue from the past: https://github.com/cortexproject/cortex/issues/1279
How it was fixed in WeaveWorks: https://github.com/weaveworks/common/pull/148/files
An example of a failing trace. In this example, there were five parallel query streams, and the one that was canceled was the slowest.
And its span event:
I think the span `/cortex.Ingester/QueryStream` was instrumented automatically by the gRPC tracing middleware. It does not come from the https://github.com/cortexproject/cortex/blob/ab3ca0a967a60ebc46b5e9b3141b9bd26c893e00/pkg/ingester/ingester.go#L1744-L1745 code path.
We would need to change the gRPC middleware library's behavior so that it ignores the context canceled error. I am not sure whether that is something we can do easily.
As a workaround, I added a transform processor to our Mimir OpenTelemetry collector that watches for this case and sets the span status to OK.
processors:
  transform/cortexquerycontextcanceledspanevent:
    error_mode: ignore
    trace_statements:
      - context: spanevent
        statements:
          - set(span.status.code, 1) where (span.name == "/cortex.Ingester/QueryStream" and attributes["message"] == "context canceled")