Missing "x-amzn-trace-id" (and other trace ids) in response headers
Title: Missing "x-amzn-trace-id" (and other trace ids) in response headers
Description:
This was originally reported against AWS AppMesh / Envoy for X-Ray integration (https://github.com/aws/aws-app-mesh-roadmap/issues/394), but further evaluation of the issue by @suniltheta showed that it's an issue with any tracing extension, and results from how Envoy support for tracing extensions is built internally. Long story short: Envoy doesn't add tracing headers to response if they're missing. This way, proper serving on trace headers fully relies on whether the application container uses tracing SDK (e.g. XRAY SDK) or is otherwise configured to propagate tracing headers from Request to Response.
Repro steps:
- Deploy an ECS Fargate Service with Envoy, X-Ray Sidecar and raw, unconfigured
nginxcontainer acting as the "application container".- Configure Envoy to send Traces to X-Ray
- Send a bunch of requests to this service using
curl -v ...- Observe the following: 4.1. Traces of the requests are visible in X-Ray 4.2. Nginx sees the
x-amzn-trace-idheader, but is not configured to send it back (also, see "Why i think...." below) 4.3. The response - as seen bycurl- doesn't show thex-amzn-trace-idheader.In https://github.com/aws/aws-app-mesh-roadmap/issues/394 I present in much more details the configurations that were tested using ECS and AppMesh. @suniltheta also describes how he tested this with X-Ray and Jaeger
Why I think this is a bug?
In ideal scenario, where our "application container" is properly configured with XRAY SDK, or performs otherwise configured request-to-response header rewrite, all is fine. But in that ideal scenario, we rarely ever need the trace id. Things get ugly, when the requests start failing. Consider following scenarios:
- The "application service" times out and envoy responds with 504.
- The "application service" fails with some internal error and responds with a
5xxresponse code, but does not provide trace-id in the response.- The "application service" contains a misconfigured XRAY SDK, and the tracing headers are not copied to response
- The "application service" is a 3rd party, closed-source app, which has no way of being configured to copy these headers.
In any of these cases the client is unable to provide us with any meaning trace-id to debug issues with the request. I believe it should be Envoy's responsibility to - once it initiated new trace on the incoming request - to ensure that the ID of that trace is returned to the client. I believe Envoy should delegate the check for existence of the Trace ID to configured Tracing Extensions not only for request processing, but also for response processing.
Few more things to consider:
- In a typical scenario we host our ECS/AppMesh Services behind an AWS API Gateway. In this case, it's the API Gateway that initiates the trace (creates and populates the
x-amzn-trace-idheader on therequest) and pushes it up the stream. In this case, handling trace headers in Envoy is somewhat less important, because API Gateway will check the presence of thex-amzn-trace-idin the response, and will add it if it's missing - so, AWS API Gateway already works the way I'd like Envoy to work. - A less typical, but still valid scenario is this: Anything within the network can fire requests towards ECS Services by itself. The "initiating" party may be: 2.1. An ECS Service listening on SQS Queue / Kafka Topic 2.2. An ECS Task spawned as a result of Clouwatch Scheduled Event / EventBus Event 2.3. A script running on an EC2, reacting on some activity happening on EC2 2.4. etc.
In any of the cases in (2), there's no "external client" that would send its requests via AWS API Gateway and the response-level tracing headers are not enforced, unless it's the called application itself that decides to send them back.
cc: @suniltheta
So here are my thoughts:
Currently the request headers are injected with the trace context only in the request path by calling func injectContext on the span object ref: code. In the response path we don't have to inject a new context again because we don't want to re-start the lifecycle of the span for a particular request.
But what we can instead make is while we call the finishSpan inside finalizeDownstreamSpan or finalizeUpstreamSpan, can we also add a new functionality called copyContextIfMissing for span class?
2 things can happen here:
- Response already contains the trace header context (i.e., Backend is instrumented). We don't replace/update it since the upstream might have updated the trace headers.
- Response doesn't contain the trace header context (i.e., Backend is not instrumented). We will just copy the trace related header fields from request header to response header. This can be similar to
always_set_request_id_in_responsebool (ref: code) config flag that we have in HttpConnectionManager. Where we include if the headers are missing. Also need to decide where we want to give the config option (sayalways_set_trace_context_in_response), whether in the HttpConnectionManager (preferably) or the Tracer Config itself.
In saying that I am not sure if this was a conscious design decision that was made during the initial days of implementation of tracing support in envoy to not explicitly add the trace context again in the response path. I have this question for the community to seek their opinion. If this is not an issue then I can work on implementation as well.
I tried to look through the earlier issues on this github, I am not sure if this question was brought up earlier and a decision was made there.
Tagging for attention: @dio @lizan @mattklein123
In saying that I am not sure if this was a conscious design decision that was made during the initial days of implementation of tracing support in envoy to not explicitly add the trace context again in the response path. I have this question for the community to seek their opinion. If this is not an issue then I can work on implementation as well.
In general, it not a requirement of some (most?) tracing systems for responses to be annotated with a header. Spans are opened on the request path and then finalized at some later time. Each hop owns its owned span start/stop logic, so nothing has to be returned to the previous hop to make this happen.
I'm not sure if this can be done generically for all tracers, but it seems reasonable to me to allow a tracer to be configured to send back headers are finalizing time if it chooses. For example, if there is no response with sufficient context, local context can be added.
/assign suniltheta
suniltheta is not allowed to assign users.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
/no stalebot
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
no stale
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
no stalebot
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
/assign
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
no stale bot
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
no stale
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.