envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Missing "x-amzn-trace-id" (and other trace ids) in response headers

Open mkielar opened this issue 3 years ago • 18 comments

Title: Missing "x-amzn-trace-id" (and other trace ids) in response headers

Description:

This was originally reported against AWS AppMesh / Envoy for X-Ray integration (https://github.com/aws/aws-app-mesh-roadmap/issues/394), but further evaluation of the issue by @suniltheta showed that it's an issue with any tracing extension, and results from how Envoy support for tracing extensions is built internally. Long story short: Envoy doesn't add tracing headers to response if they're missing. This way, proper serving on trace headers fully relies on whether the application container uses tracing SDK (e.g. XRAY SDK) or is otherwise configured to propagate tracing headers from Request to Response.

Repro steps:

  1. Deploy an ECS Fargate Service with Envoy, X-Ray Sidecar and raw, unconfigured nginx container acting as the "application container".
  2. Configure Envoy to send Traces to X-Ray
  3. Send a bunch of requests to this service using curl -v ...
  4. Observe the following: 4.1. Traces of the requests are visible in X-Ray 4.2. Nginx sees the x-amzn-trace-id header, but is not configured to send it back (also, see "Why i think...." below) 4.3. The response - as seen by curl - doesn't show the x-amzn-trace-id header.

In https://github.com/aws/aws-app-mesh-roadmap/issues/394 I present in much more details the configurations that were tested using ECS and AppMesh. @suniltheta also describes how he tested this with X-Ray and Jaeger

Why I think this is a bug?

In ideal scenario, where our "application container" is properly configured with XRAY SDK, or performs otherwise configured request-to-response header rewrite, all is fine. But in that ideal scenario, we rarely ever need the trace id. Things get ugly, when the requests start failing. Consider following scenarios:

  1. The "application service" times out and envoy responds with 504.
  2. The "application service" fails with some internal error and responds with a 5xx response code, but does not provide trace-id in the response.
  3. The "application service" contains a misconfigured XRAY SDK, and the tracing headers are not copied to response
  4. The "application service" is a 3rd party, closed-source app, which has no way of being configured to copy these headers.

In any of these cases the client is unable to provide us with any meaning trace-id to debug issues with the request. I believe it should be Envoy's responsibility to - once it initiated new trace on the incoming request - to ensure that the ID of that trace is returned to the client. I believe Envoy should delegate the check for existence of the Trace ID to configured Tracing Extensions not only for request processing, but also for response processing.

mkielar avatar Mar 09 '22 12:03 mkielar

Few more things to consider:

  1. In a typical scenario we host our ECS/AppMesh Services behind an AWS API Gateway. In this case, it's the API Gateway that initiates the trace (creates and populates the x-amzn-trace-id header on the request) and pushes it up the stream. In this case, handling trace headers in Envoy is somewhat less important, because API Gateway will check the presence of the x-amzn-trace-id in the response, and will add it if it's missing - so, AWS API Gateway already works the way I'd like Envoy to work.
  2. A less typical, but still valid scenario is this: Anything within the network can fire requests towards ECS Services by itself. The "initiating" party may be: 2.1. An ECS Service listening on SQS Queue / Kafka Topic 2.2. An ECS Task spawned as a result of Clouwatch Scheduled Event / EventBus Event 2.3. A script running on an EC2, reacting on some activity happening on EC2 2.4. etc.

In any of the cases in (2), there's no "external client" that would send its requests via AWS API Gateway and the response-level tracing headers are not enforced, unless it's the called application itself that decides to send them back.

mkielar avatar Mar 09 '22 12:03 mkielar

cc: @suniltheta

RyanTheOptimist avatar Mar 10 '22 00:03 RyanTheOptimist

So here are my thoughts:

Currently the request headers are injected with the trace context only in the request path by calling func injectContext on the span object ref: code. In the response path we don't have to inject a new context again because we don't want to re-start the lifecycle of the span for a particular request.

But what we can instead make is while we call the finishSpan inside finalizeDownstreamSpan or finalizeUpstreamSpan, can we also add a new functionality called copyContextIfMissing for span class?

2 things can happen here:

  1. Response already contains the trace header context (i.e., Backend is instrumented). We don't replace/update it since the upstream might have updated the trace headers.
  2. Response doesn't contain the trace header context (i.e., Backend is not instrumented). We will just copy the trace related header fields from request header to response header. This can be similar to always_set_request_id_in_response bool (ref: code) config flag that we have in HttpConnectionManager. Where we include if the headers are missing. Also need to decide where we want to give the config option (say always_set_trace_context_in_response), whether in the HttpConnectionManager (preferably) or the Tracer Config itself.

In saying that I am not sure if this was a conscious design decision that was made during the initial days of implementation of tracing support in envoy to not explicitly add the trace context again in the response path. I have this question for the community to seek their opinion. If this is not an issue then I can work on implementation as well.

I tried to look through the earlier issues on this github, I am not sure if this question was brought up earlier and a decision was made there.

Tagging for attention: @dio @lizan @mattklein123

suniltheta avatar Mar 11 '22 20:03 suniltheta

In saying that I am not sure if this was a conscious design decision that was made during the initial days of implementation of tracing support in envoy to not explicitly add the trace context again in the response path. I have this question for the community to seek their opinion. If this is not an issue then I can work on implementation as well.

In general, it not a requirement of some (most?) tracing systems for responses to be annotated with a header. Spans are opened on the request path and then finalized at some later time. Each hop owns its owned span start/stop logic, so nothing has to be returned to the previous hop to make this happen.

I'm not sure if this can be done generically for all tracers, but it seems reasonable to me to allow a tracer to be configured to send back headers are finalizing time if it chooses. For example, if there is no response with sufficient context, local context can be added.

mattklein123 avatar Mar 14 '22 20:03 mattklein123

/assign suniltheta

suniltheta avatar Mar 16 '22 21:03 suniltheta

suniltheta is not allowed to assign users.

:cat:

Caused by: a https://github.com/envoyproxy/envoy/issues/20272#issuecomment-1069658722 was created by @suniltheta.

see: more, trace.

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 21 '22 16:04 github-actions[bot]

/no stalebot

suniltheta avatar Apr 21 '22 16:04 suniltheta

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar May 22 '22 00:05 github-actions[bot]

no stale

suniltheta avatar May 22 '22 00:05 suniltheta

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 21 '22 04:06 github-actions[bot]

no stalebot

suniltheta avatar Jun 21 '22 23:06 suniltheta

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jul 22 '22 04:07 github-actions[bot]

/assign

suniltheta avatar Jul 22 '22 05:07 suniltheta

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 21 '22 08:08 github-actions[bot]

no stale bot

suniltheta avatar Aug 22 '22 18:08 suniltheta

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 22 '22 00:09 github-actions[bot]

no stale

suniltheta avatar Sep 22 '22 00:09 suniltheta

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 22 '22 08:10 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions[bot] avatar Oct 29 '22 08:10 github-actions[bot]