opentelemetry-lambda

Fix XRay Lambda Propagator Implementation Across OTel SDKs

Open garysassano opened this issue 8 months ago • 7 comments

Summary

The current implementation of the AWS X-Ray Lambda propagator (AWSXRayLambdaPropagator) in OpenTelemetry needs to be fixed to properly handle the Sampled=0 flag in the X-Ray trace header set via the _X_AMZN_TRACE_ID environment variable. This affects all OTel SDKs that implement the XRay Lambda propagator.

Background

When AWS Lambda functions run, AWS automatically sets the _X_AMZN_TRACE_ID environment variable. When X-Ray tracing is not enabled for the Lambda, AWS still sets this variable but with Sampled=0, indicating the request should not be sampled.

Source: AWS Lambda docs
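
As a rough illustration of what that variable holds, the sketch below splits an X-Ray trace header into its fields. The function name `parse_xray_trace_header` and the sample IDs are made up for this example, and the fallback value is only an assumption about what the header looks like when X-Ray tracing is disabled. This is not code from any OTel SDK.

```python
import os
from typing import Dict


def parse_xray_trace_header(header: str) -> Dict[str, str]:
    """Split 'Root=...;Parent=...;Sampled=...' into a dict of fields."""
    fields: Dict[str, str] = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip empty or malformed segments
            fields[key] = value
    return fields


# Inside Lambda the runtime sets this variable; the fallback value below is
# an illustrative header, not one captured from a real invocation.
raw = os.environ.get(
    "_X_AMZN_TRACE_ID",
    "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=0",
)
fields = parse_xray_trace_header(raw)
print(fields.get("Sampled"))  # "0" means the trace is marked as not sampled
```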

The current OpenTelemetry implementation, as described in the specification, incorrectly extracts context from this environment variable even when Sampled=0 is present. This causes any other propagators (like W3C TraceContext) to be skipped, effectively disabling distributed tracing when X-Ray is not enabled.

Source: OTel Lambda instrumentation docs

Issue

The main issue is in the pseudocode implementation in the OpenTelemetry specification:

extract(context, carrier) {
    xrayContext = xrayPropagator.extract(context, carrier)

    // To avoid potential issues when extracting with an active span context (such as with a span link),
    // the `xray-lambda` propagator SHOULD check if the provided context already has an active span context.
    // If found, the propagator SHOULD just return the extract result of the `xray` propagator.
    if (Span.fromContext(context).getSpanContext().isValid())
      return xrayContext

    // If xray-lambda environment variable not set, return the xray extract result.
    traceHeader = getEnvironment("_X_AMZN_TRACE_ID")
    if (isEmptyOrNull(traceHeader))
      return xrayContext

    // Apply the xray propagator using the span context contained in the xray-lambda environment variable.
    return xrayPropagator.extract(xrayContext, ["X-Amzn-Trace-Id": traceHeader])
}

This implementation always extracts from the environment variable if it exists, even when Sampled=0 is present. This forces all spans to be non-sampled even when other propagators like W3C TraceContext would otherwise create a root span.
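
The effect on sampling can be sketched with a toy model of a parent-based sampler. This is not the OTel SDK's `ParentBased` class: `should_sample` and `ParentContext` are hypothetical names, and the model assumes the default behavior in which a present parent's sampled flag is followed and the root sampler is consulted only when there is no parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParentContext:
    trace_id: str
    sampled: bool


def should_sample(parent: Optional[ParentContext], root_decision: bool = True) -> bool:
    """Toy parent-based decision: follow the parent if there is one."""
    if parent is not None:
        return parent.sampled
    return root_decision  # stands in for the root sampler, e.g. AlwaysOn


# Parent extracted from _X_AMZN_TRACE_ID while X-Ray is disabled (Sampled=0).
env_parent = ParentContext(trace_id="5759e988bd862e3fe1be46a994272793", sampled=False)

print(should_sample(env_parent))  # False: the span is dropped
print(should_sample(None))        # True: the root sampler decides
```

In this model, skipping extraction when the environment variable says Sampled=0 corresponds to the second call: no parent is set, so the root sampler gets to decide.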

Proposed Fix

The implementation should be updated to check if the trace header in the environment variable contains Sampled=0 and skip extraction in that case. Here is the proposed updated pseudocode:

extract(context, carrier) {
    // First try to extract from carrier
    xrayContext = xrayPropagator.extract(context, carrier)

    // Check if we got a valid context from the carrier
    if (hasValidSpan(xrayContext))
      return xrayContext

    // Check the environment variable
    traceHeader = getEnvironment("_X_AMZN_TRACE_ID")

    // If no env var or Sampled=0, do not extract further
    if (isEmptyOrNull(traceHeader) || traceHeader.contains("Sampled=0"))
      return xrayContext

    // Fallback: extract from the environment variable
    envCarrier = {"X-Amzn-Trace-Id": traceHeader}
    return xrayPropagator.extract(xrayContext, envCarrier)
}

// Helper function to check if a context has an active span
function hasValidSpan(context) {
    span = Span.fromContext(context)
    spanContext = span.getSpanContext()
    return spanContext.isValid()
}

This fix has been implemented in Node.js, Python, and Rust versions of the library with consistent behavior across all OTel SDKs.
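
For readers who prefer something runnable, here is a minimal Python sketch of the proposed logic. It is a toy model, not any SDK's actual code: `SpanContext`, `xray_lambda_extract`, and `toy_xray_extract` are hypothetical names, a context is modeled as "a span context or None", and the real X-Ray propagator is replaced by an injected callable. One deliberate deviation from the pseudocode: `Sampled=0` is matched per field rather than with a substring `contains`.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    sampled: bool


# Toy stand-in for an OTel Context: None means "no active span context".
Context = Optional[SpanContext]
Extract = Callable[[Context, Mapping[str, str]], Context]

TRACE_HEADER = "X-Amzn-Trace-Id"
ENV_VAR = "_X_AMZN_TRACE_ID"


def is_unsampled(header: str) -> bool:
    """True if the header carries a Sampled=0 field."""
    return any(part.strip() == "Sampled=0" for part in header.split(";"))


def xray_lambda_extract(
    context: Context,
    carrier: Mapping[str, str],
    environ: Mapping[str, str],
    xray_extract: Extract,
) -> Context:
    # First try the carrier, as in the proposed pseudocode.
    xray_context = xray_extract(context, carrier)
    if xray_context is not None:
        return xray_context

    # No env var, or Sampled=0: leave the context alone.
    trace_header = environ.get(ENV_VAR)
    if not trace_header or is_unsampled(trace_header):
        return xray_context

    # Fallback: extract from the environment variable.
    return xray_extract(xray_context, {TRACE_HEADER: trace_header})


def toy_xray_extract(context: Context, carrier: Mapping[str, str]) -> Context:
    """Hypothetical extractor: returns fixed IDs when the header is present."""
    header = carrier.get(TRACE_HEADER)
    if not header:
        return context
    return SpanContext(
        "5759e988bd862e3fe1be46a994272793",
        "53995c3f42cd8ad8",
        sampled=not is_unsampled(header),
    )


env = {ENV_VAR: "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=0"}
print(xray_lambda_extract(None, {}, env, toy_xray_extract))  # None
```

With `Sampled=0` in the environment the context stays empty, so a later sampler would treat the span as a root span.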

Benefits of This Fix

  1. Properly respects the Sampled=0 flag, allowing other propagators to create root spans when X-Ray is not enabled
  2. Ensures consistent behavior across all language implementations
  3. Maintains backward compatibility for normal X-Ray tracing scenarios
  4. Allows proper integration with W3C TraceContext and other propagation mechanisms

Implementations

Working implementations have been created for Node.js, Python, and Rust.

Action Items

  1. Update the OpenTelemetry specification with the corrected pseudocode
  2. Implement the fix in all OTel SDKs
  3. Release updates with this fix as a priority for AWS Lambda users

Testing

The fix can be verified by:

  1. Creating a Lambda function with X-Ray tracing disabled
  2. Configuring OpenTelemetry with W3C TraceContext and XRay Lambda propagators
  3. Verifying that traces are properly created and sampled by the W3C propagator

garysassano avatar Apr 18 '25 06:04 garysassano

Is this not something that should be fixed in the propagator itself? i.e.: https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/propagators/propagator-aws-xray-lambda https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/propagator/opentelemetry-propagator-aws-xray ...

wpessers avatar Apr 18 '25 13:04 wpessers

@wpessers Yes, as I mentioned, this needs to be addressed across all OTel SDKs. I opened the issue here as a central reference rather than filing separate issues in each opentelemetry-<language>-contrib repository, since this propagator is specifically designed for Lambda.

garysassano avatar Apr 18 '25 14:04 garysassano

@garysassano first, you should only be using xray-lambda if you're reporting spans to AWS X-Ray, and likely only if you enable Lambda's Active Tracing. Otherwise, you should be using the xray propagator. Second, propagators are applied in order, such that the last propagator overwrites previous propagators. Have you considered changing the order if you want to prioritize w3c propagation?
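
The ordering point can be sketched with a toy composite extractor. Everything here is a simplified model rather than OTel SDK code: `extract_traceparent`, `extract_xray`, and `composite_extract` are hypothetical names, a context is reduced to a `(trace_id, sampled)` tuple, and header validation is far looser than in the real propagators.

```python
from typing import Callable, List, Mapping, Optional, Tuple

# Toy context: (trace_id, sampled), or None when nothing was extracted.
Context = Optional[Tuple[str, bool]]
Extract = Callable[[Context, Mapping[str, str]], Context]


def extract_traceparent(context: Context, carrier: Mapping[str, str]) -> Context:
    parts = carrier.get("traceparent", "").split("-")
    if len(parts) != 4:
        return context  # header absent or malformed: keep what we had
    return (parts[1], (int(parts[3], 16) & 1) == 1)


def extract_xray(context: Context, carrier: Mapping[str, str]) -> Context:
    header = carrier.get("X-Amzn-Trace-Id", "")
    fields = dict(p.strip().split("=", 1) for p in header.split(";") if "=" in p)
    if "Root" not in fields:
        return context
    trace_id = "".join(fields["Root"].split("-")[1:])
    return (trace_id, fields.get("Sampled") == "1")


def composite_extract(extractors: List[Extract], carrier: Mapping[str, str]) -> Context:
    context: Context = None
    for extract in extractors:  # each one may overwrite the previous result
        context = extract(context, carrier)
    return context


carrier = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "X-Amzn-Trace-Id": "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=0",
}
# xray last: the unsampled X-Ray context wins.
print(composite_extract([extract_traceparent, extract_xray], carrier))
# traceparent last: the sampled W3C context wins.
print(composite_extract([extract_xray, extract_traceparent], carrier))
```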

tylerbenson avatar May 07 '25 15:05 tylerbenson

@garysassano Here's a question: if you were using the propagators traceparent,xray, what behavior would you expect if traceparent returned a valid and sampled span context, but the xray header contained Sampled=0? I don't think xray would defer to traceparent. I don't think any propagator does if there's a valid "header" (however the propagator defines valid: w3c requires both span and trace id, but xray proceeds with just Sampled=0).

Perhaps this concern is more with how the xray propagator handles things in general? For example, perhaps this should return context here instead of continuing?

tylerbenson avatar May 07 '25 16:05 tylerbenson

@tylerbenson, thank you, and here are my two cents on this topic, and please correct me if I’m wrong:

We have two scenarios.

Scenario A (plain xray):

  • composite propagator is traceparent, xray
  • request carrier has an X‑Amzn‑Trace‑Id header with Sampled=0
  • sampler = ParentBased

In this case xray runs last, sees Sampled=0, and marks the context “not‑sampled”. That’s fine, because the upstream X‑Ray client has already decided this trace should be dropped. If the request instead had a traceparent header, the xray propagator would just be ignored and do nothing "bad".

Scenario B (xray‑lambda, the problematic one):

  • composite propagator is traceparent, xray‑lambda
  • X‑Ray is disabled for the function
  • request has a sampled W3C traceparent header but no X‑Amzn‑Trace‑Id
  • Lambda runtime still sets the _X_AMZN_TRACE_ID env‑var with Sampled=0 (at least in all the scenarios I tried)
  • sampler = ParentBased(root=AlwaysOn)

In this case, the W3C text map propagator sees the traceparent header and extracts a sampled context, but then the xray‑lambda one is invoked, falls back to the env‑var, sees Sampled=0 there, and overwrites the sampled context.

I don't think the order matters in this case, because both propagators run, and once the context is set to non-sampled, I don't think it gets reset.

So the span is silently dropped, which is a bummer, because that was a really important trace that the sender wanted to be tracked. I think this is what @garysassano is referring to.

On another note, I completely agree that one could just avoid the xray-lambda propagator altogether, but using it provides better insight into a Lambda's "internals" than the plain xray one.

In the environment variable value for the xray trace ID, the parent segment points to an internal span created by the runtime, which is now surfaced with Application Signals in an otel-ish way, and which in turn contains other segments representing the Init and Overhead phases. Also, even though it is not super clear, it may be useful for other "active tracing" enabled services, like ApiGateway or SQS/SNS (or at least one would hope so).

See this trace for example, where:

  • invoke python-stdout is the client (sending an xray header)
  • python-stdout/LambdaService is the lambda runtime
  • python-stdout/LambdaExecutionEnvironment is the execution environment for this request
  • Init/LambdaExecutionEnvironment is the init phase
  • Overhead/LambdaExecutionEnvironment is the overhead

[Image: the trace described above]

Conversely, for some reason, the xray segment generated by APIGateway (or a Lambda URL), the one you can find in the regular HTTP header, is invalid (for instance, Root=1‑5759e988‑bd862e3fe1be46a994272793, often with only Root and sometimes ;Sampled=0) and doesn't really provide any valuable insight.

So, in the end, I think there may be more digging here to do because it's not at all a simple topic, and I am well aware that I may be missing other crucial details, but in a hybrid W3C/AWS X-ray scenario, I think that the idea of letting the xray-lambda propagator skip processing traces when Sampled=0 is not a terrible one. Or at least I think so.

alessandrobologna avatar May 07 '25 22:05 alessandrobologna

I wish there were a hint to tell us whether Sampled=0 is set because

  • AWS X-Ray is not activated
  • or parent trace context is not sampled

Since we don't know the actual reason why Sampled=0 is set, as far as I can see, using the xray propagator (instead of the xray-lambda propagator) is the best option for now, even though it is not a perfect solution, if the traces are exported not to X-Ray but to another OTel-compatible backend.

However, I wonder in which cases AWS X-Ray is enabled and the trace context is propagated to the Lambda function only through the _X_AMZN_TRACE_ID env var, and not through the X‑Amzn‑Trace‑Id header in the request carrier?

serkan-ozal avatar May 08 '25 18:05 serkan-ozal

One option to consider is to configure the xray propagator and add a span link to the X-Ray Active Tracing span if it's available. This would require instrumentation changes, though.
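
A rough sketch of that idea in a toy model: the new span keeps whatever parent the carrier provided (or starts its own trace), and the context from `_X_AMZN_TRACE_ID` is attached as a link instead of becoming the parent. `ToySpan`, `xray_ref_from_env`, and `start_span_with_xray_link` are hypothetical names, and the sample IDs are illustrative. A real implementation would go through the SDK's span and link APIs inside the Lambda instrumentation.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Mapping, Optional, Tuple

# (trace_id, span_id) pairs stand in for real span contexts.
SpanRef = Tuple[str, str]


@dataclass
class ToySpan:
    name: str
    trace_id: str
    parent: Optional[SpanRef]
    links: List[SpanRef] = field(default_factory=list)


def xray_ref_from_env(environ: Mapping[str, str]) -> Optional[SpanRef]:
    """Read Root and Parent from _X_AMZN_TRACE_ID, if both are present."""
    header = environ.get("_X_AMZN_TRACE_ID", "")
    fields = dict(p.strip().split("=", 1) for p in header.split(";") if "=" in p)
    root_parts = fields.get("Root", "").split("-")
    if len(root_parts) != 3 or "Parent" not in fields:
        return None
    return ("".join(root_parts[1:]), fields["Parent"])


def start_span_with_xray_link(
    name: str, parent: Optional[SpanRef], environ: Mapping[str, str]
) -> ToySpan:
    # Keep the carrier's parent; otherwise start a fresh trace.
    trace_id = parent[0] if parent else secrets.token_hex(16)
    span = ToySpan(name=name, trace_id=trace_id, parent=parent)
    xray_ref = xray_ref_from_env(environ)
    if xray_ref is not None:
        span.links.append(xray_ref)  # linked, not parented
    return span


env = {"_X_AMZN_TRACE_ID": "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"}
w3c_parent = ("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
span = start_span_with_xray_link("handler", w3c_parent, env)
print(span.trace_id, span.links)
```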

tylerbenson avatar Jun 17 '25 18:06 tylerbenson