aws-otel-java-instrumentation
aws-otel-java-instrumentation copied to clipboard
AwsXrayPropagator can create invalid trace headers during baggage propagation
Describe the bug AwsXrayPropagator attempts to include baggage properties when creating the XRay trace header. It has code that attempts to limit the baggage to 256 characters, "as per the spec", but, that code can produce headers that are longer than 256 characters, triggering invalid requests to AWS.
see https://github.com/open-telemetry/opentelemetry-java-contrib/blob/main/aws-xray-propagator/src/main/java/io/opentelemetry/contrib/awsxray/propagator/AwsXrayPropagator.java#L130
According to https://docs.aws.amazon.com/ja_jp/step-functions/latest/apireference/API_StartExecution.html#API_StartExecution_RequestSyntax the trace header spec appears to be 256 characters overall and not 256 characters for the remainder "baggage" bits of the header.
the AWS X-Ray trace header. The trace header can also be passed in the request payload. Length Constraints: Minimum length of 0. Maximum length of 256.
So, AwsXrayPropagator should take into account the entire length of the trace header (so far) when deciding whether to append (further) baggage ... and not just the length of the baggage part of it. The logic is incorrectly only looking at the additional length from baggage.
Steps to reproduce Send requests to the AWS SDK where there is a large baggage header in the tracing context.
What did you expect to see? The AWS SDK call should work
What did you see instead? The AWS SDK call fails, with errors such as:
Caused by: software.amazon.awssdk.services.sfn.model.ValidationException: 1 validation error detected: Value Root=1-00000000-000000009b10a2dc482684aa;Parent=d70e93f77b4a569c;Sampled=0;<244 characters of baggage contents redacted> failed to satisfy constraint: Member must have length less than or equal to 256 (Service: Sfn, Status Code: 400, Request ID:
)
In this failed request, the baggage contents comprised of multiple properties, and the logic stopped appending when the baggage reached 244 characters. At this point, the overall header (which failed the request) was 320 characters long. So ... the truncation logic is working, but is failing to take into account the characters already in the trace header.
Additional context This was reproduced against the SfnAsyncClient startExecutionRequest. I assume it is also reproducible against other AWS services - however, it may be that the Sfn service may be more aggressive than other services with the 256 character validation?
This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled
This is still an issue, please correct, thanks!
@wangzlei @srprash As the owner of this component, do you mind providing some input here?
May I know where is the baggage from? When this issue happens, what the content in badge, is the StepFunction state machine triggered by an AWS Lambda function?
Hey @wangzlei
The content in the baggage is not important. In the case that triggered this for us:
- it is simply multiple key-value pairs, as is allowed in the W3C baggage spec
- it was a SfN call from application code running on EC2, not a lambda
Note that - my description, above, seems pretty clear about what the problem is. To re-iterate:
- at lines https://github.com/open-telemetry/opentelemetry-java-contrib/blob/main/aws-xray-propagator/src/main/java/io/opentelemetry/contrib/awsxray/propagator/AwsXrayPropagator.java#L116-L127 the trace header is initialised with a bunch of bytes for the current trace context
- at line https://github.com/open-telemetry/opentelemetry-java-contrib/blob/main/aws-xray-propagator/src/main/java/io/opentelemetry/contrib/awsxray/propagator/AwsXrayPropagator.java#L145 the trace header has baggage appended, up to a maximum of 256 additional chars
- the length check in step 2 does not take into account the characters added in step 1, leading to an overall header that is well over 256 characters and which fails validation by (at least) the SfN API
For us, this makes AWS XRay entirely unusable and we've had to drop it. Also, (separate from this bug), the truncation algorithm here is a little simplistic for our needs - we might want to prioritise retaining certain key-value pairs in the baggage (eg. some with specific prefixes or key values), rather than just arbitrarily dropping all baggage after a certain point.
The X-Ray propagator seems to essentially corrupt baggage due to it's incompatibility with the W3C baggage spec. It's also quite unpredictable behaviour due to seemingly requiring to rely on the existing header length to decide where to truncate.
For this reason I suggest that baggage should be ignored by the X-Ray propagator to prevent further issues. This should at least be the default behaviour. Unfortunately spec does not specify this, or anything regarding X-Ray propagator.
Other implementations already do this, for example opentelemetry-js xray propagator: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/propagators/opentelemetry-propagator-aws-xray/src/AWSXRayPropagator.ts
Thanks for providing comments.
@jamesmoessis The X-Ray trace header does not differentiate between trace context and baggage like W3C does. In the X-Ray header, everything except for the trace ID, parent ID, and sample flag is treated as baggage. Through the AWS X-Ray propagator, we can convert between the X-Ray header and W3C trace context + W3C baggage. This is proposed as a short-term solution. In the long-term future, X-Ray aims to move forward to W3C and fully adopt W3C trace context and baggage.
@davidconnard Regarding the length of the X-Ray header, I did not find a definition in X-Ray spec. However, based on some clues, it can be inferred that StepFunction likely has a contract with X-Ray, hence there is length checking in the StepFunction client API. So it is reasonable to truncate the baggage length to (256 - the length of the existing trace context).
The reason for querying the baggage content is this issue is likely related to a feature released by Lambda in the latter half of 2023. I need to confirm once again the length of the content Lambda injects into the X-Ray trace header and whether truncating it will affect Lambda's functionality. If you can help print the the baggage introduces the issue it is helpful.
Clarify that this issue is different than https://github.com/open-telemetry/opentelemetry-java-contrib/pull/1178 This issue is awsxraypropagator tries to inject a trace header over 256 to StepFunction. The issue #1178 is awsxraypropagator conflicts with baggage propagator since it truncates the baggage value.
This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled
This issue is not stale, a correction should be applied to the truncation algorithm.
This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled