opentelemetry-python
opentelemetry-python copied to clipboard
Incorrect validation of baggage values
Describe your environment python: 3.10.6 opentelemetry-api: 1.12.0 opentelemetry-sdk: 1.12.0
Steps to reproduce
from opentelemetry.baggage import _is_valid_pair
_is_valid_pair("sentry-transaction", "GET%20%2Fapi%2F%2Freport") # OK
from opentelemetry.baggage.propagation import W3CBaggagePropagator
W3CBaggagePropagator().extract({"baggage": "sentry-transaction=GET%20%2Fapi%2Freport"})
# Returns empty and raises warning Invalid baggage entry: `sentry-transaction=GET%20%2Fapi%2Freport`
What is the expected behavior? Opentelemetry should be able to parse baggage values which are percent encoded, as that's indicated by the W3C spec
A value contains a string whose character encoding MUST be UTF-8 [Encoding]. Any characters outside of the baggage-octet range of characters MUST be percent-encoded. Characters which are not required to be percent-encoded MAY be percent-encoded.
What is the actual behavior? No baggage values were extracted and the following warning was raised:
Invalid baggage entry: `sentry-transaction=GET%20%2Fapi%2Freport`
Additional context The problem is happening here: https://github.com/open-telemetry/opentelemetry-python/blob/75313b6c6d58945c1401622b6683ccdd28657984/opentelemetry-api/src/opentelemetry/baggage/propagation/init.py#L92
Note also that the indicated code path seems to be not be too optimized, as the call to set_baggage
will validate both the key and values again. And additionally, for each baggage value a new context will be created. Maybe it'd be better to first collect all baggage values and set them only once in the context.
@srikanthccv @ocelotl what do you think?
This looks like a legitimate bug. I haven't gotten time to triage this. I will check.
Checking...
Ok, I think this is what's happening here:
The definition for the baggage value
says:
value = *baggage-octet
baggage-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace, DQUOTE, comma, semicolon,
; and backslash
That is why our regex value
is this:
_VALUE_FORMAT = r"[\x21\x23-\x2b\x2d-\x3a\x3c-\x5b\x5d-\x7e]*"
That regex does not allow space characters, the reported string has one in %20
. The spec for value
says:
Leading and trailing whitespaces (OWS) are allowed and are not considered to be a part of the value.
I think that means we should first strip any leading and trailing whitespace characters from the value
string, then apply the regex (we don't do that so I think we do have a bug).
The space character in "GET%20%2Fapi%2F%2Freport"
is not leading nor trailing so I think we are doing the right thing by not accepting this string.
What do you think, @srikanthccv?
@ocelotl in the same spec you are citing you are missing the important part
Any characters outside of the baggage-octet range of characters MUST be percent-encoded.
We are running into a similar issue. FWIW, The OTEL client for NodeJS handles values with spaces so long as they are percent-encoded.
I'm not sure extract
should accept percent-encoded strings that have not been percent-encoded themselves, from here:
Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI.
I think extract
is doing the right thing by rejecting a string that comes with a %20
character because it expects that string to come from the result of an inject
method who should have percent-encoded that %20
(leading to %2520
). What is happening here is that a percent-encoded string is being passed to extract
without first passing it to inject
, which leads to an invalid string which does not have its percent-encoded characters percent-encoded themselves. When the same string is passed first to inject
the string is handled right:
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.context import get_current
from opentelemetry.baggage import set_baggage
carrier = {}
w3c_baggage_propagator = W3CBaggagePropagator()
context = set_baggage(
"baggage",
"sentry-transaction=GET%20%2Fapi%2Freport",
context=get_current()
)
w3c_baggage_propagator.inject(carrier, context)
print(carrier)
context = w3c_baggage_propagator.extract(carrier)
print(context)
{'baggage': 'baggage=sentry-transaction%3DGET%2520%252Fapi%252Freport'}
{'baggage-b037abe3-81b2-4195-86e6-929af2d86e99': {'baggage': 'sentry-transaction=GET%20%2Fapi%2Freport'}}
It is impossible for extract
to handle percent-encoded strings that have not been percent-encoded themselves before, because how can extract
tell if a percent symbol comes from the original string (then expect
should leave it as it is) or from the encoding done by inject
(then expect
should decode it)?
I don't think this is a bug, WDYT @srikanthccv?
@ocelotl I see your point, you are saying that the spec is ambiguos because given that the baggage-octet range includes the '%' symbol and the language around when to encode string, then there's no way to distinguish whether a%20b
means a%20b
, which would be fine because all characters are within the range, or whether it means a b
, which would also be fine as the space is outside the range, and hence must be percent encoded.
Note however, that in Java, the value is always percent encoded, so the case above would not be ambiguous, as in the first case it would be encoded as a%2520b
and in the second as a%20b
. See https://cs.github.com/open-telemetry/opentelemetry-java/blob/0e41b1469d1fbc20c4fc5e7b59df0f4ac54330b6/api/all/src/main/java/io/opentelemetry/api/baggage/propagation/W3CBaggagePropagator.java#L68
In Java, it also seems there's no validation happening for the value, and the value is always being decoded. See https://cs.github.com/open-telemetry/opentelemetry-java/blob/0e41b1469d1fbc20c4fc5e7b59df0f4ac54330b6/api/all/src/main/java/io/opentelemetry/api/baggage/propagation/Parser.java#L128 and the function that calls it.
Note though, that your example should be:
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.context import get_current
from opentelemetry.baggage import set_baggage
carrier = {}
w3c_baggage_propagator = W3CBaggagePropagator()
context = set_baggage(
"sentry-transaction",
"GET /api/report",
context=get_current()
)
print(context)
w3c_baggage_propagator.inject(carrier, context)
print(carrier)
context = w3c_baggage_propagator.extract(carrier)
print(context)
which unfortunately does not work, because you are restricting the values that can be used in the baggage, which is no correct.
@ocelotl I see your point, you are saying that the spec is ambiguos because given that the baggage-octet range includes the '%' symbol and the language around when to encode string, then there's no way to distinguish whether
a%20b
meansa%20b
, which would be fine because all characters are within the range, or whether it meansa b
, which would also be fine as the space is outside the range, and hence must be percent encoded.
What I am saying is that for a baggage value to be parseable by extract
it has to be percent-encoded first. In your example above you mentioned this:
from opentelemetry.baggage.propagation import W3CBaggagePropagator
W3CBaggagePropagator().extract({"baggage": "sentry-transaction=GET%20%2Fapi%2Freport"})
# Returns empty and raises warning Invalid baggage entry: `sentry-transaction=GET%20%2Fapi%2Freport`
That string "sentry-transaction=GET%20%2Fapi%2Freport"
is not percent-encoded and that is why you are getting that error. Out of curiosity, why would you pass a non-percent encoded string to extract
?
which unfortunately does not work, because you are restricting the values that can be used in the baggage, which is no correct.
Sorry, how are we restricting the values? :thinking:
@ocelotl that value is percent encoded correctly, and corresponds to the value GET /api/report
. That is a baggage value generated by Sentry, and it is encoded correctly according to the spec.
The problem with your example is that you are misunderstanding how percent encoding works. Please refer to the code used in the java package.
As I mentioned originally, the problem is that the validation of the value is happening after decoding it. But, you should first validate the encoded value conforms to the character range and then decode its value.
You are restricting the values, because the spec, which I have cited many times already, says
A value contains a string whose character encoding MUST be UTF-8 [Encoding]. Any characters outside of the baggage-octet range of characters MUST be percent-encoded. Characters which are not required to be percent-encoded MAY be percent-encoded.
So you are allowed to have any string as a value, but the function set_baggage
rejects any string with characters outside the baggage-octet
range.
Pleas also note that there are at least 2 other open-telemetry implementations that handle this.
@srikanthccv could you please take a look
Ah, ok, now I can understand what you mean, fixing...
@ocelotl Would you like to assign this yourself?
Related, decoding resource attributes from otlpresourcedetector: https://github.com/open-telemetry/opentelemetry-python/pull/3046
@ocelotl Would you like to assign this yourself?
I did already.
What is the status of this? Major issue for production release if we cant contain spaces
@ghandic
There is a PR open pending reviews for this issue. Once that is merged, it should go into the next release.