opentelemetry-proto
Are API users required to use valid UTF-8?
In https://github.com/open-telemetry/opentelemetry-go/issues/3021, a user reports having spans fail to export because of invalid UTF-8.
I believe the SDK should pass this data through to the receiver, regardless of UTF-8 correctness. I am not sure whether gRPC and the Google protobuf library used by each SDK offer an option to bypass UTF-8 checking, but if so, that would be one path forward if we agree.
Another option is to maintain (e.g., via sed) copies of the protocol that replace string with bytes; SDKs could use these definitions instead, likewise bypassing UTF-8 checking.
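For context on why that substitution is safe on the wire: string and bytes fields share the same length-delimited encoding in protobuf, and the UTF-8 check lives only in the generated marshal/unmarshal code for string fields. A minimal sketch with Go's protowire package (field number 1 here is arbitrary, not an actual OTLP field):

```go
package main

import (
	"bytes"
	"fmt"

	"google.golang.org/protobuf/encoding/protowire"
)

func main() {
	payload := []byte{'h', 'i', 0xff} // 0xff makes this invalid UTF-8

	// Encoded as if field 1 were declared `string`:
	// tag, then length, then the raw bytes.
	asString := protowire.AppendTag(nil, 1, protowire.BytesType)
	asString = protowire.AppendBytes(asString, payload)

	// Encoded as if field 1 were declared `bytes`:
	// identical bytes on the wire.
	asBytes := protowire.AppendTag(nil, 1, protowire.BytesType)
	asBytes = protowire.AppendBytes(asBytes, payload)

	fmt.Println(bytes.Equal(asString, asBytes)) // true
}
```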
Strings are UTF-8 in Protobuf:
> string := valid UTF-8 string, or sequence of 7-bit ASCII bytes; max 2GB
> As described, a string must use UTF-8 character encoding. A string cannot exceed 2GB.
So, bypassing SDK checks is not enough. We would be non-compliant with the Protobuf spec, and recipients may fail to decode the data. Given that this is part of the Stable spec, I don't think we can change string to bytes anymore.
Could we agree on a transformation to turn invalid UTF-8 into valid UTF-8? Should we reject those strings?
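If we did agree on a transformation, the obvious candidate is replacing each invalid sequence with U+FFFD (the Unicode replacement character). A sketch of that in Go; the fast-path check and the choice of replacement are assumptions, not anything the spec prescribes:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize returns s unchanged when it is already valid UTF-8,
// otherwise it replaces each invalid byte sequence with U+FFFD.
func sanitize(s string) string {
	if utf8.ValidString(s) {
		return s // fast path: no allocation for well-formed input
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	fmt.Printf("%q\n", sanitize("ok"))          // "ok"
	fmt.Printf("%q\n", sanitize("bad\xffbyte")) // "bad�byte"
}
```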
> Could we agree on a transformation to turn invalid UTF-8 into valid UTF-8? Should we reject those strings?
Yes, I think these are valid options to consider for the SDK functionality.
I believe there is already an issue for this elsewhere in the spec.
For this case, I argued in the Go issue that we should limit strings not by size but by grapheme, so that they don't get cut off into invalid UTF-8 (or have the meaning of the UTF-8 string changed).
I forgot: the Erlang SDK actually doesn't truncate into invalid UTF-8, because we treat the length limit as the number of graphemes: https://github.com/open-telemetry/opentelemetry-erlang/blob/main/apps/opentelemetry/src/otel_attributes.erl#L105
The issue is that the spec says "character" and not "grapheme": https://github.com/open-telemetry/opentelemetry-specification/blob/ca593258953811eb33197976b8e565299b874853/specification/common/README.md#attribute-limits
I believe there's a question of what we actually want to have here.
The term "character" is reasonably well defined in relation to UTF-8, but it's not the same as "grapheme", I think?
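To make the distinction concrete: in most APIs a "character" count means code points (runes, in Go), while a grapheme is the user-perceived unit and can span several code points. A standard-library sketch; counting actual grapheme clusters would need an external Unicode TR29 segmenter:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "e\u0301" // "é" built as 'e' plus a combining acute accent

	fmt.Println(len(s))                    // 3 bytes
	fmt.Println(utf8.RuneCountInString(s)) // 2 runes ("characters")
	// ...yet it renders as a single grapheme, so a 1-"character"
	// limit applied per rune would strip the accent from its base.
}
```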
Since protobuf (with proto3 syntax) is very strict about UTF-8 and we intend for OTLP/proto to be the standard, we are quite constrained. The Go protobuf implementation rejects such data in both Marshal() and Unmarshal(), though some implementations do not. Without correcting this problem at the protocol level, we are limited to one of the following:
- Validate every string-valued field before generating OTLP, forcing correction for invalid UTF-8
- Expect errors somewhere in the Export path
I am opposed to both of these options. The validate-and-force-correction path is simply expensive, especially when the SDK actually repairs invalid data. The expect-errors path is much worse, because generally the entire batch of data is rejected with little clue as to what the actual error was (note the error message in https://github.com/open-telemetry/opentelemetry-go/issues/3021 does not include detail).
Therefore, what I actually want is for SDKs to allow invalid UTF-8. While users are instructed that they SHOULD NOT use invalid UTF-8, SDKs will be specified such that they SHOULD NOT themselves treat invalid UTF-8 as an error. As mentioned in my earlier comment, I believe this can be accomplished by maintaining two variant .proto definitions, one where every string field is a wire-compatible bytes field instead.
How this solves the problem:
- SDKs will be advised to use the bytes-form of OTLP so that their marshal and export routines remain ignorant of any UTF-8 problems.
- Processing pipelines will be advised to use the bytes-form of OTLP so that they can process data with invalid UTF-8.
- Consumers will be advised that all the bytes fields SHOULD be interpreted as strings and that UTF-8 validation and correction is their responsibility.
Consumers then can accept data with invalid UTF-8, which I think is ultimately what we all want. I find myself wanting to extend the library guidelines on performance to address this situation. Where we currently write:
> Although there are inevitable overhead to achieve monitoring, API should not degrade the end-user application as possible.
I want to add a statement like so:
> Although users SHOULD use valid UTF-8 strings everywhere in the OpenTelemetry API, the use of invalid UTF-8 SHOULD NOT degrade the observability of telemetry data. SDKs SHOULD permit invalid UTF-8 to be exported without validating UTF-8, since that would degrade performance unnecessarily.
In addition, I would add that SDKs SHOULD treat attribute "character limits" as though the character set were US-ASCII, so that (as a performance concern) truncation logic need not parse UTF-8 code points or identify grapheme clusters.
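To illustrate the performance argument: byte-based truncation is a single slice operation, while even the cheapest UTF-8-aware variant must walk back to a rune boundary (and grapheme-aware truncation costs more still). Both helpers below are illustrative, not taken from any SDK:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateASCII applies the US-ASCII interpretation: limit is bytes.
// O(1), but it can split a multi-byte rune, yielding invalid UTF-8.
func truncateASCII(s string, limit int) string {
	if len(s) <= limit {
		return s
	}
	return s[:limit]
}

// truncateRuneSafe backs up to the nearest rune boundary, so the
// result stays valid UTF-8 (though it may still split a grapheme).
func truncateRuneSafe(s string, limit int) string {
	if len(s) <= limit {
		return s
	}
	cut := limit
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	s := "héllo" // 'é' is 2 bytes, so the whole string is 6 bytes
	fmt.Printf("%q\n", truncateASCII(s, 2))    // "h\xc3" (invalid UTF-8)
	fmt.Printf("%q\n", truncateRuneSafe(s, 2)) // "h"
}
```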
> Processing pipelines will be advised to use the bytes-form of OTLP so that they can process data with invalid UTF-8.
Isn't every recipient a processing pipeline from this perspective? This would mean a breaking change for every backend that supports receiving OTLP today.
I want to suggest an alternate statement:
> Users SHOULD use valid UTF-8 strings everywhere in the OpenTelemetry API. The use of invalid UTF-8 results in undefined behavior.
So, dear user, you have been warned. Don't do it; if you do and your backend crashes, that's between you and your vendor.
This statement leaves the choice to each SDK. If you want to pay the cost and validate, you can do so. If you don't care and want to pass the problem on to the recipients, you can do that too.
"SHOULD" is a good compromise (I'd be on the side of requiring valid utf-8) but I'd argue that we should also state for the users that valid utf8 will not be broken.
We already specify that any non-valid UTF-8 string MUST be encoded as a bytes_value.
I'm against the chaos that would ensue from forcing every string to be raw bytes with no notion of encoding. I think the current specification around bytes_value (with no encoding flag) is already a bit rough.
- Should we allow non-UTF-8? Yes, but only for attribute values. Names of all forms must be UTF-8 for our protocol to work, thanks to protocol buffers.
- Is it OK for the API/SDK to re-encode names as UTF-8? Sure. This may be where the rub is. We should enforce that this conversion happens early, so error messages reach users faster.
- Should we modify OTLP to not use the string type in proto? I feel this is rather extreme for little gain. Values today can be in UTF-8, and our semantic conventions also tend to be UTF-8. I realize this can be restrictive, but changing it would cause major issues in using the protocol that I think are just not necessary.
- Should we ensure users denote the encoding of non-UTF-8 strings? I'd say yes, but the state of the world today is OK.
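For reference, this is roughly what the bytes_value path mentioned above looks like with the generated Go OTLP types; a sketch under the assumption that the producer knows its value may not be UTF-8 (the attribute key is hypothetical):

```go
package main

import (
	"fmt"
	"unicode/utf8"

	commonpb "go.opentelemetry.io/proto/otlp/common/v1"
)

// attribute encodes raw data as a string_value when it is valid UTF-8
// and falls back to a bytes_value otherwise, per the spec text above.
func attribute(key string, raw []byte) *commonpb.KeyValue {
	var v *commonpb.AnyValue
	if utf8.Valid(raw) {
		v = &commonpb.AnyValue{Value: &commonpb.AnyValue_StringValue{StringValue: string(raw)}}
	} else {
		v = &commonpb.AnyValue{Value: &commonpb.AnyValue_BytesValue{BytesValue: raw}}
	}
	return &commonpb.KeyValue{Key: key, Value: v}
}

func main() {
	kv := attribute("payload.sample", []byte{0xde, 0xad, 0xbe, 0xef})
	fmt.Println(kv.GetValue().GetBytesValue()) // [222 173 190 239]
}
```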
We now have open issues in the spec to handle this in the API/SDK.
Can we close this issue?
Closing. I assume nothing is needed in this repo.
> We now have open issues in the spec to handle this in the API/SDK.
Could you link any spec issues, for readers following this issue?
Re-opening, can't find the issue I was referring to. :-(