Invalid UTF-8 error handling policy
This is a draft following research done into
https://github.com/open-telemetry/opentelemetry-specification/issues/3421
and
https://github.com/open-telemetry/opentelemetry-specification/issues/3950.
@XSAM This was discussed in the Spec SIG today. There appears to be little support for binary attribute values. I think that is bad for users, but it is less of a problem if we automatically correct invalid UTF-8. Therefore, I will move forward with only half of this proposal.
@open-telemetry/specs-logs-approvers @open-telemetry/specs-metrics-approvers @open-telemetry/specs-trace-approvers
Please consider this updated OTEP.
The changes I have applied:
- The OTel group has already decided not to support byte-valued attributes: document this. (Tough!)
- Specific wording for SDK requirements: SHOULD be opt-out, SHOULD replace invalid sequences with U+FFFD (�), etc.
- Specific wording for Collector "behavior": SHOULD be opt-out, SHOULD follow each receiver for validation of external data, not recommended for processor manipulations.
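
To make the SDK wording above concrete, here is a minimal sketch of "replace invalid sequences with �" behind an opt-out switch. The function name `sanitize_attribute_value` and the `validate` parameter are hypothetical and not taken from the OTEP or any SDK; the sketch also assumes the attribute value reaches the SDK as raw bytes.

```python
def sanitize_attribute_value(raw: bytes, validate: bool = True) -> str:
    """Decode an attribute value, repairing invalid UTF-8 by default.

    `validate` is a hypothetical opt-out switch: when False, invalid bytes
    are carried through losslessly instead of being replaced.
    """
    if validate:
        # errors="replace" substitutes U+FFFD for each invalid sequence.
        return raw.decode("utf-8", errors="replace")
    # Opt-out: surrogateescape keeps the original bytes recoverable via
    # text.encode("utf-8", errors="surrogateescape").
    return raw.decode("utf-8", errors="surrogateescape")


valid = sanitize_attribute_value("café".encode("utf-8"))
repaired = sanitize_attribute_value(b"bad\xffbyte")
passthrough = sanitize_attribute_value(b"bad\xffbyte", validate=False)
```

With validation on, `repaired` is `"bad\ufffdbyte"`. With validation off, the invalid byte survives a round trip back to bytes, which is one way to read "permissive" here.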
@jsuereth and @reyang I appreciate the feedback. Both of you are, I think, suggesting that UTF-8 validation be an opt-in feature rather than an opt-out one. I support that motion. The most critical thing for me is that if the SDK is configured with a permissive stance (opt-out), the SDK "MUST" configure its underlying technologies to support that stance.
In other words, opting out does not mean doing nothing; it means explicitly configuring a pipeline to permit invalid UTF-8 unless a user opts in to UTF-8 validation.
When UTF-8 validation is selected (opt-in), it seems we have two options: (a) reject individual items, (b) correct invalid UTF-8. Do either of you think both of these options are worthwhile? I think (b) should be preferred, but I would accept (a) too.
> When UTF-8 validation is selected (opt-in), it seems we have two options: (a) reject individual items, (b) correct invalid UTF-8. Do either of you think both of these options are worthwhile? I think (b) should be preferred, but I would accept (a) too.
If we have very limited bandwidth, I think we should do (b); (a) can be added later if we see significant demand. One technical detail: for attribute values of string type, I think we should apply some correction, but for attribute names that contain invalid UTF-8, correction could be a very bad idea. I'm a bit on the fence here...
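
For illustration, the two options discussed above could be sketched as follows. Everything here is an assumption made for the example: the `Utf8Policy` enum, the `apply_policy` function, and the modeling of an item as a dict of attribute name to raw byte value. Attribute names are left untouched because the thread has not settled how to handle them.

```python
from enum import Enum


class Utf8Policy(Enum):
    REJECT = "reject"    # option (a): drop items containing invalid UTF-8
    CORRECT = "correct"  # option (b): repair invalid UTF-8 with U+FFFD


def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def apply_policy(items, policy):
    """Return the items that survive the policy, with values decoded to str.

    Each item is a dict mapping an attribute name (str) to a raw value (bytes).
    """
    kept = []
    for attrs in items:
        if policy is Utf8Policy.REJECT:
            # One invalid value is enough to reject the whole item.
            if not all(is_valid_utf8(v) for v in attrs.values()):
                continue
            kept.append({k: v.decode("utf-8") for k, v in attrs.items()})
        else:
            # Keep every item; invalid sequences become U+FFFD.
            kept.append(
                {k: v.decode("utf-8", errors="replace") for k, v in attrs.items()}
            )
    return kept


items = [{"user": b"alice"}, {"user": b"bob\xff"}]
rejected_view = apply_policy(items, Utf8Policy.REJECT)
corrected_view = apply_policy(items, Utf8Policy.CORRECT)
```

Under `REJECT` only the first item survives; under `CORRECT` both survive and the second value becomes `"bob\ufffd"`. The sketch shows why (b) is the less lossy default: no telemetry is dropped because of a single bad byte.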
OTEPs have been moved to the Specification repository. Please consider re-opening this PR against the new location. Closing.