oteps Invalid UTF-8 error handling policy

This is a draft following research done into

https://github.com/open-telemetry/opentelemetry-specification/issues/3421

and

https://github.com/open-telemetry/opentelemetry-specification/issues/3950.

May 10 '24 23:05 jmacd

@XSAM This was discussed in the Spec SIG today. There does not appear to be much support for binary-attribute values. I think that's bad for users, but it's not so bad if we automatically correct invalid UTF-8. Therefore, I will move forward with only half of this proposal.

May 14 '24 15:05 jmacd

@open-telemetry/specs-logs-approvers @open-telemetry/specs-metrics-approvers @open-telemetry/specs-trace-approvers

Please consider this updated OTEP.

The changes I have applied:

- The OTel group has already decided not to support byte-valued attributes: document this. (Tough!)
- Specific wording for SDK requirements: SHOULD be opt-out, SHOULD replace invalid sequences with � (U+FFFD), etc.
- Specific wording for Collector "behavior": SHOULD be opt-out, SHOULD follow each receiver for validation of external data, not recommended for processor manipulations.
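
As an illustration of the "replace invalid sequences" wording above, here is a minimal sketch in Python. The function name `sanitize_utf8` is hypothetical and not part of any OTel SDK; it relies only on the standard `bytes.decode` error handler `"replace"`, which substitutes U+FFFD for bytes that cannot be decoded.

```python
def sanitize_utf8(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, substituting U+FFFD for invalid sequences.

    Hypothetical helper for illustration only; an actual SDK may apply
    the correction at a different layer (for example, in the exporter).
    """
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Valid input passes through unchanged.
    print(sanitize_utf8("café".encode("utf-8")))
    # 0xFF can never appear in well-formed UTF-8, so it is replaced.
    print(sanitize_utf8(b"caf\xff"))
```

Valid strings are returned unchanged, so applying the correction by default costs nothing for well-formed data beyond the validation pass itself.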

Oct 10 '24 23:10 jmacd

@jsuereth and @reyang I appreciate the feedback. Both of you are, I think, suggesting that UTF-8 validation be an opt-in feature instead of an opt-out one. I support that motion. The most critical thing for me is that if the SDK is configured with a permissive stance (opt-out), the SDK "MUST" configure its underlying technologies to support that stance.

In other words, opting out does not mean doing nothing: it means explicitly configuring a pipeline to permit invalid UTF-8 unless a user opts in to UTF-8 validation.

When UTF-8 validation is selected (opt-in), it seems we have two options: (a) reject individual items, (b) correct invalid UTF-8. Do either of you think both of these options are worthwhile? I think (b) should be preferred, but I would accept (a) too.
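
The two options can be sketched as follows, in Python. This is only an illustration under assumed names: `InvalidUtf8Policy` and `validate_value` are hypothetical and do not correspond to any existing OTel configuration or API.

```python
from enum import Enum
from typing import Optional


class InvalidUtf8Policy(Enum):
    """Hypothetical policy names for the two options discussed above."""
    REJECT = "reject"    # option (a): drop the individual item
    CORRECT = "correct"  # option (b): replace invalid sequences with U+FFFD


def validate_value(raw: bytes, policy: InvalidUtf8Policy) -> Optional[str]:
    """Return the decoded string, or None if the item is rejected."""
    try:
        # Strict decoding raises on the first invalid sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if policy is InvalidUtf8Policy.REJECT:
            return None
        # Option (b): keep the item, substituting U+FFFD for bad bytes.
        return raw.decode("utf-8", errors="replace")
```

Under this sketch, option (b) never loses an item, while option (a) requires the caller to decide what a rejected item means (dropping the attribute, the whole record, or reporting an error).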

Oct 11 '24 22:10 jmacd

> When UTF-8 validation is selected (opt-in), it seems we have two options: (a) reject individual items, (b) correct invalid UTF-8. Do either of you think both of these options are worthwhile? I think (b) should be preferred, but I would accept (a) too.

I think if we have very limited bandwidth, we should do (b); (a) can be added later if we see a huge demand. One technical detail: for attribute values of string type, I think we should do some correction, but for attribute names that contain invalid UTF-8, correcting them could be a very bad idea. I'm a bit on the fence here...

Oct 11 '24 22:10 reyang

OTEPs have been moved to the Specification repository. Please consider re-opening this PR against the new location. Closing.

Dec 04 '24 15:12 carlosalberto