opentelemetry-specification attributes: specify attribute string limits act on the grapheme level

I'm not 100% sure this addition is needed to make it clear but after seeing the issue that truncation can create invalid utf8 in some languages I wanted to ensure the implementations are consistent. In Erlang we consider the attribute value length limit to refer to grapheme clusters and truncate accordingly. I still need to check other implementations.

This PR does not cover what happens if a user of the API supplies invalid utf8 (drop the attribute I assume?) but I can add that somewhere.

Sep 09 '22 18:09 tsloughter

I would prefer to not burden SDKs with UTF-8 correctness. See the comment here: https://github.com/open-telemetry/opentelemetry-specification/issues/3421

Sep 09 '22 18:09 jmacd

truncation can create invalid utf8 in some languages

Byte-level truncation can create invalid utf8, period -- it's just that (we've noticed) protobuf libraries are not consistent, so this is a serious problem in some languages.

Sep 09 '22 18:09 jmacd

I meant specifically the functionality of "attribute string truncation can create...", not just "truncation".

Sep 09 '22 18:09 tsloughter

Maybe you meant code points instead of graphemes. A grapheme (cluster) is something very complex, you need an unicode database at hand to do anything in that regard. http://utf8everywhere.org/#characters

Sep 12 '22 08:09 Oberon00

No, I meant grapheme cluster. They are the characters that should be kept together. Breaking them up can change meaning.

Sep 12 '22 11:09 tsloughter

What about just saying to use the languages main utf8 library truncation function? Which would be cluster, like in Erlang's case, or coded character.

Sep 12 '22 12:09 tsloughter

I believe Erlang is an outlier here. A "default truncation function" will not exist for most languages I'm aware of. Many will use UTF-16 code units as their basic units though.

Also I think "Breaking them up can change meaning" is true, but not specific to mid-cluster truncation. Consider a long exception stacktrace that ends with the ASCII characters Line 12345 but after truncation, it becomes Line 123. This definitely changed meaning, even though no grapheme clusters were broken.

Sep 12 '22 14:09 Oberon00

There's two things that bug my about this PR:

The fact that I think you're implying returning invalid UTF-8 (sending one grapheme that expects a pair is invalid UTF-8 string). The spec does say: SDKs MUST truncate that value, so that its length is at most equal to the limit, but it also says String values which are not valid Unicode sequences SHOULD be converted to AnyValue's bytes_value with the bytes representing the string in the original order and format of the source string. So if you're truncating into invalid UTF-8, you need to use bytes, impliciltly.
We have no way to identify if a string was truncated by the instrumentation. As @Oberon00 calls out Line 12345 to Line 123 is very different.

I'd suggest we alter the specification slightly here. It's meant to leave implementation open to SDKs a bit. So here are two valid interpretations that I think a better for users:

We truncate with a known terminal string and cut the string as necessary to include this string (not just at grapheme level). So Line 12345 becomes Line{tr}. Note: the string REMAINS valid UTF-8
We truncate the ENTIRE string (acceptable by specification).

Either way I'd highly advice against returning invalid UTF-8. Not only does it violate the spec, but I believe it breaks protocol buffers.

Sep 12 '22 17:09 jsuereth

@jsuereth see https://github.com/open-telemetry/opentelemetry-specification/issues/3421

Sep 12 '22 17:09 jmacd

I went to check other languages and found https://blog.saeloun.com/2019/02/23/rails-6-string-truncate-bytes.html

Obv it is Rails and not Ruby, main point is the use of the ellipsis "✨…"

Also, for another issue/pr, but requires figuring this out first, if "VALUE_LENGTH" is not based on a set size of each "character" (EDIT: to be clear here, this would be a utf8 "code unit") in the string we may want a VALUE_BYTES_LIMIT to optionally give the user a way to set a limit to the byte size of the string value, while still keeping it as valid utf8.

Sep 12 '22 18:09 tsloughter

I really like the "..." solutions there...

Also, for another issue/pr, but requires figuring this out first, if "VALUE_LENGTH" is not based on a set size of each "character" (EDIT: to be clear here, this would be a utf8 "code unit") in the string we may want a VALUE_BYTES_LIMIT to optionally give the user a way to set a limit to the byte size of the string value, while still keeping it as valid utf8.

I agree, I think the intention here is byte-limits.

Sep 12 '22 19:09 jsuereth

I also think we should never truncate to invalid UTF-8, but grapheme clusters are something different (maybe you are thinking about UTF-16 surrogate pairs @jsuereth; these must not be broken either, that's relevant for many technologies like Java and .NET where a string consists of UTF-16 code units).

Sep 13 '22 07:09 Oberon00

Grapheme clusters aren't "different", they are just one way to truncate to valid utf8. The way to truncate where the visible characters to a user stays the same (either they are there or they are not, instead of potentially changing to a new character).

Sep 13 '22 11:09 tsloughter

A good example of grapheme clusters is 👩 + 👩 + 👦 + 👦, where + is a special combining mark (a zero width joiner) between two women and two boys. This results in a 👩‍👩‍👦‍👦 unicode family, where the 7 code points represent a single "character" (grapheme cluster).

Dropping a code unit means that you can end up with a weird string ("👩 + 👩 + 👦 +") or flat out a new family of 👩‍👩‍👦 instead, so you dropped a child and altered the character that was given.

Another example is how the rainbow flag 🏳️‍🌈 (U+1F3F3 U+FE0F U+200D U+1F308) becomes a white flag 🏳️ (16#1F3F3, 16#FE0F, 16#200D) when you truncate starting at any of the 3 last code points. That's because the rainbow flag is composed of a white flag + variation selector + zero-width joiner + rainbow.

As such any truncation that ignores the boundaries of a grapheme cluster result in legit unicode, but also alters the meaning of the string in ways that are indistinguishable if you add a trailing "..." -- because it still displays the character, but a wrong one.

Sep 13 '22 15:09 ferd

This PR was marked stale due to lack of activity. It will be closed in 7 days.

Sep 21 '22 04:09 github-actions[bot]

I'm going to add to this that there be an ellipsis added if the string is truncated.

Another option would be to change the proto to include "dropped bytes" like is done when anything else is dropped (attributes, links, etc).

Sep 21 '22 11:09 tsloughter

I've also updated it to allow truncation at the cluster or code point, but prefering the grapheme cluster.

I don't know if the ellipsis can actually be accepted as a change since it would mean the limit is now Limit + 1? Otherwise it would have to be minus one. Either way, the same string truncated in the future would be either longer in total or shorter in content than it is today.

Sep 21 '22 11:09 tsloughter

This PR was marked stale due to lack of activity. It will be closed in 7 days.

Sep 29 '22 04:09 github-actions[bot]

I don't know if the ellipsis can actually be accepted as a change since it would mean the limit is now Limit + 1? Otherwise it would have to be minus one. Either way, the same string truncated in the future would be either longer in total or shorter in content than it is today.

So the limit would be the limit you want from a data storage pespective. From an SDK standpoint, we'd be setting the REAL limit to limit-1 so we can add an elipsis, IIUC.

Oct 04 '22 13:10 jsuereth

This PR was marked stale due to lack of activity. It will be closed in 7 days.

Oct 12 '22 03:10 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

Oct 19 '22 04:10 github-actions[bot]