Define restrictions for attribute and label keys
The spec should define restrictions for the keys used in span/link/event/resource attributes as well as metric labels. This gives exporters, protocols and back ends a reliable definition of which keys they must expect to receive and process. Furthermore, from a user perspective, this assures users that the keys they define can be processed by every compliant exporter and back end. Without any restriction, it is highly likely that SDKs, exporters and back ends will introduce their own arbitrary limits or make assumptions about the data they receive. This could result in broken instrumentations when users define keys that are valid in one setup and then switch to another one that uses different limits or makes different assumptions.
I would propose defining the following restrictions:
- a clearly defined set of allowed characters
- case sensitivity
- a maximum length
For metric names we have already defined the first two of these, although I would like the character set to be specified more precisely. A length limit is currently missing. https://github.com/open-telemetry/opentelemetry-specification/blob/ac75cfea2243ac46232cbc05c595bb0c018e2b58/specification/api-metrics-user.md#L61-L66
OTel-Java, for example, rejects metric names and resource labels that are longer than 255 chars or contain non-printable characters. I doubt that this is aligned across the other language implementations.
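For illustration, here is a rough Python sketch of such a check; the exact OTel-Java behavior may differ, and the function name and defaults are assumptions:

```python
def is_valid_key(key: str, max_length: int = 255) -> bool:
    # Reject empty keys, keys over the length cap, and keys containing
    # non-printable characters (a rough analogue of the check described above).
    return 0 < len(key) <= max_length and key.isprintable()
```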
We will also have to decide on how we want to handle violations of those restrictions:
- reject the key silently
- replace the key with a default to make it easier to discover the error
- try to "fix" the name by removing/replacing illegal characters and cropping it to the maximum length
related: #446
I support a size limit of 255 bytes.
May we settle on the ANSI X3.4-1986 (i.e., "US-ASCII") character set for encoding?
Can we say the same for Span and Metric names as for attribute/resource/label keys? (😢 I was hoping to put emojis in my metric names.)
As for handling violations, I am in favor of specifying that the SDK MUST NOT reject the key, silently or otherwise. SDKs may take any of the other options but MUST yield a valid name somehow. For the default SDK, I'm in favor of trying to fix the key. For empty keys and null keys, I'm happy with "empty" and "null", but the file and line or class and method name would offer more help in locating the violation.
(ISO 646)
A limit of 255 sounds reasonable to me as well.
Restricting it to a subset of ASCII (non-control, non-space, ...) also seems most reasonable to me. This allows for maximum compatibility, since such names should be consumable by all possible back ends. I can't think of use cases that would call for other names (although emojis could of course be funny).
As for handling violations I'd like to hear some more opinions from others on here as well.
A question that came up in the spec SIG meeting was which strings this should apply to:
- Attribute and label keys?
- Span and event names?
- Metric label values?
- Other attribute values?
I want to make sure we are thinking about globalization. I have seen many customers who want to use their native language to name spans or custom attributes. If there are no technical reasons to limit the character set, I would argue the product will be more inclusive if we allow Unicode characters in names and especially in values. Operators using the collected data may be better off with names in their native language.
Size limits are very reasonable and needed. Size limits for string values are a good way to address security concerns and malicious attacks.
Choosing a non-7-bit character encoding will support internationalization, but it introduces a lot of complications for SDKs to be "Correct". For example, it becomes incorrect to truncate strings on an arbitrary byte offset. It also requires, sometimes and in some languages, the programmer to have explicit knowledge of the data they pass in--for example in C++, if we require strings to use a UTF8 encoding and the user supplies a string that is not UTF8-encoded, are we required to sanitize it? This seems to either introduce correctness challenges or implementation challenges.
I find myself quietly wishing for Unicode and UTF8 here, but it has a lot to do with working in Go long enough to forget these headaches in C++. Just think--we can put emojis in our span names! (sorry not sorry)
I don't think you'll find something that doesn't need sanitation.
Some system is not going to support slashes, and some system is not going to support hyphens, and some other system won't like quote characters.
Most telemetry libraries currently do client-side sanitation. I don't think you're going to get away from that without being draconian.
for example in C++, if we require strings to use a UTF8 encoding and the user supplies a string that is not UTF8-encoded, are we required to sanitize it?
Do the same thing you would do if the users passes NaN for the metric value. Blow up or ignore it or whatever is best for the language's paradigm, and document that decision.
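For the quoted C++ question, one lenient option, sketched here in Python for brevity (the function name is hypothetical), would be to replace invalid sequences rather than reject the value:

```python
def coerce_utf8(raw: bytes) -> str:
    # Replace any byte sequences that are not valid UTF-8 with U+FFFD
    # instead of rejecting the whole value.
    return raw.decode("utf-8", errors="replace")
```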
Some system is not going to support slashes, and some system is not going to support hyphens, and some other system won't like quote characters.
@pauldraper I would like to define more lenient restrictions covering what can be considered common, and leave further special restrictions (like not allowing slashes, hyphens or underscores) up to the exporters, which should replace strings according to the system they export to. This way, users can rely on the data they provide being handled by (at least almost) all systems, since exporters will be able to transform it to something valid. Having a minimal supported (and thus maximal recommended) length would also help a lot.
Regarding character set and encoding:
ASCII would be the easiest and most foolproof solution but I agree with @SergeyKanzhelev that we should take internationalization into account. It would feel odd to build a "cloud-native" solution in 2020 that doesn't allow international characters (or emojis 🤷♂).
@jmacd I don't think we need to define an encoding in the API. This is up to the exporter and wire protocol. We should only define the allowed character set (e.g., printable Unicode characters) and limit keys to a certain number of characters (not bytes) at which they are truncated. If a system only understands ASCII, then the exporter for that system should replace other characters accordingly. WDYT?
Has there been any decision?
I would vote for UTF8 for everything.
And if using UTF8 then the easiest is to limit on bytes, I think? But require that the truncation happen on grapheme cluster and not code-point. Meaning if a span has 👩‍👩‍👦‍👦 in the name it will either appear as the whole family or be removed entirely, not possibly truncate to 👩‍👩‍👦.
May also have to define normalization as NFC so that comparisons are the same.
A useful resource popped up on HN today: http://utf8everywhere.org/
People (incl myself 😉) are putting 😄-reactions under @tsloughter's comment, but these are all real problems. Unicode is complex.
require that the truncation happen on grapheme cluster
I think this requires (relatively) large Unicode databases and library support that not everyone has. I guess we can agree that we should at least only have valid UTF-8 after truncation (so no mid-codepoint truncation), because that is easy to implement in all languages (even without lib support if need be).
If truncation happens at all, that's already an error condition, so I wouldn't put too much effort into trying to make the error as small as possible.
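A minimal sketch of the code-point-safe variant suggested above: byte-limit truncation that never splits a code point, assuming the input string is valid UTF-8 (names are illustrative):

```python
def truncate_utf8(value: str, max_bytes: int) -> str:
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # Decoding with errors="ignore" drops an incomplete trailing multi-byte
    # sequence left by the cut, so the result is always valid UTF-8.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```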
May also have to define normalization as NFC so that comparisons are the same.
Unicode normalization has the same problem of being hard to implement. I think normalization is a nice-to-have, but should not be a MUST-requirement. There should be a recommendation for which normal form to use if you normalize though, and a recommendation for users to already normalize their input. Where this is IMHO most important will be metric labels (keys and especially values), because this influences client-side aggregation.
And there might be cases where you don't want to normalize, e.g. when you implement metrics for some Unicode normalization microservice (just pulling this example out of thin air here) and using the un-normalized input words as label values.
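If an SDK chooses to normalize, a sketch of applying the recommended NFC form (the helper name is made up):

```python
import unicodedata

def normalize_key(key: str) -> str:
    # NFC composes combining sequences so that visually identical keys,
    # e.g. "é" as one code point vs. "e" plus a combining acute accent,
    # compare and aggregate as equal.
    return unicodedata.normalize("NFC", key)
```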
I would be comfortable with a specification saying that everything is encoded in utf8, and that when truncating an SDK is permitted to use byte-level truncation, leaving an invalid code point. Consumers of this data should learn to deal with truncated content and mark it as such when presenting it to the user.
I have a couple issues with that. First, truncating doesn't necessarily leave an invalid code point; it could leave a valid code point but a grapheme that is different from the one the user used.
But also, this requires not just that the consumer library be able to properly handle UTF8, but also that it can handle a mixture of valid and invalid UTF8 and truncate the invalid part.
Well, http://utf8everywhere.org/ seems to say byte-level truncation is the way to go (says code-unit truncation but in the case of UTF8 that is a byte, so same thing). And I won't argue with that as the author knows much much more about UTF8 than I do :)
Aside from truncation we'll need a recommendation on what to do when exporting to something that only supports ASCII. Like, what should the Prometheus exporter do for utf8 metric names that go beyond the ASCII characters?
Maybe make the requirement/suggestion that such characters are replaced by _?
So that 流量 becomes __? Does not sound very helpful. We could use urlencoding to at least avoid introducing collisions. But I guess there is only so much we can do for ASCII-only systems.
Ah yea, good point about conflicts. I was thinking specifically about Prometheus, which doesn't simply limit to ASCII but to [a-zA-Z_:][a-zA-Z0-9_:]*, so URL encoding wouldn't be an option.
I guess in the case of a system that allows all of ASCII we could suggest URL encoding, and for something like Prometheus we could have our own form of URL encoding? As in, instead of % use :, which appears to be allowed in Prometheus names.
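A minimal sketch of that colon-as-escape idea (purely illustrative; not part of any exporter today):

```python
import re

_FIRST = re.compile(r"[a-zA-Z_]")   # ':' is reserved as the escape character
_REST = re.compile(r"[a-zA-Z0-9_]")

def prometheus_escape(name: str) -> str:
    # Escape every disallowed character as ':' followed by the hex bytes of
    # its UTF-8 encoding. The result stays within [a-zA-Z_:][a-zA-Z0-9_:]*
    # and avoids collisions because ':' itself is always escaped.
    out = []
    for i, ch in enumerate(name):
        allowed = _FIRST if i == 0 else _REST
        if allowed.fullmatch(ch):
            out.append(ch)
        else:
            out.append("".join(f":{b:02x}" for b in ch.encode("utf-8")))
    return "".join(out)

# prometheus_escape("流量") == ":e6:b5:81:e9:87:8f"
```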
It would feel odd to build a "cloud-native" solution in 2020 that doesn't allow international characters (or emojis man_shrugging).
You're going to have guidelines for permitted characters and those will include EMOJIS?
Like, it's not safe to use a question mark, but it is safe to use a unicycle?
I'm confused by the encoding thing. E.g. https://github.com/open-telemetry/opentelemetry-python has attributes/labels as character strings. AFAIK they aren't ever encoded into bytes (except by whatever Tracer implementation is being used, which is a free variable as far as the specification is concerned). Should they be byte strings instead??
Aren't python strings now always utf8? Is there a difference with "byte strings"?
Aren't python strings now always utf8?
No, they are character strings. (If you mean the internal in-memory representation....I think CPython actually uses byte arrays of various encodings, but that's never exposed to the user.)
You can convert a character string (character sequence) to a byte string (byte sequence) by choosing an encoding and calling str.encode(encoding). AFAIK the Python OTel API is all character strings.
As, I imagine, many languages' APIs are: Ruby, Rust, Go. (Java, JS and C++ aren't, because they've never really gotten a proper Unicode character string....just byte or double-byte arrays.)
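To make the character string vs. byte string distinction concrete (Python; the values are just illustrative):

```python
span_name = "流量"                       # str: a sequence of code points
wire_bytes = span_name.encode("utf-8")  # bytes: b'\xe6\xb5\x81\xe9\x87\x8f'

len(span_name)   # 2 characters
len(wire_bytes)  # 6 bytes; a character limit and a byte limit differ here
```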
Has there been any movement on this outside of this issue? I don't see any PRs linked to so probably not?
I had another thought for replacing utf8 chars when exporting to services like Prometheus, instead of using _: replace the code point with its hexadecimal representation.
So 流量 becomes 0xe60xb50x810xe90x870x8f or, e6b581e9878f.
And I just realized urlencoding was already mentioned which is essentially the same but with % which Prometheus doesn't allow.
But at least reviving this issue!
So we would allow all (printable?) Unicode characters in the API and exporters are responsible themselves for applying any encoding needed to be ingested by their respective target backend? This sounds reasonable to me.
As initially proposed, I think we should still define general sanity restrictions on length and decide on case-(in)sensitivity of attribute and label keys as well as span name, including a definition on how violations have to be handled (see issue description for proposed options).
I would vote for being case sensitive. The only thing we might do beyond codepoint-to-codepoint comparison is some Unicode normalization. Case insensitivity is only feasible for e.g. ASCII-compatible characters, and it would be surprising if Overruns was aggregated with overruns but Überläufe not with überläufe. Note that it is not only high-effort but actually impossible to do proper case-insensitive comparison without knowing the language. E.g. i (yes, plain old ASCII i) is I in most languages when uppercased, but in Turkish, the same character becomes İ. https://en.wikipedia.org/wiki/Dotted_and_dotless_I
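A small example of why this is language-dependent (Python's casefold() is locale-independent):

```python
"I".casefold() == "i".casefold()  # True, but under Turkish rules I pairs with ı, not i
"İ".casefold() == "i".casefold()  # False, although under Turkish rules İ does pair with i
```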
@arminru right.
Does the OTLP protocol have size limits on strings like metric name, attributes and labels? If so we should just use the same for any restriction within the API/SDK.
Agree with @Oberon00 on case sensitivity.
And to do so we need to define the normalization each library must use? NFC or NFD is probably fine? I've seen references to NFKC being preferred for identifiers for security reasons, but that must refer to something like username identifiers? So no need to go that route for metric names, attributes or labels.
Just came here from an internal discussion and was surprised this is (a) not resolved, and (b) classified after-GA.
@AndrewAXue you removed the after-ga label, what was the justification for that?
For the reference, W3C Baggage spec (https://w3c.github.io/baggage/) only allows US_ASCII for baggage keys. Granted, baggage has more narrow scope than trace attributes.