client_rust icon indicating copy to clipboard operation
client_rust copied to clipboard

feat(encoding): implement UTF-8 support in metric and label names

Open fedetorres93 opened this issue 1 year ago • 14 comments

Adds UTF-8 support for metric and label names. Addresses https://github.com/prometheus/client_rust/issues/190.

These changes are based on the work done on the Prometheus common libraries https://github.com/prometheus/common/pull/537 and https://github.com/prometheus/common/pull/570

  • Encoders will use the new quoting syntax {"foo"} iff the metric does not conform to the legacy name format (foo{})

  • The Registry struct has two new fields: name_validation_scheme which determines if validation is done using the legacy or the UTF-8 scheme and escaping_scheme which determines the escaping scheme used by default when scrapers don't support UTF-8.

  • Scrapers can announce via content negotiation that they support UTF-8 names by adding escaping=allow-utf-8 in the Accept header. In cases where UTF-8 is not available, metric providers can be configured to escape names in a few different ways: values (U__ UTF value escaping for perfect round-tripping), underscores (all invalid chars become _), dots (dots become _dot_, _ becomes __, all other values become ___). Escaping can either be a global default (escaping_scheme, as mentioned above) or can also be specified in the Accept header with the escaping= term, which can be allow-utf-8 (for UTF-8-compatible), underscores, dots, or values. Existing functionality is maintained.

Work towards https://github.com/prometheus/prometheus/issues/13095.

fedetorres93 avatar Nov 05 '24 15:11 fedetorres93

@mxinden Please take a look, thanks!

fedetorres93 avatar Nov 11 '24 18:11 fedetorres93

@mxinden I just wanted to check in regarding this PR. Did you have a chance to take an initial look?

Looking forward to any feedback you might have. Thanks!

fedetorres93 avatar Dec 09 '24 19:12 fedetorres93

Hi @fedetorres93,

I am sorry for the delay. I appreciate the time you invested making a solid patch!

As I assume many of us do, I am maintaining this crate in my free time. Skimming the proposal, I don't see it adding a lot of value and thus I don't see myself prioritizing this work over other patches on this repository.

Now me not investing time into this obviously doesn't mean other can't. Maybe you can find a Prometheus maintainer to champion this work and do extensive reviews.

On a high level, I would want this work to (a) not introduce significant complexity to the library and (b) no performance regressions to the hot paths (metric recording and metric encoding).

mxinden avatar Dec 13 '24 15:12 mxinden

Hi @mxinden , thanks for your reply and continued maintenance of this library! With the release of Prometheus 3.0, one of our goals in adding support for UTF-8 everywhere in Prometheus includes updating all of the officially-supported client libraries, of which Rust is one. One issue we've had with getting these libraries updated is finding the necessary language experts such as yourself to certify that the changes we're making are safe and well-written. So far, you're the only person we've found who is both an expert in Rust and Prometheus.

As far as correctness goes, I can help do a review to make sure the test cases are covering all of the situations where quoting and escaping are required. For performance, I'm guessing there are ways of benchmarking Rust but I am not familiar with them. @fedetorres93 and I can work to try to generate those numbers. But for language correctness, we need a Rust expert to verify that we're adhering to Rust style.

I think splitting up the work this way will keep the burden on you as low as possible. Fede and I can go back and forth on the test coverage and performance and then call upon you again when we think we're in good shape. Does that sound workable?

ywwg avatar Dec 13 '24 16:12 ywwg

Works for me. Thanks!

mxinden avatar Dec 14 '24 16:12 mxinden

Thank you for your flexibility! @fedetorres93 let's start by really ramping up on unit tests, as I've worked on the Go common library I've had to add a bunch of edge cases

ywwg avatar Dec 16 '24 15:12 ywwg

@mxinden I added more tests to cover more edge cases. I think we're in better shape now, in case you want to take a look to see if we're okay regarding language correctness.

As for performance, the only difference I see in the benchmarks is some regression in text encoding (about +15 ms), which can be attributed to the new name validation and escaping.

encode                  time:   [43.848 ms 44.070 ms 44.317 ms]
                        change: [+37.747% +39.110% +40.430%] (p = 0.00 < 0.05)
                        Performance has regressed.

Please let me know if you think I may have missed some optimization opportunities since I'm not that familiar with Rust.

Thanks!

fedetorres93 avatar Jan 20 '25 15:01 fedetorres93

I pushed some changes by @sd2k (thanks for your help!) which improve the text encode benchmark results I shared before:

encode                  time:   [37.371 ms 37.433 ms 37.499 ms]
                        change: [+33.474% +34.031% +34.591%] (p = 0.00 < 0.05)
                        Performance has regressed.

fedetorres93 avatar Jan 22 '25 18:01 fedetorres93

encode time: [37.371 ms 37.433 ms 37.499 ms] change: [+33.474% +34.031% +34.591%] (p = 0.00 < 0.05) Performance has regressed.

A 35% performance regression seems significant, especially for a feature that the majority of users won't need. Do you see any ways to improve this? With the latest optimizations, is this inherent to using UTF-8?

mxinden avatar Jan 23 '25 13:01 mxinden

I tried other possible optimizations but didn't get better results than the last ones.

One thing to bear in mind is that, in the current version of the client, no validations are performed on metric or label names, so if I register a counter with a name like requests.foo*😱 and use the label name method.bar*😱 I'd get the following after the encoding, which aren't valid names before UTF-8 support (and don't use the proper syntax for UTF-8):

# HELP requests.foo*😱 Count of requests.
# TYPE requests.foo*😱 counter
requests.foo*😱_total{method.bar*😱="Get"} 1
# EOF

With these changes, even if you choose not to use the UTF-8 validation scheme, these names would be validated and properly escaped using some escaping scheme during encoding:

# HELP requests_foo__ Count of requests.
# TYPE requests_foo__ counter
requests_foo___total{method_bar__="Get"} 1
# EOF

fedetorres93 avatar Jan 29 '25 20:01 fedetorres93

One thing to bear in mind is that, in the current version of the client, no validations are performed on metric or label names, so if I register a counter with a name like requests.foo*😱 and use the label name method.bar*😱 I'd get the following after the encoding, which aren't valid names before UTF-8 support (and don't use the proper syntax for UTF-8):

# HELP requests.foo*😱 Count of requests.
# TYPE requests.foo*😱 counter
requests.foo*😱_total{method.bar*😱="Get"} 1
# EOF

Not doing metric name validation in release mode is by design, for the sake of performance. Adding validation in debug mode is discussed in https://github.com/prometheus/client_rust/issues/52.

mxinden avatar Feb 07 '25 09:02 mxinden

The validation is necessary to determine whether the name needs to be quoted, or can remain unquoted. If you'd prefer not to do validation, then we could just always quote the names and that will work for all names. That would require that scraping systems support the new quoting syntax (prom 3.0 and later with UTF-8 validation turned on).

It's not clear to me from the micro benchmarks how much of a "performance issue" this is in reality. I am not convinced that multiplying a tiny number by 138% will actually result in perceivable increases in cpu usage or latency. What are your specific concerns when it comes to the performance of this code path?

ywwg avatar Feb 07 '25 18:02 ywwg

This is the validation code, are there faster ways in Rust of doing the same thing? It could also be that our Go transliteration is inefficient https://github.com/prometheus/client_rust/pull/236/files#diff-f22a711df8ea59f8d7c4ee4242a7a2b9477943fb90526ec1f086e2a6ed70b162R784-R798

ywwg avatar Feb 07 '25 19:02 ywwg

The validation is necessary to determine whether the name needs to be quoted, or can remain unquoted.

This doesn't have to happen at scrape time, but could be done when creating the metric and its labels, couldn't it?

(Sorry for shouting from the sidelines. I haven't checked the internal of client_rust.)

beorn7 avatar Feb 11 '25 15:02 beorn7