client_rust Performance regresssion with (&'static str, &'static str) label set.

Hi, thanks for building the Prometheus client in Rust. Really love it.

When doing the benchmark, I added another test case with (&'static str, &'static str) label set as follows:

    c.bench_function(
        "counter family with (&'static str, &'static str) label set",
        |b| {
            let family = Family::<(&'static str, &'static str), Counter>::default();

            b.iter(|| {
                family.get_or_create(&("method", "GET")).inc();
            })
        },
    );

Considering there should be no dynamic allocation when accessing the metric with (&'static str, &'static str) label set, I suppose the performance would be the same as the custom type label set.

But the benchmark turned out oppositely. (27ns vs 15ns on avg)

     Running benches/family.rs (target/release/deps/family-f5815167dcf70f4a)
Gnuplot not found, using plotters backend
counter family with [(&'static str, &'static str)] label set
                        time:   [41.655 ns 42.005 ns 42.483 ns]
                        change: [-1.0231% +1.7906% +7.1726%] (p = 0.51 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

counter family with Vec<(&'static str, &'static str)> label set
                        time:   [62.579 ns 63.896 ns 65.827 ns]
                        change: [-1.8609% +0.9648% +4.0991%] (p = 0.58 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

counter family with (&'static str, &'static str) label set
                        time:   [21.716 ns 21.771 ns 21.833 ns]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

counter family with Vec<(String, String)> label set
                        time:   [204.69 ns 208.56 ns 214.56 ns]
                        change: [-5.1373% -3.8646% -2.4135%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

counter family with custom type label set
                        time:   [14.648 ns 15.786 ns 17.877 ns]
                        change: [-8.7872% -2.4706% +6.7788%] (p = 0.61 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

I'm not sure where the regression comes from. Is it as expected or did I miss something?

Thanks again for your help. 🙏

Nov 26 '24 09:11 MrCroxx

~Does the overhead come from constructing the tuple? (e.g. &("method", "GET"))~

~IIUC, "method" and "GET" are &'static str, which does not need allocation, but the tuple is not?~

Nov 26 '24 09:11 MrCroxx

Appreciate your curiosity!

On a high level, I don't think it matters for a user whether an operation takes 40 or 20 nano seconds here.

That said, indeed, this is counterintuitive. But I am not able to reproduce it:

     Running benches/family.rs (/home/mxinden/code/github.com/prometheus/client_rust/target/release/deps/family-95f4365191b2b975)
Gnuplot not found, using plotters backend
counter family with [(&'static str, &'static str)] label set
                        time:   [57.968 ns 57.987 ns 58.008 ns]
Found 19 outliers among 100 measurements (19.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  8 (8.00%) high severe

counter family with (&'static str, &'static str) label set
                        time:   [34.901 ns 34.903 ns 34.905 ns]
Found 15 outliers among 100 measurements (15.00%)
  13 (13.00%) high mild
  2 (2.00%) high severe

Are you able to reproduce (&'static str, &'static str) being slower on consecutive runs?

Nov 28 '24 19:11 mxinden

Are you able to reproduce (&'static str, &'static str) being slower on consecutive runs?

Yep. Unfortunately, it is a stable regression with my machine. I'm running the benchmark with cargo bench on my MacBook Air with M2 chip. Let me try it on my linux server to see if the regression still exists.

Dec 03 '24 03:12 MrCroxx

On a high level, I don't think it matters for a user whether an operation takes 40 or 20 nano seconds here.

I agree that 20+ ns vs 15+ ns looks close digitally, but the overhead is 33% more in ratio. 🥹

Dec 03 '24 03:12 MrCroxx

BTW, can I open a PR to add the test case?

Dec 03 '24 03:12 MrCroxx

Maybe use https://play.rust-lang.org/ to take a look at the ASM and LLVM IR code. The reason might be obvious from there?

BTW, can I open a PR to add the test case?

I don't think the test itself, without an explanation for the regression would be helpful.

Dec 03 '24 08:12 mxinden