Nexmark benchmark has unstable performance
Here are the worst offenders (stdev > 2) from 10 runs on the CI machine, using the same invocation as the ci.bash script:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 7 | 29.5857 | 5.72089 | 24.0367 | 38.6235 |
| 13 | 35.6026 | 6.49526 | 28.4508 | 47.6523 |
| 18 | 125.103 | 2.85732 | 119.044 | 127.782 |
| 19 | 108.673 | 4.28322 | 102.351 | 118.69 |
All of the queries, for completeness:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 0 | 17.8558 | 0.225236 | 17.704 | 18.4677 |
| 1 | 17.6888 | 0.0603948 | 17.5839 | 17.7779 |
| 2 | 17.9274 | 0.299351 | 17.6874 | 18.668 |
| 3 | 18.1124 | 0.256843 | 17.7715 | 18.706 |
| 4 | 30.6734 | 0.495057 | 29.5523 | 31.2966 |
| 5 | 18.155 | 0.258037 | 17.9003 | 18.8023 |
| 6 | 31.0295 | 0.843263 | 29.7408 | 32.6297 |
| 7 | 29.5857 | 5.72089 | 24.0367 | 38.6235 |
| 8 | 18.6855 | 0.119781 | 18.5606 | 18.9273 |
| 9 | 178.051 | 1.7719 | 176.371 | 181.136 |
| 12 | 19.4357 | 0.208011 | 19.1634 | 19.8744 |
| 13 | 35.6026 | 6.49526 | 28.4508 | 47.6523 |
| 14 | 17.7943 | 0.323826 | 17.5242 | 18.518 |
| 15 | 19.4828 | 0.371601 | 18.9686 | 20.2441 |
| 16 | 138.703 | 0.316535 | 138.17 | 139.102 |
| 17 | 34.0763 | 0.462141 | 32.9988 | 34.5786 |
| 18 | 125.103 | 2.85732 | 119.044 | 127.782 |
| 19 | 108.673 | 4.28322 | 102.351 | 118.69 |
| 20 | 98.7908 | 1.46687 | 96.5369 | 101.601 |
| 21 | 19.9293 | 0.201926 | 19.752 | 20.2957 |
| 22 | 17.7938 | 0.096477 | 17.6597 | 18.006 |
With 1 core, 4 generators, 20M events, things are fine:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 0 | 7.70756 | 0.0121255 | 7.69477 | 7.73248 |
| 1 | 11.2202 | 0.0302091 | 11.1786 | 11.2689 |
| 2 | 7.80124 | 0.0169123 | 7.78096 | 7.82983 |
| 3 | 8.28608 | 0.0256119 | 8.2513 | 8.32688 |
| 4 | 13.9969 | 0.0648266 | 13.9327 | 14.1596 |
| 5 | 8.15482 | 0.014006 | 8.1358 | 8.18143 |
| 6 | 15.1791 | 0.0764544 | 14.987 | 15.2693 |
| 7 | 18.2843 | 0.0545263 | 18.1988 | 18.3711 |
| 8 | 8.3394 | 0.0280169 | 8.31073 | 8.39335 |
| 9 | 57.2127 | 0.185577 | 56.9653 | 57.5052 |
| 12 | 10.9224 | 0.028771 | 10.8843 | 10.993 |
| 13 | 21.2305 | 0.154159 | 20.9139 | 21.3537 |
| 14 | 11.36 | 0.016742 | 11.3369 | 11.3922 |
| 15 | 16.7001 | 0.0136383 | 16.6682 | 16.7181 |
| 16 | 119.59 | 0.221919 | 119.163 | 119.981 |
| 17 | 26.7408 | 0.0783941 | 26.6099 | 26.867 |
| 18 | 30.2228 | 0.213105 | 29.921 | 30.5853 |
| 19 | 38.9964 | 0.235482 | 38.5843 | 39.2574 |
| 20 | 24.9274 | 0.173651 | 24.7215 | 25.2335 |
| 21 | 13.0763 | 0.0133133 | 13.0526 | 13.0995 |
| 22 | 9.52553 | 0.0118613 | 9.50356 | 9.54437 |
It seems like problems start with any number of cores > 1.
2 cores, 4 generators:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 13 | 29.6543 | 5.1912 | 24.5159 | 34.6646 |
| 16 | 137.624 | 2.98478 | 129.609 | 139.561 |
3 cores, 4 generators:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 13 | 19.821 | 4.50558 | 14.9431 | 28.5657 |
| 18 | 29.6797 | 2.94843 | 26.961 | 34.9978 |
4 cores, 6 generators:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 13 | 29.9042 | 2.44664 | 22.957 | 30.9573 |
| 16 | 115.823 | 2.00773 | 113.875 | 119.531 |
@gz and I wrote a quick bench of the Nexmark generator in isolation, and it appears to have inconsistent performance depending on:
- the setting for num_event_generators
- how many events have been generated so far.
In the following, each line is a single-threaded NexmarkGenerator run configured with a different num_event_generators, and each datapoint in a line is the average throughput over 2M generated events: the first datapoint covers the first 2M events, the second covers events 2M to 4M, and so on.
The performance is predictable across runs, but appears to depend on the event ids of the events being generated.
```
num_event_generators = 1: 4816330.26 1144862.92 2307019.61 5137148.50 893611.64 hz
num_event_generators = 1: 5096182.48 1141302.12 2304576.70 5139885.19 896300.56 hz
num_event_generators = 1: 5077263.25 1133725.47 2285421.86 5086664.58 893150.26 hz
[...]
num_event_generators = 2: 7410578.64 1281839.59 1707982.83 7423415.87 1282788.71 hz
num_event_generators = 2: 7413553.35 1282431.70 1696598.43 7418033.05 1278006.75 hz
num_event_generators = 2: 7410929.90 1281895.98 1707797.73 7416785.99 1282502.35 hz
[...]
num_event_generators = 3: 5084027.86 1134391.37 2286589.83 5106829.56 890007.10 hz
num_event_generators = 3: 5078404.75 1135183.49 2294638.90 5126034.24 898400.62 hz
num_event_generators = 3: 5123281.64 1141378.24 2289420.31 5090503.86 895219.45 hz
```
The code for this is here: https://github.com/vmware/database-stream-processor/pull/282
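For reference, the measurement boils down to something like the sketch below (this is not the actual code from that PR; the produce_event closure is a placeholder standing in for NexmarkGenerator::next_event):

```rust
use std::time::Instant;

/// Print the average throughput (events/sec) for each `window`-sized chunk
/// of `total` events produced by `produce_event`.
fn windowed_throughput(total: u64, window: u64, mut produce_event: impl FnMut()) {
    let mut produced = 0u64;
    let mut window_start = Instant::now();
    while produced < total {
        produce_event();
        produced += 1;
        if produced % window == 0 {
            let hz = window as f64 / window_start.elapsed().as_secs_f64();
            print!("{hz:.2} ");
            window_start = Instant::now();
        }
    }
    println!("hz");
}

fn main() {
    // Placeholder workload; the real harness calls generator.next_event()
    // on a NexmarkGenerator configured with the given num_event_generators.
    let mut x = 0u64;
    windowed_throughput(10_000_000, 2_000_000, || x = x.wrapping_add(1));
}
```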
Thanks @utaal and @gz! This does seem odd to me. The Nexmark generator has all kinds of interesting throttling logic, but your benchmark isn't using any of that as it just runs one single-threaded instance of NexmarkGenerator.
@absoludity, I know it's been a while since you worked on this, but I was wondering if you have any idea what could be going on here?
Hi there - yes sure, I'd love an excuse to look back here :)
Yes, looking at the pattern, it's consistently periodic within each set (for a specific number of generators), which smells of something that is ticking over... like a non-random generator that is incrementing and wrapping around :)
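For context, a minimal sketch of what the mock rng does (assuming rand 0.8's rngs::mock::StepRng): there is no randomness at all, it just returns its current u64 and then adds a fixed increment, wrapping on overflow.

```rust
use rand::{rngs::mock::StepRng, RngCore};

fn main() {
    // StepRng::new(initial, increment): next_u64() returns the current
    // value, then adds `increment` (wrapping) for the next call.
    let mut rng = StepRng::new(0, 1);
    let first: Vec<u64> = (0..5).map(|_| rng.next_u64()).collect();
    assert_eq!(first, vec![0, 1, 2, 3, 4]);

    // With an increment of 0, the "random" value never changes.
    let mut constant = StepRng::new(0, 0);
    assert!((0..5).all(|_| constant.next_u64() == 0));
}
```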
If I run the same command (thanks for including the code in the linked PR) with this change (i.e. not incrementing the mock random number generator, so the StepRng just holds a constant internally):
```diff
diff --git a/benches/nexmark-gen/main.rs b/benches/nexmark-gen/main.rs
index 8247934..b4a43b0 100644
--- a/benches/nexmark-gen/main.rs
+++ b/benches/nexmark-gen/main.rs
@@ -27,7 +27,7 @@ fn main() -> Result<()> {
},
..Config::default()
};
- let mut generator = NexmarkGenerator::new(config, StepRng::new(0, 1), 0);
+ let mut generator = NexmarkGenerator::new(config, StepRng::new(0, 0), 0);
print!("num_event_generators = {num_event_generators}: ");
I get the following, more consistent results:
```
cargo bench --bench nexmark-gen --features="with-nexmark"
Finished bench [optimized + debuginfo] target(s) in 0.15s
Running benches/nexmark-gen/main.rs (target/release/deps/nexmark_gen-eca28ca9c1692658)
num_event_generators = 1: 4536998.07 4763793.38 4688958.31 4680736.85 4622709.62 hz
num_event_generators = 1: 4851469.37 4844114.19 4682973.02 4552660.53 4560906.51 hz
num_event_generators = 1: 4492503.96 4808102.23 4663286.39 4591567.07 4440258.58 hz
[...]
num_event_generators = 2: 4423671.30 4903945.28 4875307.81 4853178.01 4884811.40 hz
num_event_generators = 2: 4684526.34 4525580.48 4375612.90 4383071.28 4275201.73 hz
num_event_generators = 2: 4582011.08 4503172.84 4825768.50 4944503.63 4853727.82 hz
[...]
num_event_generators = 3: 4937228.55 4760260.35 4799119.30 4843592.55 4763545.15 hz
num_event_generators = 3: 4853271.26 4786734.60 4727096.77 4716394.81 4785356.35 hz
num_event_generators = 3: 4791791.84 4764923.75 4836950.47 4842538.89 4725423.40 hz
[...]
```
So I predict that if you test the mock StepRng's implementation of gen_range (and/or any other RngCore methods we use here), you'll find that the elapsed time increases as the StepRng's internal u64 increases, until it wraps around again.
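One quick way to check that prediction would be something along these lines (my own sketch, assuming rand 0.8's gen_range(range) signature; not code from the repo): time a fixed number of gen_range calls for StepRngs started at different internal values.

```rust
use rand::{rngs::mock::StepRng, Rng};
use std::time::Instant;

fn main() {
    // If the prediction holds, the elapsed time should track the starting
    // value of the StepRng's internal u64.
    for start in [0u64, u64::MAX / 4, u64::MAX / 2, u64::MAX - 1_000_000] {
        let mut rng = StepRng::new(start, 1);
        let t = Instant::now();
        let mut checksum = 0u64;
        for _ in 0..1_000_000u64 {
            checksum = checksum.wrapping_add(rng.gen_range(0..1024));
        }
        println!("start={start:>20} elapsed={:?} (checksum {checksum})", t.elapsed());
    }
}
```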
Another test would be to switch to the thread_rng, where we'd expect no period, just evenly distributed elapsed times. So running with:
```diff
diff --git a/benches/nexmark-gen/main.rs b/benches/nexmark-gen/main.rs
index 8247934..152d9b3 100644
--- a/benches/nexmark-gen/main.rs
+++ b/benches/nexmark-gen/main.rs
@@ -27,7 +27,7 @@ fn main() -> Result<()> {
},
..Config::default()
};
- let mut generator = NexmarkGenerator::new(config, StepRng::new(0, 1), 0);
+ let mut generator = NexmarkGenerator::new(config, rand::thread_rng(), 0);
print!("num_event_generators = {num_event_generators}: ");
we get consistent results as well, though note that the rng is now a significant consumer of CPU time:
```
cargo bench --bench nexmark-gen --features="with-nexmark"
Compiling dbsp v0.1.0 (/home/michael/dev/vmware/database-stream-processor)
warning: unused import: `rand::rngs::mock::StepRng`
--> benches/nexmark-gen/main.rs:6:5
|
6 | use rand::rngs::mock::StepRng;
| ^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: `dbsp` (bench "nexmark-gen") generated 1 warning
Finished bench [optimized + debuginfo] target(s) in 4.47s
Running benches/nexmark-gen/main.rs (target/release/deps/nexmark_gen-eca28ca9c1692658)
num_event_generators = 1: 1728821.68 1737448.25 1729492.38 1724809.79 1711621.66 hz
num_event_generators = 1: 1716472.95 1734710.36 1717877.07 1705638.06 1790052.31 hz
num_event_generators = 1: 1802534.80 1783204.56 1719134.73 1778285.74 1741891.40 hz
[...]
num_event_generators = 2: 1717869.98 1688106.30 1687786.88 1683899.78 1679868.42 hz
num_event_generators = 2: 1736636.88 1825340.98 1737837.61 1670449.03 1676080.63 hz
num_event_generators = 2: 1710243.81 1709950.99 1692023.62 1696556.19 1705105.06 hz
[...]
num_event_generators = 3: 1753594.91 1757055.92 1720779.28 1730545.30 1744094.04 hz
num_event_generators = 3: 1661194.03 1754594.32 1740970.11 1722975.19 1622709.08 hz
num_event_generators = 3: 1510273.14 1557155.04 1513515.73 1514255.19 1524858.61 hz
[...]
```
I've not tried to run this in isolation, but would be interested to see the results (as my machine has a bunch of other things running). Note that all the above is with a single core, so it's perhaps not surprising that the frequency isn't increasing with more generators.
StdRng can be used for slightly better performance (or even SmallRng, but it looks like we've not updated to the latest rand package to use it).
> The performance is predictable across runs, but appears to depend on the event ids of the events being generated.
So I'm not 100% sure, but the above suggests that this last experiment isn't showing that performance depends on the generated event ids so much as that the StepRng's performance perhaps depends on its internally wrapping u64?
That said, there may well be other underlying issues that you're sniffing out. I'd be keen to understand them (for example, the original tests from @gz above show abnormal stdevs for certain queries with the thread_rng).
This is an interesting conundrum!
I tried criterion, but it doesn't seem to be designed for this situation: it runs the function under test many times to average out measurement error, whereas I think we're looking for outliers here (which makes things tricky when measuring with wall-clock time).
Here's some more data, with some more code and my library for collecting a histogram of latencies. To start, the throughput of the rngs on my machine:
```
(just the rng) rng = StepRng
throughput (hz) every 10000000 samples:
827472072.82 913172541.30 1055710809.56 1166339489.00 1394246474.61 1329838871.40 1416656336.88 1469597769.62 1681920847.80 1553388164.83
(just the rng) rng = SmallRng
throughput (hz) every 10000000 samples:
154736385.60 155028746.52 154952874.96 154970884.85 154027135.73 154271787.33 155077028.85 154772107.74 154346295.21 155136471.61
(just the rng) rng = ThreadRng
throughput (hz) every 10000000 samples:
64373935.32 64455675.85 64548490.31 64551702.02 64443024.45 64546927.91 64561078.95 64291833.49 64524073.40 64512418.11
```
StepRng is (not surprisingly) extremely fast. ThreadRng is the only one that seems like it may significantly affect the performance of next_event (from which we've seen <10M events/sec).
Next, I set up a harness that measures each invocation of next_event using Instant::now, and I tested the harness in isolation (considering that these regular timing APIs are not super-precise):
```
(just the time calls)
throughput (hz), every 1000000 samples:
38620352.78 39100175.16 39213764.03 38777100.54 39119294.80 38836459.67 39170563.32 39226134.27 39199098.14 39159442.57
ccdf of latencies (ns) for 10000000 samples:
╭ 0.000000e0 4.000000e1 4.000000e1 4.000000e1 4.000000e1 4.000000e1 2.969600e4 ╮
| 1.000000e0 4.200000e1 4.200000e1 4.200000e1 4.200000e1 4.200000e1 3.072000e4 |
╰ [ 0.25 0.5 0.75 ]--- 0.95 ------- 0.99 ------- 0.999 -------| max ╯
```
We can measure up to ~40M events/sec, and the error is up to ~30 micros, but generally within 40 nanos. How to read the ccdf: the number at the bottom is the fraction, and the values on top are lower and upper bounds approximating the respective quantile. So 99.9% of the time, we saw a latency of up to 42 nanos.
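For anyone who wants to reproduce this without my library, a rough equivalent using the hdrhistogram crate would look something like the sketch below (the real harness and its ccdf output differ):

```rust
use hdrhistogram::Histogram;
use std::time::Instant;

fn main() {
    // Record each call's latency in nanoseconds; 3 significant figures is
    // enough to tell sub-microsecond calls from multi-millisecond outliers.
    let mut hist = Histogram::<u64>::new(3).expect("failed to create histogram");

    let mut x = 0u64;
    for _ in 0..1_000_000u64 {
        let t = Instant::now();
        // Placeholder for the call under test, e.g. generator.next_event().
        x = x.wrapping_add(1);
        // Clamp to 1 ns so a zero reading from a coarse timer is still recordable.
        hist.record(t.elapsed().as_nanos().max(1) as u64).unwrap();
    }
    let _ = x;

    for q in [0.5, 0.95, 0.99, 0.999, 1.0] {
        println!("p{:<6} {} ns", q * 100.0, hist.value_at_quantile(q));
    }
}
```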
Then I tried with next_event and the various rngs:
```
== num_event_generators = 1 ==
rng = StepRng
throughput (hz), every 1000000 samples:
6756914.68 3751443.14 2391624.06 730808.67 1620602.94 3630666.25 6770616.35 3752520.65 922144.92 863078.00
ccdf of latencies (ns) for 10000000 samples:
╭ 8.000000e1 8.000000e1 1.240000e2 3.680000e2 4.160000e2 4.480000e2 2.852127e8 ╮
| 8.400000e1 8.400000e1 1.280000e2 3.840000e2 4.320000e2 4.640000e2 3.019899e8 |
╰ [ 0.25 0.5 0.75 ]--- 0.95 ------- 0.99 ------- 0.999 -------| max ╯
rng = SmallRng
throughput (hz), every 1000000 samples:
4082606.80 4025723.03 4146441.26 4153186.12 4151952.48 4160061.13 4144013.45 4166377.35 4167902.57 4175421.29
ccdf of latencies (ns) for 10000000 samples:
╭ 1.600000e2 2.080000e2 2.080000e2 7.360000e2 8.640000e2 9.280000e2 5.324800e4 ╮
| 1.680000e2 2.160000e2 2.160000e2 7.680000e2 8.960000e2 9.600000e2 5.529600e4 |
╰ [ 0.25 0.5 0.75 ]--- 0.95 ------- 0.99 ------- 0.999 -------| max ╯
rng = ThreadRng
throughput (hz), every 1000000 samples:
1173083.83 1171903.78 1154941.74 1159426.73 1155707.51 1157825.62 1154580.42 1160077.33 1145684.36 1152294.38
ccdf of latencies (ns) for 10000000 samples:
╭ 5.120000e2 5.760000e2 8.320000e2 3.072000e3 3.840000e3 4.096000e3 6.553600e4 ╮
| 5.440000e2 6.080000e2 8.640000e2 3.200000e3 3.968000e3 4.352000e3 6.963200e4 |
╰ [ 0.25 0.5 0.75 ]--- 0.95 ------- 0.99 ------- 0.999 -------| max ╯
```
The results for SmallRng and ThreadRng are consistent and both take up to ~70 micros, but it seems like next_event with StepRng very rarely (<0.1% of the time) takes up to 300 milliseconds, which would explain the inconsistencies between the various batches of next_event in our previous experiments.
To double-check, I collected latencies for batches of 100 calls to StepRng::gen_range(0..1024), and got:
```
ccdf of latencies (ns) for 1000000 samples of batches of 100 calls:
╭ 4.000000e1 4.000000e1 4.000000e1 4.000000e1 8.000000e1 1.240000e2 6.553600e5 ╮
| 4.200000e1 4.200000e1 4.200000e1 4.200000e1 8.400000e1 1.280000e2 6.881280e5 |
╰ [ 0.25 0.5 0.75 ]--- 0.95 ------- 0.99 ------- 0.999 -------| max ╯
```
where the max is 680 micros, which is significantly lower than those 300 milliseconds.
My best guess here is that it isn't that StepRng is slow sometimes, but that it produces sequences that cause next_event to be slow (maybe it makes next_event do more work sometimes?).
That would mean that StepRng isn't the issue per se, but that it's a poor choice for running any benchmarks with the Nexmark generator.
Wow, thanks for the detailed analysis, @absoludity and @utaal. My takeaways:
- The original anomaly reported by @utaal is caused by the use of StepRng, since the Nexmark generator expects an even distribution.
- I think the best way to eliminate throughput fluctuations related to the generator is to pre-generate the entire input sequence before feeding it to DBSP (rough sketch below). The generator is pretty complicated because of the semantics of the Nexmark benchmark, so it may be too much of a challenge to make its performance predictable. Even if we do, generator threads may still interfere with DBSP threads. A better solution is to add a new benchmarking mode where the generator runs to completion, generating a complete event sequence, possibly even partitioned into batches, which are pushed to DBSP during the second phase. This is not suitable for comparative benchmarking against Flink or other Nexmark implementations, but it should be good enough for catching DBSP performance regressions.
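A minimal sketch of the two-phase shape I have in mind, with a hypothetical Event type and a placeholder where a batch would be pushed into the circuit (neither is the real DBSP API):

```rust
use std::time::Instant;

// Hypothetical stand-in for the real Nexmark event type.
#[derive(Clone)]
struct Event(u64);

fn main() {
    // Phase 1: run the generator to completion, outside the timed region.
    // Here the "generator" is just a counter; in the benchmark it would be
    // the NexmarkGenerator producing batches of real events.
    let batch_size: u64 = 100_000;
    let batches: Vec<Vec<Event>> = (0..10u64)
        .map(|b| (0..batch_size).map(|i| Event(b * batch_size + i)).collect())
        .collect();

    // Phase 2: push the pre-generated batches. Only this phase is timed, so
    // generator hiccups can no longer show up as DBSP "regressions".
    let start = Instant::now();
    let mut pushed = 0usize;
    for batch in &batches {
        // Placeholder for feeding one batch into the DBSP circuit.
        pushed += batch.len();
    }
    println!("pushed {pushed} events in {:?}", start.elapsed());
}
```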
I also looked into this the other day, and my best guess is that StepRng's distribution is such that it skews the values returned when we call next_string() or next_extra(), occasionally creating much longer strings that have to be filled with random data and heap-allocated (twice).
If we use an rng that doesn't have a terrible distribution but is still deterministic (like I do in some of our other benchmarks), it should be a non-issue, since every run's event generation will be identical (xoshiro or PCG-64 would be my recommendations).
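For example, with the rand_xoshiro crate (just a sketch; it assumes NexmarkGenerator accepts any RngCore, as the earlier diffs with StepRng and thread_rng suggest), a fixed seed gives an identical but well-distributed sequence on every run:

```rust
use rand::{RngCore, SeedableRng};
use rand_xoshiro::Xoshiro256StarStar;

fn main() {
    // Same seed => identical sequence on every run, but without StepRng's
    // degenerate distribution, so runs stay comparable.
    let mut a = Xoshiro256StarStar::seed_from_u64(42);
    let mut b = Xoshiro256StarStar::seed_from_u64(42);
    assert!((0..1_000).all(|_| a.next_u64() == b.next_u64()));

    // In the bench this would replace StepRng, e.g. (hypothetical):
    // let mut generator =
    //     NexmarkGenerator::new(config, Xoshiro256StarStar::seed_from_u64(42), 0);
}
```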