
Nexmark benchmark has unstable performance

Open gz opened this issue 3 years ago • 9 comments

Here are the worst offenders (stdev > 2), using 10 runs on the CI machine with the same invocation as the ci.bash script:

| query | elapsed mean | elapsed std | elapsed min | elapsed max |
| --- | --- | --- | --- | --- |
| 7 | 29.5857 | 5.72089 | 24.0367 | 38.6235 |
| 13 | 35.6026 | 6.49526 | 28.4508 | 47.6523 |
| 18 | 125.103 | 2.85732 | 119.044 | 127.782 |
| 19 | 108.673 | 4.28322 | 102.351 | 118.69 |

gz avatar Nov 09 '22 18:11 gz

All of the queries, for completeness:

| query | elapsed mean | elapsed std | elapsed min | elapsed max |
| --- | --- | --- | --- | --- |
| 0 | 17.8558 | 0.225236 | 17.704 | 18.4677 |
| 1 | 17.6888 | 0.0603948 | 17.5839 | 17.7779 |
| 2 | 17.9274 | 0.299351 | 17.6874 | 18.668 |
| 3 | 18.1124 | 0.256843 | 17.7715 | 18.706 |
| 4 | 30.6734 | 0.495057 | 29.5523 | 31.2966 |
| 5 | 18.155 | 0.258037 | 17.9003 | 18.8023 |
| 6 | 31.0295 | 0.843263 | 29.7408 | 32.6297 |
| 7 | 29.5857 | 5.72089 | 24.0367 | 38.6235 |
| 8 | 18.6855 | 0.119781 | 18.5606 | 18.9273 |
| 9 | 178.051 | 1.7719 | 176.371 | 181.136 |
| 12 | 19.4357 | 0.208011 | 19.1634 | 19.8744 |
| 13 | 35.6026 | 6.49526 | 28.4508 | 47.6523 |
| 14 | 17.7943 | 0.323826 | 17.5242 | 18.518 |
| 15 | 19.4828 | 0.371601 | 18.9686 | 20.2441 |
| 16 | 138.703 | 0.316535 | 138.17 | 139.102 |
| 17 | 34.0763 | 0.462141 | 32.9988 | 34.5786 |
| 18 | 125.103 | 2.85732 | 119.044 | 127.782 |
| 19 | 108.673 | 4.28322 | 102.351 | 118.69 |
| 20 | 98.7908 | 1.46687 | 96.5369 | 101.601 |
| 21 | 19.9293 | 0.201926 | 19.752 | 20.2957 |
| 22 | 17.7938 | 0.096477 | 17.6597 | 18.006 |

gz avatar Nov 09 '22 18:11 gz

With 1 core, 4 generators, and 20M events, things are fine:

| name | elapsed mean | elapsed std | elapsed min | elapsed max |
| --- | --- | --- | --- | --- |
| 0 | 7.70756 | 0.0121255 | 7.69477 | 7.73248 |
| 1 | 11.2202 | 0.0302091 | 11.1786 | 11.2689 |
| 2 | 7.80124 | 0.0169123 | 7.78096 | 7.82983 |
| 3 | 8.28608 | 0.0256119 | 8.2513 | 8.32688 |
| 4 | 13.9969 | 0.0648266 | 13.9327 | 14.1596 |
| 5 | 8.15482 | 0.014006 | 8.1358 | 8.18143 |
| 6 | 15.1791 | 0.0764544 | 14.987 | 15.2693 |
| 7 | 18.2843 | 0.0545263 | 18.1988 | 18.3711 |
| 8 | 8.3394 | 0.0280169 | 8.31073 | 8.39335 |
| 9 | 57.2127 | 0.185577 | 56.9653 | 57.5052 |
| 12 | 10.9224 | 0.028771 | 10.8843 | 10.993 |
| 13 | 21.2305 | 0.154159 | 20.9139 | 21.3537 |
| 14 | 11.36 | 0.016742 | 11.3369 | 11.3922 |
| 15 | 16.7001 | 0.0136383 | 16.6682 | 16.7181 |
| 16 | 119.59 | 0.221919 | 119.163 | 119.981 |
| 17 | 26.7408 | 0.0783941 | 26.6099 | 26.867 |
| 18 | 30.2228 | 0.213105 | 29.921 | 30.5853 |
| 19 | 38.9964 | 0.235482 | 38.5843 | 39.2574 |
| 20 | 24.9274 | 0.173651 | 24.7215 | 25.2335 |
| 21 | 13.0763 | 0.0133133 | 13.0526 | 13.0995 |
| 22 | 9.52553 | 0.0118613 | 9.50356 | 9.54437 |

gz avatar Nov 09 '22 20:11 gz

It seems like the problems start with any number of cores > 1:

2 cores, 4 generators:

| name | elapsed mean | elapsed std | elapsed min | elapsed max |
| --- | --- | --- | --- | --- |
| 13 | 29.6543 | 5.1912 | 24.5159 | 34.6646 |
| 16 | 137.624 | 2.98478 | 129.609 | 139.561 |

3 cores, 4 generators:

| name | elapsed mean | elapsed std | elapsed min | elapsed max |
| --- | --- | --- | --- | --- |
| 13 | 19.821 | 4.50558 | 14.9431 | 28.5657 |
| 18 | 29.6797 | 2.94843 | 26.961 | 34.9978 |

4 cores, 6 generators:

| name | elapsed mean | elapsed std | elapsed min | elapsed max |
| --- | --- | --- | --- | --- |
| 13 | 29.9042 | 2.44664 | 22.957 | 30.9573 |
| 16 | 115.823 | 2.00773 | 113.875 | 119.531 |

gz avatar Nov 14 '22 20:11 gz

@gz and I wrote a quick bench of the Nexmark generator in isolation and it appears to have inconsistent performance depending on

  1. the setting for num_event_generators
  2. how many events have been generated so far.

In the following, each line is a single-threaded NexmarkGenerator run configured with a different num_event_generators, and each datapoint in a line is the average throughput over 2M generated events: the first datapoint covers the first 2M events, the second covers events 2M to 4M, and so on. The performance seems predictable across runs, but appears to depend on the event ids of the events being generated.

num_event_generators = 1: 4816330.26 1144862.92 2307019.61 5137148.50 893611.64 hz
num_event_generators = 1: 5096182.48 1141302.12 2304576.70 5139885.19 896300.56 hz
num_event_generators = 1: 5077263.25 1133725.47 2285421.86 5086664.58 893150.26 hz
[...]
num_event_generators = 2: 7410578.64 1281839.59 1707982.83 7423415.87 1282788.71 hz
num_event_generators = 2: 7413553.35 1282431.70 1696598.43 7418033.05 1278006.75 hz
num_event_generators = 2: 7410929.90 1281895.98 1707797.73 7416785.99 1282502.35 hz
[...]
num_event_generators = 3: 5084027.86 1134391.37 2286589.83 5106829.56 890007.10 hz
num_event_generators = 3: 5078404.75 1135183.49 2294638.90 5126034.24 898400.62 hz
num_event_generators = 3: 5123281.64 1141378.24 2289420.31 5090503.86 895219.45 hz

The code for this is here: https://github.com/vmware/database-stream-processor/pull/282
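For reference, the measurement loop looks roughly like the following (a sketch only; the exact code is in the PR above, and the generator is abstracted behind a closure so the snippet stands alone — in the actual bench the closure wraps NexmarkGenerator::next_event()):

```rust
// Sketch of the measurement loop only -- the real benchmark is in the PR linked above.
// The generator is abstracted as a closure so this compiles standalone.
use std::time::Instant;

const WINDOW: u64 = 2_000_000; // report average throughput over each 2M-event window

fn measure_throughput<E>(mut next_event: impl FnMut() -> Option<E>, total: u64) {
    let mut produced = 0u64;
    let mut window_start = Instant::now();
    while produced < total {
        if next_event().is_none() {
            break; // generator exhausted early
        }
        produced += 1;
        if produced % WINDOW == 0 {
            let hz = WINDOW as f64 / window_start.elapsed().as_secs_f64();
            print!("{hz:.2} ");
            window_start = Instant::now();
        }
    }
    println!("hz");
}

fn main() {
    // Stand-in event source; the real bench plugs in a NexmarkGenerator configured
    // with the num_event_generators value being tested.
    let mut counter = 0u64;
    measure_throughput(
        || {
            counter += 1;
            Some(counter)
        },
        10_000_000,
    );
}
```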

utaal avatar Jan 27 '23 13:01 utaal

Thanks @utaal and @gz! This does seem odd to me. The Nexmark generator has all kinds of interesting throttling logic, but your benchmark isn't using any of that as it just runs one single-threaded instance of NexmarkGenerator.

@absoludity, I know it's been a while since you worked on this, but I was wondering if you have any idea what could be going on here?

ryzhyk avatar Jan 27 '23 17:01 ryzhyk

Hi there - yes sure, I'd love an excuse to look back here :)

Yes, looking at the pattern, it's consistently periodic within each set (for a specific number of generators), which smells of something that is ticking over... like a non-random generator that is incrementing and wrapping around :)

If I run the same command (thanks for including the code in the linked PR) with this change (i.e. not incrementing the mock random number generator, so the StepRng just holds a constant internally):

diff --git a/benches/nexmark-gen/main.rs b/benches/nexmark-gen/main.rs
index 8247934..b4a43b0 100644
--- a/benches/nexmark-gen/main.rs
+++ b/benches/nexmark-gen/main.rs
@@ -27,7 +27,7 @@ fn main() -> Result<()> {
                 },
                 ..Config::default()
             };
-            let mut generator = NexmarkGenerator::new(config, StepRng::new(0, 1), 0);
+            let mut generator = NexmarkGenerator::new(config, StepRng::new(0, 0), 0);
 
             print!("num_event_generators = {num_event_generators}: ");

I get the following, more consistent results:

cargo bench --bench nexmark-gen --features="with-nexmark"
    Finished bench [optimized + debuginfo] target(s) in 0.15s
     Running benches/nexmark-gen/main.rs (target/release/deps/nexmark_gen-eca28ca9c1692658)
num_event_generators = 1: 4536998.07 4763793.38 4688958.31 4680736.85 4622709.62 hz
num_event_generators = 1: 4851469.37 4844114.19 4682973.02 4552660.53 4560906.51 hz
num_event_generators = 1: 4492503.96 4808102.23 4663286.39 4591567.07 4440258.58 hz
[...]
num_event_generators = 2: 4423671.30 4903945.28 4875307.81 4853178.01 4884811.40 hz
num_event_generators = 2: 4684526.34 4525580.48 4375612.90 4383071.28 4275201.73 hz
num_event_generators = 2: 4582011.08 4503172.84 4825768.50 4944503.63 4853727.82 hz
[...]
num_event_generators = 3: 4937228.55 4760260.35 4799119.30 4843592.55 4763545.15 hz
num_event_generators = 3: 4853271.26 4786734.60 4727096.77 4716394.81 4785356.35 hz
num_event_generators = 3: 4791791.84 4764923.75 4836950.47 4842538.89 4725423.40 hz
[...]

So I predict that if you test the mock StepRng's implementation of gen_range (and/or any other RngCore methods we use here), you'll find that the elapsed time increases as the StepRng's internal u64 increases, until it wraps around again.
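A quick way to check that prediction would be something like this standalone sketch (it uses only the rand crate; the starting values and batch size are arbitrary):

```rust
// Standalone sketch of the proposed check: time batches of gen_range() calls for
// StepRng instances started at different points of the u64 range. If the prediction
// holds, elapsed time should grow with the starting value until the counter wraps.
use rand::rngs::mock::StepRng;
use rand::Rng;
use std::time::Instant;

fn main() {
    for &start in &[0u64, u64::MAX / 4, u64::MAX / 2, u64::MAX - 1_000_000] {
        let mut rng = StepRng::new(start, 1);
        let t = Instant::now();
        let mut acc = 0u64;
        for _ in 0..1_000_000u32 {
            acc = acc.wrapping_add(rng.gen_range(0..1024));
        }
        // Print `acc` so the loop isn't optimized away.
        println!("start = {start:>20}: {:?} (acc = {acc})", t.elapsed());
    }
}
```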

Another test would be to switch to thread_rng, where we expect no periodicity, just evenly distributed elapsed times. So running with:

diff --git a/benches/nexmark-gen/main.rs b/benches/nexmark-gen/main.rs
index 8247934..152d9b3 100644
--- a/benches/nexmark-gen/main.rs
+++ b/benches/nexmark-gen/main.rs
@@ -27,7 +27,7 @@ fn main() -> Result<()> {
                 },
                 ..Config::default()
             };
-            let mut generator = NexmarkGenerator::new(config, StepRng::new(0, 1), 0);
+            let mut generator = NexmarkGenerator::new(config, rand::thread_rng(), 0);
 
             print!("num_event_generators = {num_event_generators}: ");

we get consistent results as well, though note that the rng is now a significant source of CPU time:

cargo bench --bench nexmark-gen --features="with-nexmark"
   Compiling dbsp v0.1.0 (/home/michael/dev/vmware/database-stream-processor)
warning: unused import: `rand::rngs::mock::StepRng`
 --> benches/nexmark-gen/main.rs:6:5
  |
6 | use rand::rngs::mock::StepRng;
  |     ^^^^^^^^^^^^^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

warning: `dbsp` (bench "nexmark-gen") generated 1 warning
    Finished bench [optimized + debuginfo] target(s) in 4.47s
     Running benches/nexmark-gen/main.rs (target/release/deps/nexmark_gen-eca28ca9c1692658)
num_event_generators = 1: 1728821.68 1737448.25 1729492.38 1724809.79 1711621.66 hz
num_event_generators = 1: 1716472.95 1734710.36 1717877.07 1705638.06 1790052.31 hz
num_event_generators = 1: 1802534.80 1783204.56 1719134.73 1778285.74 1741891.40 hz
[...]
num_event_generators = 2: 1717869.98 1688106.30 1687786.88 1683899.78 1679868.42 hz
num_event_generators = 2: 1736636.88 1825340.98 1737837.61 1670449.03 1676080.63 hz
num_event_generators = 2: 1710243.81 1709950.99 1692023.62 1696556.19 1705105.06 hz
[...]
num_event_generators = 3: 1753594.91 1757055.92 1720779.28 1730545.30 1744094.04 hz
num_event_generators = 3: 1661194.03 1754594.32 1740970.11 1722975.19 1622709.08 hz
num_event_generators = 3: 1510273.14 1557155.04 1513515.73 1514255.19 1524858.61 hz
[...]

I've not tried to run this in isolation, but I'd be interested to see the results (my machine has a bunch of other things running). And note that all of the above is with a single core, so it's perhaps not surprising that the frequency isn't increasing with more generators.

StdRng can be used for slightly better performance (or even SmallRng, but it looks like we haven't updated to the latest rand package to use it).

> The performance seems predictable across runs, but seems to be dependent on the event ids of the events being generated.

So I'm not 100% sure, but the above suggests that this last experiment, at least, doesn't indicate that the performance depends on the generated event ids so much as that StepRng's performance perhaps depends on its internally wrapping u64?

That said, I'm sure there may be other underlying issues that you're sniffing out, and I'd be keen to understand them (for example, the original tests from @gz above show abnormal stdevs for certain queries with thread_rng).

absoludity avatar Jan 30 '23 00:01 absoludity

This is an interesting conundrum!

I tried criterion, but it doesn't seem to be designed for this situation: it runs the function under test many times to average out measurement error, whereas I think we're looking for outliers (which makes things tricky when measuring time).

Here's some more data, gathered with some additional code and my library for collecting latency histograms. To start, the throughput of the rngs alone on my machine:

(just the rng) rng = StepRng
throughput (hz) every 10000000 samples:
827472072.82 913172541.30 1055710809.56 1166339489.00 1394246474.61 1329838871.40 1416656336.88 1469597769.62 1681920847.80 1553388164.83

(just the rng) rng = SmallRng
throughput (hz) every 10000000 samples:
154736385.60 155028746.52 154952874.96 154970884.85 154027135.73 154271787.33 155077028.85 154772107.74 154346295.21 155136471.61

(just the rng) rng = ThreadRng
throughput (hz) every 10000000 samples:
64373935.32 64455675.85 64548490.31 64551702.02 64443024.45 64546927.91 64561078.95 64291833.49 64524073.40 64512418.11

StepRng is (not surprisingly) extremely fast. ThreadRng is the only one that seems like it may significantly affect the performance of next_event (from which we've seen <10M events/sec).

Next, I set up a harness that measures each invocation of next_event using Instant::now, and I tested the harness in isolation (considering that these regular timing APIs are not super-precise):

(just the time calls)
throughput (hz), every 1000000 samples:
38620352.78 39100175.16 39213764.03 38777100.54 39119294.80 38836459.67 39170563.32 39226134.27 39199098.14 39159442.57
ccdf of latencies (ns) for 10000000 samples:
╭   0.000000e0    4.000000e1    4.000000e1    4.000000e1    4.000000e1    4.000000e1    2.969600e4  ╮
|   1.000000e0    4.200000e1    4.200000e1    4.200000e1    4.200000e1    4.200000e1    3.072000e4  |
╰ [    0.25          0.5           0.75     ]--- 0.95  ------- 0.99  ------- 0.999 -------| max     ╯

We can measure up to ~40M events/sec, and the error is up to ~30 micros, but generally within 40 nanos. How to read the ccdf: the number at the bottom is the fraction, and the values on top are lower and upper bounds approximating the respective quantile. So 99.9% of the time, we saw a latency of up to 42 nanos.
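The per-call timing looks roughly like this (a simplified sketch: the real harness records every sample into a latency histogram, which I've left out to keep the snippet standalone; the closure stands in for generator.next_event()):

```rust
// Simplified sketch of the per-call timing: only the max latency and the overall rate
// are kept here, whereas the real harness feeds every sample into a histogram.
use std::time::Instant;

fn time_calls<E>(mut call: impl FnMut() -> E, samples: u64) {
    let mut max_ns: u128 = 0;
    let outer = Instant::now();
    for _ in 0..samples {
        let t = Instant::now();
        let _ = call();
        max_ns = max_ns.max(t.elapsed().as_nanos());
    }
    let hz = samples as f64 / outer.elapsed().as_secs_f64();
    println!("throughput: {hz:.2} hz, max latency: {max_ns} ns");
}

fn main() {
    // Calibration run, as in the "(just the time calls)" numbers above: measure the
    // cost of the timing calls themselves.
    time_calls(Instant::now, 1_000_000);
}
```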

Then I tried with next_event and the various rngs:

== num_event_generators = 1 ==
rng = StepRng
throughput (hz), every 1000000 samples:
6756914.68 3751443.14 2391624.06 730808.67 1620602.94 3630666.25 6770616.35 3752520.65 922144.92 863078.00
ccdf of latencies (ns) for 10000000 samples:
╭   8.000000e1    8.000000e1    1.240000e2    3.680000e2    4.160000e2    4.480000e2    2.852127e8  ╮
|   8.400000e1    8.400000e1    1.280000e2    3.840000e2    4.320000e2    4.640000e2    3.019899e8  |
╰ [    0.25          0.5           0.75     ]--- 0.95  ------- 0.99  ------- 0.999 -------| max     ╯

rng = SmallRng
throughput (hz), every 1000000 samples:
4082606.80 4025723.03 4146441.26 4153186.12 4151952.48 4160061.13 4144013.45 4166377.35 4167902.57 4175421.29
ccdf of latencies (ns) for 10000000 samples:
╭   1.600000e2    2.080000e2    2.080000e2    7.360000e2    8.640000e2    9.280000e2    5.324800e4  ╮
|   1.680000e2    2.160000e2    2.160000e2    7.680000e2    8.960000e2    9.600000e2    5.529600e4  |
╰ [    0.25          0.5           0.75     ]--- 0.95  ------- 0.99  ------- 0.999 -------| max     ╯

rng = ThreadRng
throughput (hz), every 1000000 samples:
1173083.83 1171903.78 1154941.74 1159426.73 1155707.51 1157825.62 1154580.42 1160077.33 1145684.36 1152294.38
ccdf of latencies (ns) for 10000000 samples:
╭   5.120000e2    5.760000e2    8.320000e2    3.072000e3    3.840000e3    4.096000e3    6.553600e4  ╮
|   5.440000e2    6.080000e2    8.640000e2    3.200000e3    3.968000e3    4.352000e3    6.963200e4  |
╰ [    0.25          0.5           0.75     ]--- 0.95  ------- 0.99  ------- 0.999 -------| max     ╯

The results for SmallRng and ThreadRng are consistent, and both take at most ~70 microseconds, but it seems like next_event with StepRng very rarely (<0.1% of the time) takes up to 300 milliseconds, which would explain the inconsistencies between the batches of next_event in our previous experiments.

To double-check, I collected latencies for 100 calls to StepRng::gen_range(0..1024), and got:

ccdf of latencies (ns) for 1000000 samples of batches of 100 calls:
╭   4.000000e1    4.000000e1    4.000000e1    4.000000e1    8.000000e1    1.240000e2    6.553600e5  ╮
|   4.200000e1    4.200000e1    4.200000e1    4.200000e1    8.400000e1    1.280000e2    6.881280e5  |
╰ [    0.25          0.5           0.75     ]--- 0.95  ------- 0.99  ------- 0.999 -------| max     ╯

where the max is 680 micros, which is significantly lower than those 300 milliseconds.

My best guess here is that it isn't that StepRng is slow sometimes, but that it produces sequences that cause next_event to be slow (maybe it makes next_event do more work sometimes?). That would mean that StepRng isn't the issue per se, but that it is a poor choice for running any benchmarks with the Nexmark generator.

utaal avatar Jan 30 '23 14:01 utaal

Wow, thanks for the detailed analysis, @absoludity and @utaal. My takeaways:

  • The original anomaly reported by @utaal is caused by the use of StepRng, since the Nexmark generator expects evenly distributed random values.
  • I think the best way to eliminate generator-related throughput fluctuations is to pre-generate the entire input sequence before feeding it to DBSP. The generator is pretty complicated because of the semantics of the Nexmark benchmark, so making its performance predictable may be too much of a challenge, and even if we do, generator threads may still interfere with DBSP threads. A better solution is to add a new benchmarking mode where the generator first runs to completion, producing the complete event sequence, possibly partitioned into batches, which are then pushed to DBSP during a second phase (see the sketch below). This is not suitable for comparative benchmarking against Flink or other Nexmark implementations, but it should be good enough for catching DBSP performance regressions.
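A rough sketch of that two-phase shape (the event type and the DBSP feeding step are left abstract, since the wiring depends on the query being benchmarked; only phase 2 would be timed):

```rust
// Rough sketch of the proposed two-phase benchmarking mode: phase 1 runs the
// generator to completion and partitions the events into batches; phase 2 pushes the
// pre-generated batches into DBSP.
fn pregenerate<E>(mut next_event: impl FnMut() -> Option<E>, batch_size: usize) -> Vec<Vec<E>> {
    let mut batches = Vec::new();
    let mut current = Vec::with_capacity(batch_size);
    while let Some(event) = next_event() {
        current.push(event);
        if current.len() == batch_size {
            batches.push(std::mem::replace(&mut current, Vec::with_capacity(batch_size)));
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

fn main() {
    // Phase 1, with a stand-in generator (the real one would be a NexmarkGenerator
    // capped at the configured number of events).
    let mut remaining = 10u32;
    let batches = pregenerate(
        || {
            if remaining == 0 {
                None
            } else {
                remaining -= 1;
                Some(remaining)
            }
        },
        4,
    );

    // Phase 2: only this loop would be measured; each batch would be pushed into the
    // DBSP circuit here.
    for batch in &batches {
        println!("pushing batch of {} events", batch.len());
    }
}
```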

ryzhyk avatar Jan 30 '23 16:01 ryzhyk

I also looked into this the other day, and my best guess is that StepRng's output is skewed in a way that skews the lengths drawn in next_string() or next_extra(), occasionally creating much longer strings that have to be filled with random data and heap-allocated (twice).

If we use an rng that doesn't have a terrible distribution but is still deterministic (as I do in some of our other benchmarks), this should be a non-issue, since every run's event generation will be identical (xoshiro or PCG-64 would be my recommendations).
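For example, something along these lines (a sketch only; it assumes rand_pcg is added as a dependency, and the seeded rng would be passed to NexmarkGenerator::new in place of StepRng::new(0, 1), matching the constructor shown in the diffs above):

```rust
// Sketch of the suggestion: a deterministic, well-distributed generator such as
// PCG-64 via the rand_pcg crate (assumed to be a dependency), seeded with a fixed
// value so every benchmark run generates the identical event sequence.
use rand::{Rng, SeedableRng};
use rand_pcg::Pcg64;

fn main() {
    let mut rng = Pcg64::seed_from_u64(0);
    // The stream is fully determined by the seed, but unlike StepRng the values are
    // well distributed, so length draws in next_string()/next_extra() stay reasonable.
    let sample: Vec<u32> = (0..5).map(|_| rng.gen_range(0..1024)).collect();
    println!("{sample:?}");
}
```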

Kixiron avatar Jan 30 '23 20:01 Kixiron