Nexmark benchmark has unstable performance
Here are the worst offenders (stdev > 2) from 10 runs on the CI machine, using the same invocation as the ci.bash script:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 7 | 29.5857 | 5.72089 | 24.0367 | 38.6235 |
| 13 | 35.6026 | 6.49526 | 28.4508 | 47.6523 |
| 18 | 125.103 | 2.85732 | 119.044 | 127.782 |
| 19 | 108.673 | 4.28322 | 102.351 | 118.69 |
All of the queries, for completeness:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 0 | 17.8558 | 0.225236 | 17.704 | 18.4677 |
| 1 | 17.6888 | 0.0603948 | 17.5839 | 17.7779 |
| 2 | 17.9274 | 0.299351 | 17.6874 | 18.668 |
| 3 | 18.1124 | 0.256843 | 17.7715 | 18.706 |
| 4 | 30.6734 | 0.495057 | 29.5523 | 31.2966 |
| 5 | 18.155 | 0.258037 | 17.9003 | 18.8023 |
| 6 | 31.0295 | 0.843263 | 29.7408 | 32.6297 |
| 7 | 29.5857 | 5.72089 | 24.0367 | 38.6235 |
| 8 | 18.6855 | 0.119781 | 18.5606 | 18.9273 |
| 9 | 178.051 | 1.7719 | 176.371 | 181.136 |
| 12 | 19.4357 | 0.208011 | 19.1634 | 19.8744 |
| 13 | 35.6026 | 6.49526 | 28.4508 | 47.6523 |
| 14 | 17.7943 | 0.323826 | 17.5242 | 18.518 |
| 15 | 19.4828 | 0.371601 | 18.9686 | 20.2441 |
| 16 | 138.703 | 0.316535 | 138.17 | 139.102 |
| 17 | 34.0763 | 0.462141 | 32.9988 | 34.5786 |
| 18 | 125.103 | 2.85732 | 119.044 | 127.782 |
| 19 | 108.673 | 4.28322 | 102.351 | 118.69 |
| 20 | 98.7908 | 1.46687 | 96.5369 | 101.601 |
| 21 | 19.9293 | 0.201926 | 19.752 | 20.2957 |
| 22 | 17.7938 | 0.096477 | 17.6597 | 18.006 |
With 1 core, 4 generators, 20M events, things are fine:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 0 | 7.70756 | 0.0121255 | 7.69477 | 7.73248 |
| 1 | 11.2202 | 0.0302091 | 11.1786 | 11.2689 |
| 2 | 7.80124 | 0.0169123 | 7.78096 | 7.82983 |
| 3 | 8.28608 | 0.0256119 | 8.2513 | 8.32688 |
| 4 | 13.9969 | 0.0648266 | 13.9327 | 14.1596 |
| 5 | 8.15482 | 0.014006 | 8.1358 | 8.18143 |
| 6 | 15.1791 | 0.0764544 | 14.987 | 15.2693 |
| 7 | 18.2843 | 0.0545263 | 18.1988 | 18.3711 |
| 8 | 8.3394 | 0.0280169 | 8.31073 | 8.39335 |
| 9 | 57.2127 | 0.185577 | 56.9653 | 57.5052 |
| 12 | 10.9224 | 0.028771 | 10.8843 | 10.993 |
| 13 | 21.2305 | 0.154159 | 20.9139 | 21.3537 |
| 14 | 11.36 | 0.016742 | 11.3369 | 11.3922 |
| 15 | 16.7001 | 0.0136383 | 16.6682 | 16.7181 |
| 16 | 119.59 | 0.221919 | 119.163 | 119.981 |
| 17 | 26.7408 | 0.0783941 | 26.6099 | 26.867 |
| 18 | 30.2228 | 0.213105 | 29.921 | 30.5853 |
| 19 | 38.9964 | 0.235482 | 38.5843 | 39.2574 |
| 20 | 24.9274 | 0.173651 | 24.7215 | 25.2335 |
| 21 | 13.0763 | 0.0133133 | 13.0526 | 13.0995 |
| 22 | 9.52553 | 0.0118613 | 9.50356 | 9.54437 |
It seems like problems start with any number of cores > 1.
2 cores, 4 generators:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 13 | 29.6543 | 5.1912 | 24.5159 | 34.6646 |
| 16 | 137.624 | 2.98478 | 129.609 | 139.561 |
3 cores, 4 generators:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 13 | 19.821 | 4.50558 | 14.9431 | 28.5657 |
| 18 | 29.6797 | 2.94843 | 26.961 | 34.9978 |
4 cores, 6 generators:
| query | elapsed mean | elapsed std | elapsed min | elapsed max |
|---|---|---|---|---|
| 13 | 29.9042 | 2.44664 | 22.957 | 30.9573 |
| 16 | 115.823 | 2.00773 | 113.875 | 119.531 |
@gz and I wrote a quick bench of the Nexmark generator in isolation, and it appears to have inconsistent performance depending on:
- the setting for num_event_generators
- how many events have been generated so far.
In the following, each line is a single-threaded NexmarkGenerator run configured with a different num_event_generators, and each datapoint in a line is the average throughput over 2M generated events: the first datapoint covers the first 2M events, the second covers events 2M to 4M, and so on.
The performance is predictable across runs, but appears to depend on the event ids of the events being generated.
```
num_event_generators = 1: 4816330.26 1144862.92 2307019.61 5137148.50 893611.64 hz
num_event_generators = 1: 5096182.48 1141302.12 2304576.70 5139885.19 896300.56 hz
num_event_generators = 1: 5077263.25 1133725.47 2285421.86 5086664.58 893150.26 hz
[...]
num_event_generators = 2: 7410578.64 1281839.59 1707982.83 7423415.87 1282788.71 hz
num_event_generators = 2: 7413553.35 1282431.70 1696598.43 7418033.05 1278006.75 hz
num_event_generators = 2: 7410929.90 1281895.98 1707797.73 7416785.99 1282502.35 hz
[...]
num_event_generators = 3: 5084027.86 1134391.37 2286589.83 5106829.56 890007.10 hz
num_event_generators = 3: 5078404.75 1135183.49 2294638.90 5126034.24 898400.62 hz
num_event_generators = 3: 5123281.64 1141378.24 2289420.31 5090503.86 895219.45 hz
```
The code for this is here: https://github.com/vmware/database-stream-processor/pull/282
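For reference, the measurement boils down to something like the sketch below (this is not the actual code from that PR; the produce_event closure is a placeholder standing in for NexmarkGenerator::next_event):

```rust
use std::time::Instant;

/// Print the average throughput (events/sec) for each `window`-sized chunk
/// of `total` events produced by `produce_event`.
fn windowed_throughput(total: u64, window: u64, mut produce_event: impl FnMut()) {
    let mut produced = 0u64;
    let mut window_start = Instant::now();
    while produced < total {
        produce_event();
        produced += 1;
        if produced % window == 0 {
            let hz = window as f64 / window_start.elapsed().as_secs_f64();
            print!("{hz:.2} ");
            window_start = Instant::now();
        }
    }
    println!("hz");
}

fn main() {
    // Placeholder workload; the real harness calls generator.next_event()
    // on a NexmarkGenerator configured with the given num_event_generators.
    let mut x = 0u64;
    windowed_throughput(10_000_000, 2_000_000, || x = x.wrapping_add(1));
}
```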
Thanks @utaal and @gz! This does seem odd to me. The Nexmark generator has all kinds of interesting throttling logic, but your benchmark isn't using any of that as it just runs one single-threaded instance of NexmarkGenerator.
@absoludity, I know it's been a while since you worked on this, but I was wondering if you have any idea what could be going on here?
Hi there - yes sure, I'd love an excuse to look back here :)
Yes, looking at the pattern, it's consistently periodic within each set (for a specific number of generators), which smells of something that is ticking over... like a non-random generator that is incrementing and wrapping around :)
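For context, a minimal sketch of what the mock rng does (assuming rand 0.8's rngs::mock::StepRng): there is no randomness at all, it just returns its current u64 and then adds a fixed increment, wrapping on overflow.

```rust
use rand::{rngs::mock::StepRng, RngCore};

fn main() {
    // StepRng::new(initial, increment): next_u64() returns the current
    // value, then adds `increment` (wrapping) for the next call.
    let mut rng = StepRng::new(0, 1);
    let first: Vec<u64> = (0..5).map(|_| rng.next_u64()).collect();
    assert_eq!(first, vec![0, 1, 2, 3, 4]);

    // With an increment of 0, the "random" value never changes.
    let mut constant = StepRng::new(0, 0);
    assert!((0..5).all(|_| constant.next_u64() == 0));
}
```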
If I run the same command (thanks for including the code in the linked PR) with this change (i.e. not incrementing the mock random number generator, so the StepRng just holds a constant internally):
```diff
diff --git a/benches/nexmark-gen/main.rs b/benches/nexmark-gen/main.rs
index 8247934..b4a43b0 100644
--- a/benches/nexmark-gen/main.rs
+++ b/benches/nexmark-gen/main.rs
@@ -27,7 +27,7 @@ fn main() -> Result<()> {
},
..Config::default()
};
- let mut generator = NexmarkGenerator::new(config, StepRng::new(0, 1), 0);
+ let mut generator = NexmarkGenerator::new(config, StepRng::new(0, 0), 0);
print!("num_event_generators = {num_event_generators}: ");
I get the following, more consistent results:
```
cargo bench --bench nexmark-gen --features="with-nexmark"
Finished bench [optimized + debuginfo] target(s) in 0.15s
Running benches/nexmark-gen/main.rs (target/release/deps/nexmark_gen-eca28ca9c1692658)
num_event_generators = 1: 4536998.07 4763793.38 4688958.31 4680736.85 4622709.62 hz
num_event_generators = 1: 4851469.37 4844114.19 4682973.02 4552660.53 4560906.51 hz
num_event_generators = 1: 4492503.96 4808102.23 4663286.39 4591567.07 4440258.58 hz
[...]
num_event_generators = 2: 4423671.30 4903945.28 4875307.81 4853178.01 4884811.40 hz
num_event_generators = 2: 4684526.34 4525580.48 4375612.90 4383071.28 4275201.73 hz
num_event_generators = 2: 4582011.08 4503172.84 4825768.50 4944503.63 4853727.82 hz
[...]
num_event_generators = 3: 4937228.55 4760260.35 4799119.30 4843592.55 4763545.15 hz
num_event_generators = 3: 4853271.26 4786734.60 4727096.77 4716394.81 4785356.35 hz
num_event_generators = 3: 4791791.84 4764923.75 4836950.47 4842538.89 4725423.40 hz
[...]
```
So I predict that if you test the mock StepRng's implementation of gen_range (and/or any other RngCore methods we use here), you'll find that the elapsed time increases as the StepRng's internal u64 increases, until it wraps around again.
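One quick way to check that prediction would be something along these lines (my own sketch, assuming rand 0.8's gen_range(range) signature; not code from the repo): time a fixed number of gen_range calls for StepRngs started at different internal values.

```rust
use rand::{rngs::mock::StepRng, Rng};
use std::time::Instant;

fn main() {
    // If the prediction holds, the elapsed time should track the starting
    // value of the StepRng's internal u64.
    for start in [0u64, u64::MAX / 4, u64::MAX / 2, u64::MAX - 1_000_000] {
        let mut rng = StepRng::new(start, 1);
        let t = Instant::now();
        let mut checksum = 0u64;
        for _ in 0..1_000_000u64 {
            checksum = checksum.wrapping_add(rng.gen_range(0..1024));
        }
        println!("start={start:>20} elapsed={:?} (checksum {checksum})", t.elapsed());
    }
}
```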
Another test would be to switch to the thread_rng, where we'd expect no period, just evenly distributed elapsed times. So running with:
```diff
diff --git a/benches/nexmark-gen/main.rs b/benches/nexmark-gen/main.rs
index 8247934..152d9b3 100644
--- a/benches/nexmark-gen/main.rs
+++ b/benches/nexmark-gen/main.rs
@@ -27,7 +27,7 @@ fn main() -> Result<()> {
},
..Config::default()
};
- let mut generator = NexmarkGenerator::new(config, StepRng::new(0, 1), 0);
+ let mut generator = NexmarkGenerator::new(config, rand::thread_rng(), 0);
print!("num_event_generators = {num_event_generators}: ");
we get consistent results as well, though note that the rng is now a significant consumer of CPU time:
```
cargo bench --bench nexmark-gen --features="with-nexmark"
Compiling dbsp v0.1.0 (/home/michael/dev/vmware/database-stream-processor)
warning: unused import: `rand::rngs::mock::StepRng`
--> benches/nexmark-gen/main.rs:6:5
|
6 | use rand::rngs::mock::StepRng;
| ^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: `dbsp` (bench "nexmark-gen") generated 1 warning
Finished bench [optimized + debuginfo] target(s) in 4.47s
Running benches/nexmark-gen/main.rs (target/release/deps/nexmark_gen-eca28ca9c1692658)
num_event_generators = 1: 1728821.68 1737448.25 1729492.38 1724809.79 1711621.66 hz
num_event_generators = 1: 1716472.95 1734710.36 1717877.07 1705638.06 1790052.31 hz
num_event_generators = 1: 1802534.80 1783204.56 1719134.73 1778285.74 1741891.40 hz
[...]
num_event_generators = 2: 1717869.98 1688106.30 1687786.88 1683899.78 1679868.42 hz
num_event_generators = 2: 1736636.88 1825340.98 1737837.61 1670449.03 1676080.63 hz
num_event_generators = 2: 1710243.81 1709950.99 1692023.62 1696556.19 1705105.06 hz
[...]
num_event_generators = 3: 1753594.91 1757055.92 1720779.28 1730545.30 1744094.04 hz
num_event_generators = 3: 1661194.03 1754594.32 1740970.11 1722975.19 1622709.08 hz
num_event_generators = 3: 1510273.14 1557155.04 1513515.73 1514255.19 1524858.61 hz
[...]
```
I've not tried to run this in isolation, but would be interested to see the results (as my machine has a bunch of other things running). Note that all the above is with a single core, so it's perhaps not surprising that the frequency isn't increasing with more generators.
StdRng can be used for slightly better performance (or even SmallRng, but it looks like we've not updated to the latest rand package to use it).
> The performance is predictable across runs, but appears to depend on the event ids of the events being generated.
So I'm not 100% sure, but the above suggests that this last experiment isn't showing that performance depends on the generated event ids so much as that the StepRng's performance perhaps depends on its internally wrapping u64?
That said, there may well be other underlying issues that you're sniffing out. I'd be keen to understand them (for example, the original tests from @gz above show abnormal stdevs for certain queries with the thread_rng).
This is an interesting conundrum!
I tried criterion, but it doesn't seem to be designed for this situation: it runs the function under test many times to average out measurement error, whereas I think we're looking for outliers here (which makes things tricky when measuring with wall-clock time).
Here's some more data, with some more code and my library for collecting a histogram of latencies. To start, the throughput of the rngs on my machine:
```
(just the rng) rng = StepRng
throughput (hz) every 10000000 samples:
827472072.82 913172541.30 1055710809.56 1166339489.00 1394246474.61 1329838871.40 1416656336.88 1469597769.62 1681920847.80 1553388164.83
(just the rng) rng = SmallRng
throughput (hz) every 10000000 samples:
154736385.60 155028746.52 154952874.96 154970884.85 154027135.73 154271787.33 155077028.85 154772107.74 154346295.21 155136471.61
(just the rng) rng = ThreadRng
throughput (hz) every 10000000 samples:
64373935.32 64455675.85 64548490.31 64551702.02 64443024.45 64546927.91 64561078.95 64291833.49 64524073.40 64512418.11
```
StepRng is (not surprisingly) extremely fast. ThreadRng is the only one that seems like it may significantly affect the performance of next_event (from which we've seen <10M events/sec).
Next, I set up a harness that measures each invocation of next_event using Instant::now, and I tested the harness in isolation (considering that these regular timing APIs are not super-precise):
```
(just the time calls)
throughput (hz), every 1000000 samples:
38620352.78 39100175.16 39213764.03 38777100.54 39119294.80 38836459.67 39170563.32 39226134.27 39199098.14 39159442.57
ccdf of latencies (ns) for 10000000 samples:
╭ 0.000000e0 4.000000e1 4.000000e1 4.000000e1 4.000000e1 4.000000e1 2.969600e4 ╮
| 1.000000e0 4.200000e1 4.200000e1 4.200000e1 4.200000e1 4.200000e1 3.072000e4 |
╰ [ 0.25 0.5 0.75 ]--- 0.95 ------- 0.99 ------- 0.999 -------| max ╯
```
We can measure up to ~40M events/sec, and the error is up to ~30 micros, but generally within 40 nanos. How to read the ccdf: the number at the bottom is the fraction, and the values on top are lower and upper bounds approximating the respective quantile. So 99.9% of the time, we saw a latency of up to 42 nanos.
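For anyone who wants to reproduce this without my library, a rough equivalent using the hdrhistogram crate would look something like the sketch below (the real harness and its ccdf output differ):

```rust
use hdrhistogram::Histogram;
use std::time::Instant;

fn main() {
    // Record each call's latency in nanoseconds; 3 significant figures is
    // enough to tell sub-microsecond calls from multi-millisecond outliers.
    let mut hist = Histogram::<u64>::new(3).expect("failed to create histogram");

    let mut x = 0u64;
    for _ in 0..1_000_000u64 {
        let t = Instant::now();
        // Placeholder for the call under test, e.g. generator.next_event().
        x = x.wrapping_add(1);
        // Clamp to 1 ns so a zero reading from a coarse timer is still recordable.
        hist.record(t.elapsed().as_nanos().max(1) as u64).unwrap();
    }
    let _ = x;

    for q in [0.5, 0.95, 0.99, 0.999, 1.0] {
        println!("p{:<6} {} ns", q * 100.0, hist.value_at_quantile(q));
    }
}
```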
Then I tried with next_event and the various rngs:
```
== num_event_generators = 1 ==
rng = StepRng
throughput (hz), every 1000000 samples:
6756914.68 3751443.14 2391624.06 730808.67 1620602.94 3630666.25 6770616.35 3752520.65 922144.92 863078.00
ccdf of latencies (ns) for 10000000 samples:
╭ 8.000000e1 8.000000e1 1.240000e2 3.680000e2 4.160000e2 4.480000e2 2.852127e8 ╮
| 8.400000e1 8.400000e1 1.280000e2 3.840000e2 4.320000e2 4.640000e2 3.019899e8 |
╰ [ 0.25 0.5 0.75 ]--- 0.95 ------- 0.99 ------- 0.999 -------| max ╯
rng = SmallRng
throughput (hz), every 1000000 samples:
4082606.80 4025723.03 4146441.26 4153186.12 4151952.48 4160061.13 4144013.45 4166377.35 4167902.57 4175421.29
ccdf of latencies (ns) for 10000000 samples:
╭ 1.600000e2 2.080000e2 2.080000e2 7.360000e2 8.640000e2 9.280000e2 5.324800e4 ╮
| 1.680000e2 2.160000e2 2.160000e2 7.680000e2 8.960000e2 9.600000e2 5.529600e4 |
╰ [ 0.25 0.5 0.75 ]--- 0.95 ------- 0.99 ------- 0.999 -------| max ╯
rng = ThreadRng
throughput (hz), every 1000000 samples:
1173083.83 1171903.78 1154941.74 1159426.73 1155707.51 1157825.62 1154580.42 1160077.33 1145684.36 1152294.38
ccdf of latencies (ns) for 10000000 samples:
╭ 5.120000e2 5.760000e2 8.320000e2 3.072000e3 3.840000e3 4.096000e3 6.553600e4 ╮
| 5.440000e2 6.080000e2 8.640000e2 3.200000e3 3.968000e3 4.352000e3 6.963200e4 |
╰ [ 0.25 0.5 0.75 ]--- 0.95 ------- 0.99 ------- 0.999 -------| max ╯
```
The results for SmallRng and ThreadRng are consistent and both take up to ~70 micros, but it seems like next_event with StepRng very rarely (<0.1% of the time) takes up to 300 milliseconds, which would explain the inconsistencies between the various batches of next_event in our previous experiments.
To double-check, I collected latencies for batches of 100 calls to StepRng::gen_range(0..1024), and got:
```
ccdf of latencies (ns) for 1000000 samples of batches of 100 calls:
╭ 4.000000e1 4.000000e1 4.000000e1 4.000000e1 8.000000e1 1.240000e2 6.553600e5 ╮
| 4.200000e1 4.200000e1 4.200000e1 4.200000e1 8.400000e1 1.280000e2 6.881280e5 |
╰ [ 0.25 0.5 0.75 ]--- 0.95 ------- 0.99 ------- 0.999 -------| max ╯
```
where the max is 680 micros, which is significantly lower than those 300 milliseconds.
My best guess here is that it isn't that StepRng is slow sometimes, but that it produces sequences that cause next_event to be slow (maybe it makes next_event do more work sometimes?).
That would mean that StepRng isn't the issue per se, but that it's a poor choice for running any benchmarks with the Nexmark generator.
Wow, thanks for the detailed analysis, @absoludity and @utaal. My takeaways:
- The original anomaly reported by @utaal is caused by the use of StepRng, since the Nexmark generator expects an even distribution.
- I think the best way to eliminate throughput fluctuations related to the generator is to pre-generate the entire input sequence before feeding it to DBSP (rough sketch below). The generator is pretty complicated because of the semantics of the Nexmark benchmark, so it may be too much of a challenge to make its performance predictable. Even if we do, generator threads may still interfere with DBSP threads. A better solution is to add a new benchmarking mode where the generator runs to completion, generating a complete event sequence, possibly even partitioned into batches, which are pushed to DBSP during the second phase. This is not suitable for comparative benchmarking against Flink or other Nexmark implementations, but it should be good enough for catching DBSP performance regressions.
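A minimal sketch of the two-phase shape I have in mind, with a hypothetical Event type and a placeholder where a batch would be pushed into the circuit (neither is the real DBSP API):

```rust
use std::time::Instant;

// Hypothetical stand-in for the real Nexmark event type.
#[derive(Clone)]
struct Event(u64);

fn main() {
    // Phase 1: run the generator to completion, outside the timed region.
    // Here the "generator" is just a counter; in the benchmark it would be
    // the NexmarkGenerator producing batches of real events.
    let batch_size: u64 = 100_000;
    let batches: Vec<Vec<Event>> = (0..10u64)
        .map(|b| (0..batch_size).map(|i| Event(b * batch_size + i)).collect())
        .collect();

    // Phase 2: push the pre-generated batches. Only this phase is timed, so
    // generator hiccups can no longer show up as DBSP "regressions".
    let start = Instant::now();
    let mut pushed = 0usize;
    for batch in &batches {
        // Placeholder for feeding one batch into the DBSP circuit.
        pushed += batch.len();
    }
    println!("pushed {pushed} events in {:?}", start.elapsed());
}
```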
I also looked into this the other day, and my best guess is that StepRng's distribution is such that it skews the values returned when we call next_string() or next_extra(), occasionally creating much longer strings that have to be filled with random data and heap-allocated (twice).
If we use an rng that doesn't have a terrible distribution but is still deterministic (like I do in some of our other benchmarks), it should be a non-issue, since every run's event generation will be identical (xoshiro or PCG-64 would be my recommendations).
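For example, with the rand_xoshiro crate (just a sketch; it assumes NexmarkGenerator accepts any RngCore, as the earlier diffs with StepRng and thread_rng suggest), a fixed seed gives an identical but well-distributed sequence on every run:

```rust
use rand::{RngCore, SeedableRng};
use rand_xoshiro::Xoshiro256StarStar;

fn main() {
    // Same seed => identical sequence on every run, but without StepRng's
    // degenerate distribution, so runs stay comparable.
    let mut a = Xoshiro256StarStar::seed_from_u64(42);
    let mut b = Xoshiro256StarStar::seed_from_u64(42);
    assert!((0..1_000).all(|_| a.next_u64() == b.next_u64()));

    // In the bench this would replace StepRng, e.g. (hypothetical):
    // let mut generator =
    //     NexmarkGenerator::new(config, Xoshiro256StarStar::seed_from_u64(42), 0);
}
```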