Sneaky branch-predictor remembering inputs over benchmark loops
We recently discovered in https://github.com/JuliaLang/julia/pull/29888 a somewhat surprising fact about modern CPUs: the branch predictor is capable of remembering astoundingly long periodic patterns. If we run a benchmark loop, then each evaluation (and possibly each sample) sees an identical branching pattern, which introduces a period. In other words, the sneaky little CPU learns our sample set, and we end up with an emergent defeat device for some of our benchmarks.
At least the findall benchmarks are broken, and probably have been forever. I suspect that the logical-indexing benchmarks are broken as well; see the sketch just below for one way to check. But we should really go over all our benchmarks and figure out which ones are affected. More generally, this is worth keeping in mind for every benchmark we have: indirect branch prediction (the BTB) and the D-cache can play the same trick.
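As a quick susceptibility check (a sketch of my own, not taken from the benchmark suite; the array size and variable names are arbitrary), one can compare a trivially predictable mask against a full-length random one. If the logical-indexing kernel branches on the mask, the two should differ by several cycles per index, just like findall in the demo below; if it is effectively branchless, the input pattern cannot matter.

using BenchmarkTools

x = rand(30_000)
mask_periodic = repeat([true, false], 15_000)   # period-2 pattern: trivially predicted
mask_random   = rand(Bool, 30_000)              # too long to memorize on the CPUs measured below

t_periodic = @belapsed $x[$mask_periodic]
t_random   = @belapsed $x[$mask_random]
@show t_periodic t_random   # a large gap means a branch-based kernel, i.e. susceptible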
The likely fix is to increase the size of the test sets. Long-term it would be cool to parametrize all benchmarks and occasionally (rarely) run regression tests on our regression tests: check that everything has the expected scaling behavior, and explain or fix surprises. Alternatively, we could regenerate new random data between runs, but AFAIK BenchmarkTools has no support for that (it would need a new feature to fix evals/sample = 1); a standalone sketch of the idea follows.
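For illustration, regenerating the data per evaluation with BenchmarkTools' setup keyword and evals pinned to 1 might look roughly like this (a sketch only; whether and how this can be wired into our benchmark machinery is exactly the open question above):

using BenchmarkTools

# Regenerate the input before every evaluation so the predictor never sees
# the same pattern twice. evals=1 is essential: with evals > 1, several
# evaluations share one setup and the periodicity comes right back.
@benchmark findall(m) setup=(m = rand(Bool, 30_000)) evals=1

The catch is that evals=1 only makes sense for benchmarks whose runtime is comfortably above the timer resolution, which is another argument for the simpler fix of larger test sets.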
Demo:
julia> using Printf, BenchmarkTools
julia> function cpu_speed_ghz()
           # if available, use the reported CPU speed instead of the current
           # speed (which may be inaccurate if the CPU is idling)
           cpu_info = Sys.cpu_info()[1]
           m = match(r"([\d\.]+)GHz", cpu_info.model)
           ghz = m ≡ nothing ? cpu_info.speed / 1000 : parse(Float64, m.captures[1])
       end;
julia> const CPU_SPEED_GHZ = cpu_speed_ghz();
julia> const cpu_model = Sys.cpu_info()[1].model;
julia> begin
           N = 30_000
           # baseline: strictly alternating mask (period 2), which any branch
           # predictor handles perfectly
           list = fill(false, N); list[1:2:end] .= true
           bt0 = @belapsed findall($list)
           # worst case: uniform random data, ~50% mispredicted branches
           list .= rand(Bool, N)
           btL = @belapsed findall($list)
           time_to_cycle = 10^9/N * CPU_SPEED_GHZ  # total seconds -> cycles per index
           # random data misses about half of its branches, hence the factor 2
           penalty = 2*(btL - bt0)*time_to_cycle
           @printf("\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n",
                   cpu_model, penalty/CPU_SPEED_GHZ, penalty)
           bt = bt0
           @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n",
                   2, bt*10^6, bt*time_to_cycle, 100*(bt - bt0)*time_to_cycle / penalty)
           # tile a random pattern of length n across the vector and estimate the
           # miss-rate from the slowdown relative to the period-2 baseline
           for n = [100, 500, 1000, 2000, 2500, 3000, 5000, 10_000, 30_000]
               pat = rand(Bool, n)
               for i = 1:n:N list[i:(i+n-1)] .= pat end
               bt = @belapsed findall($list)
               @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n",
                       n, bt*10^6, bt*time_to_cycle, 100*(bt - bt0)*time_to_cycle / penalty)
           end
       end;
yielding:
Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz; branch-miss penalty: 9.9 ns = 19.8 cycles
Period 2: 44.81 us = 2.99 cycles per idx. Miss-rate 0.00%
Period 100: 53.22 us = 3.55 cycles per idx. Miss-rate 2.83%
Period 500: 51.52 us = 3.43 cycles per idx. Miss-rate 2.26%
Period 1000: 51.37 us = 3.42 cycles per idx. Miss-rate 2.21%
Period 2000: 57.85 us = 3.86 cycles per idx. Miss-rate 4.39%
Period 2500: 88.66 us = 5.91 cycles per idx. Miss-rate 14.77%
Period 3000: 121.78 us = 8.12 cycles per idx. Miss-rate 25.93%
Period 5000: 159.28 us = 10.62 cycles per idx. Miss-rate 38.56%
Period 10000: 182.87 us = 12.19 cycles per idx. Miss-rate 46.51%
Period 30000: 192.51 us = 12.83 cycles per idx. Miss-rate 49.75%
This is compatible with Agner Fog's tables. And it is absolutely mind-boggling that the CPU manages to completely defeat patterns of length 2000. If you have a different CPU architecture available (Ryzen? Skylake? Power?), please post similar figures. We should increase test-set sizes above the BHT limits of all realistic current and near-future CPUs.
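To get a feel for where "above the BHT limit" lies on a given machine, a rough check (again just a sketch, not part of the suite) is to sweep N with one fixed random input per size: while N is below the predictor's capacity the reused pattern gets memorized across evaluations and the per-index cost stays low; once N is safely past the limit it saturates at the true random-data cost.

using Printf, BenchmarkTools

for N in (1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000)
    list = rand(Bool, N)          # one fixed random pattern, reused every evaluation
    bt = @belapsed findall($list)
    @printf("N = %6d: %5.2f ns per idx\n", N, bt * 1e9 / N)
end

Judging from the tables in this thread, the per-index cost is still creeping up at period 10,000 and only levels off around 30,000, so test sets sized well beyond that seem prudent.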
JuliaBox's CPU gives a similar cutoff:
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz; branch-miss penalty: 6.3 ns = 14.6 cycles
Period 2: 28.40 us = 2.18 cycles per idx. Miss-rate 0.00%
Period 100: 32.50 us = 2.49 cycles per idx. Miss-rate 2.16%
Period 500: 32.30 us = 2.48 cycles per idx. Miss-rate 2.05%
Period 1000: 32.70 us = 2.51 cycles per idx. Miss-rate 2.26%
Period 2000: 33.40 us = 2.56 cycles per idx. Miss-rate 2.63%
Period 2500: 42.70 us = 3.27 cycles per idx. Miss-rate 7.52%
Period 3000: 71.90 us = 5.51 cycles per idx. Miss-rate 22.87%
Period 5000: 102.20 us = 7.84 cycles per idx. Miss-rate 38.80%
Period 10000: 116.70 us = 8.95 cycles per idx. Miss-rate 46.42%
Period 30000: 123.40 us = 9.46 cycles per idx. Miss-rate 49.95%
I get similar numbers/cutoff on a Skylake:
Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz; branch-miss penalty: 6.4 ns = 18.6 cycles
Period 2: 28.22 us = 2.73 cycles per idx. Miss-rate 0.00%
Period 100: 36.35 us = 3.51 cycles per idx. Miss-rate 4.21%
Period 500: 27.39 us = 2.65 cycles per idx. Miss-rate -0.43%
Period 1000: 29.46 us = 2.85 cycles per idx. Miss-rate 0.64%
Period 2000: 39.03 us = 3.77 cycles per idx. Miss-rate 5.60%
Period 2500: 67.03 us = 6.48 cycles per idx. Miss-rate 20.12%
Period 3000: 82.83 us = 8.01 cycles per idx. Miss-rate 28.31%
Period 5000: 112.05 us = 10.83 cycles per idx. Miss-rate 43.47%
Period 10000: 116.88 us = 11.30 cycles per idx. Miss-rate 45.97%
Period 30000: 124.50 us = 12.04 cycles per idx. Miss-rate 49.92%
These numbers are very consistent between runs; in particular, the anomaly at period 100 (slower than periods 500 and 1000) reproduces every time.
I see a monotonic increase on every run, and the jump point is the same each time.
Intel(R) Celeron(R) CPU G550T @ 2.20GHz; branch-miss penalty: 10.0 ns = 21.9 cycles
Period 2: 65.21 us = 4.78 cycles per idx. Miss-rate 0.00%
Period 100: 67.70 us = 4.96 cycles per idx. Miss-rate 0.84%
Period 500: 68.99 us = 5.06 cycles per idx. Miss-rate 1.27%
Period 1000: 71.57 us = 5.25 cycles per idx. Miss-rate 2.13%
Period 2000: 82.29 us = 6.03 cycles per idx. Miss-rate 5.72%
Period 2500: 98.69 us = 7.24 cycles per idx. Miss-rate 11.21%
Period 3000: 121.89 us = 8.94 cycles per idx. Miss-rate 18.98%
Period 5000: 169.03 us = 12.40 cycles per idx. Miss-rate 34.77%
Period 10000: 198.12 us = 14.53 cycles per idx. Miss-rate 44.51%
Period 30000: 215.11 us = 15.78 cycles per idx. Miss-rate 50.21%