FastBroadcast.jl

Speedup _observed_ with dynamic broadcasting

Open navidcy opened this issue 2 years ago • 5 comments

Despite the claims in the README, I actually get:

julia> b = [1.0];

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  45.125 μs (0 allocations: 0 bytes)

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  18.375 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 6 on 8 virtual cores
Environment:
  JULIA_EDITOR = code

Nov 20 '22 10:11 navidcy

What's the problem? Your fast_foo9 is over 2x faster.

EDIT: oh, even when broadcasting b. Huh.

Nov 20 '22 12:11 chriselrod

FWIW, I got

julia> using FastBroadcast

julia> function fast_foo9(a, b, c, d, e, f, g, h, i)
           @.. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
fast_foo9 (generic function with 1 method)

julia> function foo9(a, b, c, d, e, f, g, h, i)
           @. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
foo9 (generic function with 1 method)

julia> a, b, c, d, e, f, g, h, i = [rand(100, 100, 2) for i in 1:9];

julia> using BenchmarkTools

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  38.674 μs (0 allocations: 0 bytes)

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  83.503 μs (0 allocations: 0 bytes)

julia> b = [1.0];

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  85.732 μs (0 allocations: 0 bytes)

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  30.452 μs (0 allocations: 0 bytes)

So I can reproduce.
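For reference, the speedups implied by these timings can be worked out directly. The snippet below is only arithmetic on the numbers reported above; the variable names are made up for illustration.

```julia
# Timings reported above, in microseconds (Base `@.` vs FastBroadcast `@..`).
t_base_full, t_fast_full = 83.503, 38.674   # b is a full 100×100×2 array
t_base_dyn,  t_fast_dyn  = 85.732, 30.452   # b = [1.0], i.e. dynamic broadcasting

speedup_full = t_base_full / t_fast_full    # roughly 2.2x
speedup_dyn  = t_base_dyn / t_fast_dyn      # roughly 2.8x

println("full-size b: ", round(speedup_full; digits = 2), "x")
println("b = [1.0]:   ", round(speedup_dyn; digits = 2), "x")
```

On this machine, `@..` is ahead in both cases, and by a larger factor in the dynamic case.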

Nov 20 '22 12:11 chriselrod

Comparing 30k evaluations, where b is the full-size array and bs is the small version:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(fast_foo9, 30_000, a, bs, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.60e+09   49.9%  #  3.6 cycles per ns
┌ instructions             2.96e+09   50.0%  #  0.8 insns per cycle
│ branch-instructions      2.27e+08   50.0%  #  7.7% of insns
└ branch-misses            1.85e+06   50.0%  #  0.8% of branch insns
┌ task-clock               1.01e+09  100.0%  #  1.0 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.00e+00  100.0%
┌ L1-dcache-load-misses    6.09e+08   25.0%  # 48.6% of dcache loads
│ L1-dcache-loads          1.25e+09   25.0%
└ L1-icache-load-misses    8.45e+06   25.0%
┌ dTLB-load-misses         1.23e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.25e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(fast_foo9, 30_000, a, b, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.90e+09   49.9%  #  3.6 cycles per ns
┌ instructions             1.43e+09   50.0%  #  0.4 insns per cycle
│ branch-instructions      7.52e+07   50.0%  #  5.3% of insns
└ branch-misses            3.01e+04   50.0%  #  0.0% of branch insns
┌ task-clock               1.09e+09  100.0%  #  1.1 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    6.76e+08   25.0%  # 112.4% of dcache loads
│ L1-dcache-loads          6.02e+08   25.0%
└ L1-icache-load-misses    1.71e+04   25.0%
┌ dTLB-load-misses         4.01e+00   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               6.02e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(foo9, 30_000, a, b, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               9.86e+09   50.0%  #  3.8 cycles per ns
┌ instructions             3.07e+10   50.0%  #  3.1 insns per cycle
│ branch-instructions      6.37e+08   50.0%  #  2.1% of insns
└ branch-misses            6.58e+06   50.0%  #  1.0% of branch insns
┌ task-clock               2.59e+09  100.0%  #  2.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.80e+01  100.0%
┌ L1-dcache-load-misses    6.80e+08   25.0%  #  5.5% of dcache loads
│ L1-dcache-loads          1.24e+10   25.0%
└ L1-icache-load-misses    1.13e+06   25.0%
┌ dTLB-load-misses         7.47e+03   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.24e+10   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(foo9, 30_000, a, bs, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.12e+10   50.0%  #  3.8 cycles per ns
┌ instructions             3.18e+10   50.0%  #  2.8 insns per cycle
│ branch-instructions      8.62e+08   50.0%  #  2.7% of insns
└ branch-misses            9.41e+06   50.0%  #  1.1% of branch insns
┌ task-clock               2.93e+09  100.0%  #  2.9 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.26e+03  100.0%
┌ L1-dcache-load-misses    6.16e+08   25.0%  #  4.3% of dcache loads
│ L1-dcache-loads          1.45e+10   25.0%
└ L1-icache-load-misses    1.59e+07   25.0%
┌ dTLB-load-misses         2.51e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.45e+10   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
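The runs above rely on a `foreachf` helper and a `bs` array that the thread never defines (the `@pstats` macro appears to come from LinuxPerf.jl). A minimal sketch of what they might look like, assuming `foreachf` simply calls the function repeatedly and `bs` is the length-1 vector from the earlier benchmark:

```julia
# Hypothetical helper (not shown in the thread): call f(args...) n times
# so that the perf counters accumulate over many evaluations.
function foreachf(f::F, n::Int, args::Vararg{Any,K}) where {F,K}
    for _ in 1:n
        f(args...)
    end
    return nothing
end

# Assumed inputs: b is full-size, bs is the small (dynamically broadcast) version.
b  = rand(100, 100, 2)
bs = [1.0]
```

Because `fast_foo9` and `foo9` write into `a`, the repeated calls have a visible side effect, so the loop is not something the compiler can simply drop.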

fast_foo9 needs roughly twice as many instructions for the small b (2.96e9 vs. 1.43e9), but performance is dominated by memory bandwidth, so it doesn't really matter. Regular foo9, on the other hand, requires 10-20x the instructions for some reason.
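A rough back-of-the-envelope check of the memory-bound reading, using only the counters reported above. It assumes the arrays are the same 100×100×2 `Float64` arrays as in the earlier benchmark and a 64-byte cache line; neither is stated for this run, and the L1 counters were multiplexed (25% of the time), so treat the numbers as approximate.

```julia
n_calls         = 30_000
bytes_per_array = 100 * 100 * 2 * sizeof(Float64)   # 160_000 bytes
line_bytes      = 64                                # assumed cache-line size
lines_per_array = bytes_per_array ÷ line_bytes      # 2_500 lines

# fast_foo9 touches 9 full-size arrays with full-size b, 8 with the small bs.
expected_full  = 9 * lines_per_array                # cache lines per call
expected_small = 8 * lines_per_array

# Reported L1-dcache-load-misses for fast_foo9, divided by the number of calls.
measured_full  = 6.76e8 / n_calls
measured_small = 6.09e8 / n_calls

# Effective throughput from task-clock (1.09 s and 1.01 s), in GB/s.
gbps_full  = 9 * bytes_per_array * n_calls / 1.09 / 1e9
gbps_small = 8 * bytes_per_array * n_calls / 1.01 / 1e9

println((expected_full, measured_full, gbps_full))
println((expected_small, measured_small, gbps_small))
```

Under these assumptions the measured misses per call land within a few percent of one miss per cache line touched, and the effective throughput is about the same in both runs, which is what a bandwidth-limited kernel would look like.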

Nov 20 '22 13:11 chriselrod

Should we fix this inaccuracy by inserting a sleep call in the dynamic broadcasting branch?

Nov 21 '22 03:11 YingboMa

> Should we fix this inaccuracy by inserting a sleep call in the dynamic broadcasting branch?

Probably better to update the README instead, as the README claims FastBroadcast is slower than base broadcasting for dynamic broadcasts.

Nov 21 '22 05:11 chriselrod