futhark
futhark copied to clipboard
Short circuiting
Is this supposed to become Z3-free?
Yes
Everything has been merged!
This whole thing is a big ugly hairball and I am not proud of it. At some point I want to do a big rewrite: I think simplifying the pass to only work on one short-circuiting opportunity at a time (which is also how the algorithm is described in our paper) will do a lot to clean this code up. However, realistically, that is not going to happen anytime soon.
What I will do is to go over this PR and clean out the most obviously egregious bits (old comments, commented-out code and so on). After that is done, I suggest we merge the PR and hope that I can come back to it in the future. If all else fails, it shouldn't be hard to disable or remove the short-circuiting pass again at a later point.
I'll let you know once I've done some house cleaning. I might save it for next week.
Sounds good! I know we took out the flashiest Z3-based parts, but I'm curious whether we'll see speedups on other benchmarks.
Sounds good! I know we took out the flashiest Z3-based parts, but I'm curious whether we'll see speedups on other benchmarks.
I'll try it out on hpa01
Here's the impact of short-circuiting on the A100 with 1000 runs
./accelerate/canny/canny.fut
data/lena256.in: 0.94x
data/lena512.in: 0.97x
./accelerate/crystal/crystal.fut
#0 ("200i32 30.0f32 5i32 1i32 1.0f32"): 0.91x
#4 ("2000i32 30.0f32 50i32 1i32 1.0f32"): 1.00x
#5 ("4000i32 30.0f32 50i32 1i32 1.0f32"): 1.00x
./accelerate/fft/fft.fut
data/1024x1024.in: 0.96x
data/128x128.in: 1.27x
data/128x512.in: 0.91x
data/256x256.in: 0.96x
data/512x512.in: 1.02x
data/64x256.in: 0.93x
./accelerate/fluid/fluid.fut
benchmarking/medium.in: 0.99x
./accelerate/hashcat/hashcat.fut
rockyou.dataset: 1.00x
./accelerate/kmeans/kmeans.fut
data/k5_n200000.in: 1.01x
data/k5_n50000.in: 0.85x
data/trivial.in: 0.95x
./accelerate/mandelbrot/mandelbrot.fut
#0 ("800i32 600i32 -0.7f32 0.0f32 3.067f32 100i32 16.0f..."): 0.99x
#1 ("1000i32 1000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 0.79x
#2 ("2000i32 2000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 0.95x
#3 ("4000i32 4000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.00x
#4 ("8000i32 8000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.00x
./accelerate/nbody/nbody-bh.fut
data/1000-bodies.in: 0.81x
data/10000-bodies.in: 0.86x
data/100000-bodies.in: 0.92x
./accelerate/nbody/nbody.fut
data/1000-bodies.in: 0.77x
data/10000-bodies.in: 1.00x
data/100000-bodies.in: 1.00x
./accelerate/pagerank/pagerank.fut
data/random_medium.in: 1.00x
data/small.in: 0.97x
./accelerate/ray/trace.fut
#0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 1..."): 0.88x
./accelerate/smoothlife/smoothlife.fut
#0 ("128i32"): 0.95x
#1 ("256i32"): 1.01x
#2 ("512i32"): 1.00x
#3 ("1024i32"): 1.00x
./accelerate/tunnel/tunnel.fut
#0 ("10.0f32 800i32 600i32"): 0.99x
#1 ("10.0f32 1000i32 1000i32"): 1.03x
#2 ("10.0f32 2000i32 2000i32"): 1.00x
#3 ("10.0f32 4000i32 4000i32"): 1.00x
#4 ("10.0f32 8000i32 8000i32"): 1.00x
./finpar/LocVolCalib.fut
LocVolCalib-data/large.in: 1.00x
LocVolCalib-data/medium.in: 1.00x
LocVolCalib-data/small.in: 0.84x
./finpar/OptionPricing.fut
OptionPricing-data/large.in: 1.19x (mem: 0.94x@device)
OptionPricing-data/medium.in: 1.07x (mem: 0.99x@device)
OptionPricing-data/small.in: 1.10x
./jgf/crypt/crypt.fut
crypt-data/medium.in: 0.91x
./jgf/crypt/keys.fut
crypt-data/userkey0.txt: 1.68x (mem: 0.97x@device)
./jgf/series/series.fut
data/10000.in: 0.99x (mem: 0.50x@device)
data/100000.in: 1.00x (mem: 0.50x@device)
data/1000000.in: 1.00x (mem: 0.50x@device)
./misc/bfast/bfast-cloudy.fut
data/africa.in: 1.00x
data/peru.in: 1.00x
data/sahara-cloudy.in: 1.00x
./misc/bfast/bfast.fut
data/sahara.in: 1.00x
./misc/heston/heston32.fut
data/100000_quotes.in: 0.99x
data/10000_quotes.in: 0.92x
data/1062_quotes.in: 1.01x
./misc/heston/heston64.fut
data/100000_quotes.in: 1.00x
data/10000_quotes.in: 1.01x
data/1062_quotes.in: 0.93x
./misc/knn-by-kdtree/buildKDtree.fut
valid-data/kdtree-ppl-32-m-2097152.in: 1.00x
./misc/knn-by-kdtree/driver-knn.fut
256i32 [2097152][7]f32 [10000000][7]f32: 1.03x
./misc/ocean-sim/tke.fut
[200][200][100]f32 [200][200][100]f32 [200][200][100]f32 [...]: 2.27x
data/tke32-small.in: 1.02x
./misc/ocean-sim/tridiag-test.fut:tridagNested
[57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32: 1.00x
data/tridiag32-small.in: 1.08x
./misc/ocean-sim/tridiag-test.fut:tridagNestedConst
[57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32: 1.00x
data/tridiag32-small.in: 1.02x
./misc/ocean-sim/tridiag-test.fut:tridagNestedSeq
[57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32: 1.10x (mem: 0.91x@device)
data/tridiag32-small.in: 1.04x (mem: 0.82x@device)
./misc/ocean-sim/tridiag-test.fut:tridagNestedSeqConst
[57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32: 1.11x
data/tridiag32-small.in: 1.13x
./misc/poseidon/poseidon-bench.fut:arity11
[17600000]u64: 1.00x
./misc/poseidon/poseidon-bench.fut:arity8
[22400000]u64: 1.00x
./misc/radix_sort/radix_sort_blelloch_benchmark.fut
data/radix_sort_100K.in: 0.92x
data/radix_sort_10K.in: 0.72x
data/radix_sort_1M.in: 1.02x
./misc/radix_sort/radix_sort_large.fut
data/radix_sort_100K.in: 0.79x
data/radix_sort_10K.in: 0.99x
data/radix_sort_1M.in: 1.00x
./parboil/histo/histo.fut
data/default.in: 1.00x
data/large.in: 1.00x
./parboil/lbm/lbm.fut
data/120_120_150_ldc.in: 1.00x
./parboil/mri-q/mri-q.fut
data/large.in: 1.00x
data/small.in: 1.00x
./parboil/sgemm/sgemm.fut
data/medium.in: 1.00x
data/small.in: 1.00x
data/tiny.in: 1.03x
./parboil/stencil/stencil.fut
data/default.in: 1.00x
data/small.in: 0.99x
./parboil/tpacf/tpacf.fut
data/large.in: 1.00x
data/medium.in: 1.00x
data/small.in: 1.02x
./pbbs/ray/ray.fut
data/angel.in: 1.00x
data/dragon.in: 1.00x
data/happy.in: 1.00x
./rodinia/backprop/backprop.fut
data/medium.in: 1.00x
data/small.in: 1.07x
./rodinia/bfs/bfs_asympt_ok_but_slow.fut
data/4096nodes.in: 1.03x
data/512nodes_high_edge_variance.in: 1.06x
data/64kn_32e-var-1-256-skew.in: 1.01x
data/graph1MW_6.in: 0.94x
./rodinia/bfs/bfs_filt_padded_fused.fut
data/4096nodes.in: 1.09x
data/512nodes_high_edge_variance.in: 1.07x
data/64kn_32e-var-1-256-skew.in: 1.01x
data/graph1MW_6.in: 1.05x
./rodinia/bfs/bfs_heuristic.fut
data/4096nodes.in: 1.07x
data/512nodes_high_edge_variance.in: 1.09x
data/64kn_32e-var-1-256-skew.in: 1.02x
data/graph1MW_6.in: 1.00x
./rodinia/bfs/bfs_iter_work_ok.fut
data/4096nodes.in: 1.07x
data/512nodes_high_edge_variance.in: 1.04x
data/64kn_32e-var-1-256-skew.in: 1.03x
data/graph1MW_6.in: 1.04x
./rodinia/cfd/cfd.fut
data/fvcorr.domn.097K.toa: 1.00x
data/fvcorr.domn.193K.toa: 1.00x
./rodinia/hotspot/hotspot.fut
data/1024.in: 1.01x
data/512.in: 1.27x
data/64.in: 1.08x
./rodinia/kmeans/kmeans.fut
data/100.in: 1.06x
data/204800.in: 1.05x
data/kdd_cup.in: 1.02x
./rodinia/lavaMD/lavaMD.fut
data/10_boxes.in: 1.01x
data/3_boxes.in: 1.01x
./rodinia/lud/lud-clean.fut
[128][128][32][32]f32: 0.99x
data/b32.in: 0.68x
./rodinia/lud/lud-internal.fut
[128][128][16][16]f32: 1.01x
./rodinia/lud/lud.fut
data/16by16.in: 1.30x
data/2048.in: 0.98x
data/256.in: 0.99x
data/512.in: 0.98x
data/64.in: 0.98x
./rodinia/myocyte/myocyte.fut
data/medium.in: 1.26x
data/small.in: 1.21x
./rodinia/nn/nn.fut
data/medium.in: 1.18x
./rodinia/nw/nw-cosmin.fut:nw_flat
#3 ("3i64 10i32 [4i32, 2i32, 4i32, 9i32, 2i32, 1i32, 7i..."): 0.89x
16i64 10i32 [2362369]i32 [2362369]i32: 0.59x
32i64 10i32 [2362369]i32 [2362369]i32: 0.64x
64i64 10i32 [2362369]i32 [2362369]i32: 0.63x
./rodinia/nw/nw.fut
data/large.in: 0.70x
data/medium.in: 0.69x
data/small.in: 0.71x
data/tiny.in: 0.72x
./rodinia/particlefilter/particlefilter.fut
data/128_128_10_image_10000_particles.in: 1.03x
data/128_128_10_image_400000_particles.in: 1.00x
./rodinia/pathfinder/pathfinder.fut
data/medium.in: 0.82x
./rodinia/srad/srad.fut
data/image.in: 1.17x
./rsbench/rsbench.fut
data/large.in: 1.00x
data/small.in: 1.00x
./xsbench/xsbench.fut
data/large.in: 1.00x
data/small.in: 1.00x
The impact on OptionPricing, myocyte and nn are reproducible. I'm not sure what's going on with NW...
@athas I have performed some superficial cleaning. It's still not pretty, but I think it's important to get it merged before master diverges even further. Please take a look and let me know if there are any egregious things you want me to change.
Many of the tests have no input/output pairs and no structure stanza. I get that means they check if the compiler crashes, but my experience is that such tests are pretty fragile. I suggest adding structure unless absence of crashes is a major feat for those tests.
I have added structure tests to all new tests.
Differences in compile time seem minor and will hopefully be offset by future improvements elsewhere. Let's just merge it and see what happens.
@Munksgaard Do you fancy writing an approachable blog post on futhark-lang.org for this optimisation?
@Munksgaard Do you fancy writing an approachable blog post on futhark-lang.org for this optimisation?
Yes. Hopefully I'll be able to get around to it later this week.