BenchmarksGame.jl
WIP: Hacky(!) but faster nbody implementations
Here are a couple of messy implementations that, at least on my laptop with `--cpu-target=core2` (the architecture of the actual BenchmarksGame test machine), beat the current `...simd.jl` by about 40% and 60%:
| impl | 1st run | 2nd run | speedup vs. simd |
|---|---|---|---|
| simd | 5.95s | 5.75s | – |
| unsafe_simd | 7.6s | 4.15s | 40% |
| unsafe_simd_unroll | 7.3s | 3.6s | 60% |
| Rust #7 | – | 3.1s | 85% |
I'd like some feedback before cleaning this up further (and getting too deep into this rabbit hole 🙂), in particular on whether this is helpful for showing off the language, since the code is getting far from idiomatic Julia.
A few caveats:
- Not idiomatic Julia, because we're porting gcc #4 and Rust #7, which liberally use SIMD intrinsics, lay out memory by hand, etc.
- Compilation time is much longer (probably due to using `StaticArrays`), so this awaits Julia AOT compilation (#35) to show real gains; alternatively, I can switch to using `NTuple`s with unsafe stores.
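To illustrate the `NTuple`-plus-unsafe-stores alternative mentioned above, here is a minimal sketch (not code from this PR; the function name and layout are made up for the example). The idea is to keep each body's coordinates in a flat `Vector{Float64}`, load them as a stack-allocated `NTuple{3,Float64}`, and write results back with `unsafe_store!`, avoiding the `StaticArrays` dependency entirely:

```julia
# Hypothetical sketch: operate on 3 coordinates per body stored contiguously
# in a flat Vector{Float64}, using raw pointer loads/stores instead of
# StaticArrays. Names and structure are illustrative only.
function scale_positions!(buf::Vector{Float64}, nbodies::Int, s::Float64)
    p = pointer(buf)
    @inbounds for i in 0:nbodies-1
        # Load one body's coordinates as an NTuple (no heap allocation).
        pos = (unsafe_load(p, 3i + 1),
               unsafe_load(p, 3i + 2),
               unsafe_load(p, 3i + 3))
        pos = pos .* s                      # broadcast over the tuple
        unsafe_store!(p, pos[1], 3i + 1)
        unsafe_store!(p, pos[2], 3i + 2)
        unsafe_store!(p, pos[3], 3i + 3)
    end
    return buf
end

buf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# GC.@preserve keeps `buf` rooted while we hold a raw pointer into it.
GC.@preserve buf scale_positions!(buf, 2, 2.0)
```

The `GC.@preserve` is required for correctness: `pointer(buf)` is only valid while the array is guaranteed not to be collected or moved.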
By the way, the `...unroll.jl` file has a hacky macro that fully unrolls some of the inner loops. This mimics how Rust #7 achieves its speedup: `rustc` is smart enough to automatically unroll the (outer) `for` loops inside `advance`; e.g. `rsqrt` appears 5 times in the disassembled output.
I didn't go all the way to unrolling the stride-2 loop, but I could be persuaded to hack something up just to see how much improvement can be found.
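For readers curious what "a macro that fully unrolls some of the inner loops" can look like, here is a minimal sketch (this is not the actual macro from `...unroll.jl`; the name `@unroll` and the interface are assumptions for illustration). It splices N literal copies of the loop body at macro-expansion time, binding the loop variable to a constant in each copy, which gives the compiler the same constant-index bodies that `rustc` produces by unrolling automatically:

```julia
# Hypothetical unrolling macro: expands a fixed-trip-count for loop into N
# copies of its body with the loop variable bound to the literal 1..N.
macro unroll(n::Int, ex)
    @assert ex.head == :for "expected a for loop"
    itervar = ex.args[1].args[1]   # the loop variable, e.g. `i`
    body    = ex.args[2]           # the loop body
    block = Expr(:block)
    for i in 1:n
        # Each copy sees `itervar` as a compile-time constant.
        push!(block.args, :(let $itervar = $i; $body; end))
    end
    return esc(block)
end

function sum_unrolled(xs::NTuple{5,Float64})
    acc = 0.0
    @unroll 5 for i in 1:5   # the range is ignored; the macro uses N = 5
        acc += xs[i]
    end
    return acc
end
```

With the loop index a literal in each copy, tuple indexing like `xs[i]` needs no bounds logic at runtime, which is the same effect the fully unrolled `advance` loops rely on.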
@KristofferC Thanks again for your help getting intrinsics working.
Very cool. I think we can probably polish it to be more idiomatic Julia over time. I've been trying to get this one faster using SIMD intrinsics on my machine for a while now and mostly failing.
My preference would be to just use `NTuple`s with `unsafe_store!` for now. I think getting AOT compilation working consistently on the benchmarks-game machine might be a ways off. And it might not be accepted at all, depending on whether the maintainer wants to deal with the headache.