pythran
xsimd makes a function slower
I'm surprised that xsimd makes a function slower, and that the implementation with explicit loops is faster:
In [1]: run microbench_simd.py
In [2]: %timeit advance_positions(positions, velocities, accelerations, time_step)
4.16 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit advance_positions_nosimd(positions, velocities, accelerations, time_step)
1.7 µs ± 6.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit advance_positions_simd(positions, velocities, accelerations, time_step)
1.95 µs ± 11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit advance_positions_loops(positions, velocities, accelerations, time_step)
1.45 µs ± 3.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Is it expected?
The full code:
import numpy as np

from transonic import jit, wait_for_all_extensions


def advance_positions(positions, velocities, accelerations, time_step):
    positions += time_step * velocities + 0.5 * time_step ** 2 * accelerations


@jit(native=True, xsimd=False)
def advance_positions_nosimd(positions, velocities, accelerations, time_step):
    positions += time_step * velocities + 0.5 * time_step ** 2 * accelerations


@jit(native=True, xsimd=True)
def advance_positions_simd(positions, velocities, accelerations, time_step):
    positions += time_step * velocities + 0.5 * time_step ** 2 * accelerations


@jit(native=True, xsimd=False)
def advance_positions_loops(positions, velocities, accelerations, time_step):
    n0, n1 = positions.shape
    for i0 in range(n0):
        for i1 in range(n1):
            positions[i0, i1] += (
                time_step * velocities[i0, i1]
                + 0.5 * time_step ** 2 * accelerations[i0, i1]
            )


shape = 256, 3
positions = np.zeros(shape)
velocities = np.zeros_like(positions)
accelerations = np.zeros_like(positions)
time_step = 1.0

advance_positions_nosimd(positions, velocities, accelerations, time_step)
advance_positions_simd(positions, velocities, accelerations, time_step)
advance_positions_loops(positions, velocities, accelerations, time_step)

wait_for_all_extensions()
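For reference, the `%timeit` numbers can also be reproduced outside IPython with the standard `timeit` module. A minimal sketch for the undecorated NumPy version (the non-zero inputs and call count are arbitrary choices for illustration):

```python
import timeit

import numpy as np


def advance_positions(positions, velocities, accelerations, time_step):
    positions += time_step * velocities + 0.5 * time_step ** 2 * accelerations


shape = 256, 3
positions = np.zeros(shape)
velocities = np.ones_like(positions)
accelerations = np.ones_like(positions)
time_step = 1.0

n_calls = 10_000
# timeit.timeit calls the statement exactly n_calls times and
# returns the total elapsed time in seconds.
t = timeit.timeit(
    lambda: advance_positions(positions, velocities, accelerations, time_step),
    number=n_calls,
)
print(f"{t / n_calls * 1e6:.2f} µs per call")
```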
The pure-Python version is already quite fast, clocking in at 4.16 µs, so the benchmarks are quite close already! Here is what I get on my laptop.
In [3]: %timeit advance_positions(positions, velocities, accelerations, time_step)
4.64 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit advance_positions_nosimd(positions, velocities, accelerations, time_step)
1.87 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit advance_positions_simd(positions, velocities, accelerations, time_step)
1.9 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: %timeit advance_positions_loops(positions, velocities, accelerations, time_step)
1.79 µs ± 59.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Increasing the shape to 10240, 3 reduces the gap further:
In [13]: %timeit advance_positions(positions, velocities, accelerations, time_step)
52.4 µs ± 509 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [14]: %timeit advance_positions_nosimd(positions, velocities, accelerations, time_step)
52.8 µs ± 74.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [15]: %timeit advance_positions_simd(positions, velocities, accelerations, time_step)
52.8 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [16]: %timeit advance_positions_loops(positions, velocities, accelerations, time_step)
49.1 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
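That convergence is consistent with the kernel becoming memory-bound at this size, in which case SIMD can no longer help. A rough back-of-the-envelope estimate from the numbers above (the factor of four traversals is my assumption: read velocities, read accelerations, read-modify-write positions):

```python
import numpy as np

shape = (10240, 3)
bytes_per_array = int(np.prod(shape)) * 8  # float64
traffic_per_call = 4 * bytes_per_array     # assumed: 3 reads + 1 write
time_per_call = 52.4e-6                    # seconds, from the %timeit above
gbps = traffic_per_call / time_per_call / 1e9
print(f"~{gbps:.1f} GB/s effective bandwidth")
```

This comes out around 19 GB/s, a plausible DRAM bandwidth for a laptop, which would explain why all the compiled variants end up within a few percent of each other.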
It also depends on the processor. I get better results with xsimd on another computer.
What I don't like is that for arrays with shape (any_moderately_large_dim, 3), it is really faster to write the expression on reshaped 1D arrays (no copy if the arrays are contiguous):
Something like:
@jit(native=True, xsimd=False)
def advance_positions_nosimd_rs(positions, velocities, accelerations, time_step):
    size = positions.size
    positions_rs = positions.reshape(size)
    velocities_rs = velocities.reshape(size)
    accelerations_rs = accelerations.reshape(size)
    positions_rs += (
        time_step * velocities_rs + 0.5 * time_step ** 2 * accelerations_rs
    )
This function is clearly more specialized, but I guess it would be possible to apply this specialization automatically when all arrays in the expression have the same shape. Does that make sense?
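A sketch of what such a specialization could look like at the Python level (the helper name and the dispatch condition are mine, not transonic API): flatten to 1D views only when all operands share a shape and are C-contiguous, so that `reshape` is guaranteed not to copy, and fall back to the generic expression otherwise.

```python
import numpy as np


def advance_positions_auto(positions, velocities, accelerations, time_step):
    # Hypothetical dispatch: use flat 1D views only when it is safe,
    # i.e. same shape everywhere and C-contiguous (reshape is a view).
    arrays = (positions, velocities, accelerations)
    same_shape = len({a.shape for a in arrays}) == 1
    if same_shape and all(a.flags.c_contiguous for a in arrays):
        p, v, acc = (a.reshape(a.size) for a in arrays)
    else:
        p, v, acc = arrays
    # The in-place update through the views modifies the original arrays.
    p += time_step * v + 0.5 * time_step ** 2 * acc
```

Since the flat arrays are views, the caller sees the updated `positions` either way; only the shape of the loop the compiler generates changes.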
This script https://gist.github.com/paugier/416b2daeb4e0ae98faba2d33e7d7b87c gives
shape: (1024, 3)
advance_positions : 1 * norm
norm = 1.3e-05 s
advance_positions_simd : 0.638 * norm
advance_positions_nosimd : 0.634 * norm
advance_positions_loops : 0.566 * norm
advance_positions_simd_rs : 0.202 * norm
advance_positions_nosimd_rs : 0.199 * norm
advance_positions_loops_rs : 0.196 * norm
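As an aside, the "no copy if the arrays are contiguous" premise of the `_rs` variants is easy to verify directly in NumPy: for a C-contiguous array `reshape` returns a view, while a non-contiguous input (here a transpose, just as an illustration) forces a copy.

```python
import numpy as np

positions = np.zeros((256, 3))

# C-contiguous input: reshape returns a view sharing the same memory.
flat = positions.reshape(positions.size)
print(np.shares_memory(positions, flat))  # True

# Non-contiguous input (a transposed view): reshape has to copy.
flat_t = positions.T.reshape(positions.size)
print(np.shares_memory(positions, flat_t))  # False
```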