
xsimd makes a function slower

paugier opened this issue on Nov 28, 2020 · 2 comments

I'm surprised that xsimd makes a function slower, and that the implementation with explicit loops is faster:

In [1]: run microbench_simd.py                                                                                                              

In [2]: %timeit advance_positions(positions, velocities, accelerations, time_step)                                                          
4.16 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: %timeit advance_positions_nosimd(positions, velocities, accelerations, time_step)                                                   
1.7 µs ± 6.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit advance_positions_simd(positions, velocities, accelerations, time_step)                                                     
1.95 µs ± 11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit advance_positions_loops(positions, velocities, accelerations, time_step)                                                    
1.45 µs ± 3.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Is it expected?

The full code:

import numpy as np
from transonic import jit, wait_for_all_extensions

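# pure-NumPy reference (interpreted, no compilation)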
def advance_positions(positions, velocities, accelerations, time_step):
    positions += time_step * velocities + 0.5 * time_step ** 2 * accelerations

@jit(native=True, xsimd=False)
def advance_positions_nosimd(positions, velocities, accelerations, time_step):
    positions += time_step * velocities + 0.5 * time_step ** 2 * accelerations

@jit(native=True, xsimd=True)
def advance_positions_simd(positions, velocities, accelerations, time_step):
    positions += time_step * velocities + 0.5 * time_step ** 2 * accelerations

@jit(native=True, xsimd=False)
def advance_positions_loops(positions, velocities, accelerations, time_step):
    n0, n1 = positions.shape
    for i0 in range(n0):
        for i1 in range(n1):
            positions[i0, i1] += time_step * velocities[i0, i1] + 0.5 * time_step ** 2 * accelerations[i0, i1]

shape = 256, 3
positions = np.zeros(shape)
velocities = np.zeros_like(positions)
accelerations = np.zeros_like(positions)
time_step = 1.0

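# call each jitted function once to trigger compilation of the extensions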
advance_positions_nosimd(positions, velocities, accelerations, time_step)
advance_positions_simd(positions, velocities, accelerations, time_step)
advance_positions_loops(positions, velocities, accelerations, time_step)

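# block until all the extensions have been compiled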
wait_for_all_extensions()

paugier commented on Nov 28, 2020

The pure-Python version is already quite fast, clocking in at 4.16 µs, so the benchmarks are already quite close. Here is what I get on my laptop.

In [3]:  %timeit advance_positions(positions, velocities, accelerations, time_step)
4.64 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: %timeit advance_positions_nosimd(positions, velocities, accelerations, time_step)
1.87 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [5]: %timeit advance_positions_simd(positions, velocities, accelerations, time_step)
1.9 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit advance_positions_loops(positions, velocities, accelerations, time_step)
1.79 µs ± 59.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Increasing the shape to (10240, 3) reduces the gap further:

In [13]: %timeit advance_positions(positions, velocities, accelerations, time_step)
52.4 µs ± 509 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [14]: %timeit advance_positions_nosimd(positions, velocities, accelerations, time_step)
52.8 µs ± 74.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [15]: %timeit advance_positions_simd(positions, velocities, accelerations, time_step)
52.8 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [16]: %timeit advance_positions_loops(positions, velocities, accelerations, time_step)
49.1 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

ashwinvis commented on Dec 8, 2020

It also depends on the processor. I get better results with xsimd on another computer.

What I don't like is that, for arrays with shape (any_moderately_large_dim, 3), it is much faster to write the expression on reshaped 1D arrays (no copy if the arrays are contiguous):

Something like:

@jit(native=True, xsimd=False)
def advance_positions_nosimd_rs(positions, velocities, accelerations, time_step):
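    # flatten to 1D views (no copy as long as the arrays are contiguous)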
    size = positions.size
    positions_rs = positions.reshape(size)
    velocities_rs = velocities.reshape(size)
    accelerations_rs = accelerations.reshape(size)
    positions_rs += (
        time_step * velocities_rs + 0.5 * time_step ** 2 * accelerations_rs
    )

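The no-copy claim is easy to check in plain NumPy (a quick sanity check, not part of the benchmark): reshape returns a view when the array is contiguous and only copies otherwise.

import numpy as np

a = np.zeros((1024, 3))  # C-contiguous by default
flat = a.reshape(a.size)
assert np.shares_memory(a, flat)  # a view, not a copy

b = a[:, ::2]  # non-contiguous slice
assert not np.shares_memory(b, b.reshape(b.size))  # here reshape must copy
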
This function is clearly more specialized, but I guess it would be possible to add a special case that does this automatically when all arrays in the expression have the same shape. Does that make sense?

This script https://gist.github.com/paugier/416b2daeb4e0ae98faba2d33e7d7b87c gives:

shape: (1024, 3)
advance_positions                :     1 * norm
norm = 1.3e-05 s
advance_positions_simd           : 0.638 * norm
advance_positions_nosimd         : 0.634 * norm
advance_positions_loops          : 0.566 * norm
advance_positions_simd_rs        : 0.202 * norm
advance_positions_nosimd_rs      : 0.199 * norm
advance_positions_loops_rs       : 0.196 * norm

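For reference, a minimal sketch of what the loops-over-flat-arrays variant presumably looks like (assuming the gist combines the two patterns shown above; the actual code is in the gist):

@jit(native=True, xsimd=False)
def advance_positions_loops_rs(positions, velocities, accelerations, time_step):
    # flatten to 1D views, then loop over the flat arrays
    size = positions.size
    positions_rs = positions.reshape(size)
    velocities_rs = velocities.reshape(size)
    accelerations_rs = accelerations.reshape(size)
    for i in range(size):
        positions_rs[i] += (
            time_step * velocities_rs[i]
            + 0.5 * time_step ** 2 * accelerations_rs[i]
        )
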
paugier commented on Dec 11, 2020