
Make unary ops more efficient with non-contiguous inputs


Unary operators (e.g. sigmoid, tanh) are much less efficient with non-contiguous inputs. The problem is two-fold:

  • For SIMD-vectorized operators (e.g. tanh), the fast path calls a SIMD function that applies the operator to the entire contiguous buffer at once. For non-contiguous inputs, it falls back to iterating over the input and applying the operator one element at a time. Even worse, the fast path is parallelized whereas the fallback is not.
  • The slow path of TensorBase::apply uses a strided iterator, which is much less efficient than iterating over a contiguous slice. See also https://github.com/robertknight/rten/issues/189. Both paths are sketched below.
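
For illustration, here is a minimal, self-contained sketch of the two paths described above. The names (`fast_path`, `slow_path`) and the offset iterator are hypothetical and do not correspond to rten's actual code:

```rust
// Sketch only: models the contiguous fast path vs the strided fallback.
fn fast_path(data: &mut [f32], op: impl Fn(f32) -> f32) {
    // One pass over a contiguous slice; a real implementation would call a
    // SIMD-vectorized kernel here and could split the slice across threads.
    for x in data.iter_mut() {
        *x = op(*x);
    }
}

fn slow_path(data: &mut [f32], offsets: impl Iterator<Item = usize>, op: impl Fn(f32) -> f32) {
    // Each element is reached via a computed offset, one scalar call at a
    // time, which defeats both vectorization and parallelism.
    for offset in offsets {
        data[offset] = op(data[offset]);
    }
}

fn main() {
    let mut buf = vec![0.5_f32; 8];
    fast_path(&mut buf, f32::tanh);

    // Visit every other element, as a transposed or strided view might.
    let n = buf.len();
    slow_path(&mut buf, (0..n).step_by(2), f32::tanh);
    println!("{:?}", buf);
}
```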

A better implementation would be something like the following (a rough sketch follows the list):

  • Sort dimensions into maximally-contiguous order
  • If the size of the longest contiguous chunks is above a threshold, iterate over them and apply the operator
  • If the size is below the threshold, use nested loops instead of an iterator, as in https://github.com/robertknight/rten/issues/189
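
Below is a rough, self-contained sketch of that approach. It is a speculative outline, not rten's implementation: the `CHUNK_THRESHOLD` value, the scalar inner loop (standing in for a SIMD kernel), and the assumption that strides are given in elements are all illustrative.

```rust
/// Sketch of the proposed dispatch, assuming `strides` are in elements and
/// describe a view into `data` starting at offset 0. Threshold is arbitrary.
const CHUNK_THRESHOLD: usize = 32;

fn apply_strided(data: &mut [f32], shape: &[usize], strides: &[usize], op: impl Fn(f32) -> f32) {
    if shape.is_empty() {
        // Scalar (0-d) view.
        if let Some(x) = data.first_mut() {
            *x = op(*x);
        }
        return;
    }

    // Step 1: sort dimensions so the smallest-stride (most contiguous)
    // dimension is innermost.
    let mut dims: Vec<usize> = (0..shape.len()).collect();
    dims.sort_by_key(|&d| std::cmp::Reverse(strides[d]));
    let shape: Vec<usize> = dims.iter().map(|&d| shape[d]).collect();
    let strides: Vec<usize> = dims.iter().map(|&d| strides[d]).collect();

    let inner_len = *shape.last().unwrap();
    let inner_stride = *strides.last().unwrap();
    let inner_contiguous = inner_stride == 1;

    // Step 2: walk every combination of the outer indices and process the
    // innermost run starting at the corresponding offset.
    let outer_count: usize = shape[..shape.len() - 1].iter().product();
    for outer in 0..outer_count {
        let mut rem = outer;
        let mut offset = 0;
        for d in (0..shape.len() - 1).rev() {
            offset += (rem % shape[d]) * strides[d];
            rem /= shape[d];
        }
        if inner_contiguous && inner_len >= CHUNK_THRESHOLD {
            // Long contiguous run: process the whole slice in one pass. A real
            // implementation would call the SIMD-vectorized kernel here.
            for x in &mut data[offset..offset + inner_len] {
                *x = op(*x);
            }
        } else {
            // Short or strided run: plain nested loop over computed offsets,
            // avoiding per-element iterator overhead.
            for i in 0..inner_len {
                let idx = offset + i * inner_stride;
                data[idx] = op(data[idx]);
            }
        }
    }
}
```

Sorting by stride means the inner loop always operates on the most contiguous runs available, so a permuted or transposed view still benefits whenever one of its axes is contiguous and long enough to clear the threshold.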

Once this is done, RNN operators (e.g. GRU, LSTM), which currently copy activations before applying them, can switch to the in-place versions to reduce copying.

  • [x] Change TensorBase::apply to avoid using an iterator (https://github.com/robertknight/rten/pull/223)
  • [ ] Add better slow path for SIMD-vectorized unary ops

robertknight avatar May 20 '24 06:05 robertknight