Make unary ops more efficient with non-contiguous inputs
Unary operators (e.g. sigmoid, tanh) are much less efficient with non-contiguous inputs than with contiguous ones. The problem is two-fold:
- For SIMD-vectorized operators (e.g. tanh), the fast path calls a SIMD function that applies the operator to the entire contiguous buffer at once. For non-contiguous inputs, it falls back to iterating over the input and applying the operator one element at a time. Even worse, the fast path is parallel whereas the fallback is not.
- In the slow path, `TensorBase::apply` uses an iterator, which is much less efficient than iterating over contiguous inputs. See also https://github.com/robertknight/rten/issues/189.
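To make the gap concrete, here is a minimal standalone sketch (not rten's actual code) of the two paths for a sigmoid operator: a contiguous loop over a slice that the compiler or an explicit SIMD kernel can vectorize, versus a strided fallback that computes an offset per element:

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

// Fast path: the whole buffer is contiguous, so one tight loop over a
// slice suffices. This loop shape is easy to vectorize and parallelize.
fn sigmoid_contiguous(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = sigmoid(*x);
    }
}

// Fallback for a non-contiguous 2-D view: each element is reached through
// computed strides, one at a time, so the vectorized kernel cannot be used.
fn sigmoid_strided(data: &mut [f32], shape: [usize; 2], strides: [usize; 2]) {
    for i in 0..shape[0] {
        for j in 0..shape[1] {
            let off = i * strides[0] + j * strides[1];
            data[off] = sigmoid(data[off]);
        }
    }
}

fn main() {
    let mut a = vec![0.0_f32; 4];
    sigmoid_contiguous(&mut a);
    // A 2x2 view over an 8-element buffer with row stride 4 and column
    // stride 2 touches only every other element.
    let mut b = vec![0.0_f32; 8];
    sigmoid_strided(&mut b, [2, 2], [4, 2]);
    println!("{:?} {:?}", a, b);
}
```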
A better implementation would be something like:
- Sort dimensions into maximally-contiguous order
- If the size of the longest contiguous chunks is above a threshold, iterate over them and apply the operator
- If the size is below the threshold, use nested loops instead of an iterator, as in https://github.com/robertknight/rten/issues/189
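The steps above can be sketched as a standalone function over a hypothetical strided view (names, types, and the threshold value are invented for illustration; this is not rten's implementation):

```rust
/// Hypothetical strided view over a flat buffer (not rten's TensorBase).
struct StridedView<'a> {
    data: &'a mut [f32],
    shape: Vec<usize>,
    strides: Vec<usize>, // strides in elements, not bytes
}

fn apply_unary(view: &mut StridedView, op: impl Fn(f32) -> f32 + Copy) {
    // Step 1: sort dimensions by descending stride so that the innermost
    // dimensions are the most contiguous ones.
    let mut dims: Vec<usize> = (0..view.shape.len()).collect();
    dims.sort_by(|&a, &b| view.strides[b].cmp(&view.strides[a]));
    let shape: Vec<usize> = dims.iter().map(|&d| view.shape[d]).collect();
    let strides: Vec<usize> = dims.iter().map(|&d| view.strides[d]).collect();

    // Step 2: find the length of the longest contiguous inner chunk, i.e.
    // the innermost dims whose strides match a densely packed layout.
    let mut chunk_len = 1;
    let mut outer = shape.len();
    for d in (0..shape.len()).rev() {
        if strides[d] == chunk_len {
            chunk_len *= shape[d];
            outer = d;
        } else {
            break;
        }
    }

    // Assumed tuning parameter; the real threshold would need benchmarking.
    const THRESHOLD: usize = 16;

    // Step 3: walk every combination of the outer (non-contiguous) dims
    // with an odometer-style counter and handle one chunk per offset.
    let mut index = vec![0usize; outer];
    let outer_count: usize = shape[..outer].iter().product();
    for _ in 0..outer_count {
        let offset: usize = index.iter().zip(&strides).map(|(i, s)| i * s).sum();
        let chunk = &mut view.data[offset..offset + chunk_len];
        if chunk_len >= THRESHOLD {
            // Large contiguous chunk: in rten this is where the parallel,
            // SIMD-vectorized kernel would run on the slice.
            for x in chunk.iter_mut() {
                *x = op(*x);
            }
        } else {
            // Small chunks: plain indexed loop, avoiding a generic
            // n-dimensional iterator.
            for i in 0..chunk_len {
                chunk[i] = op(chunk[i]);
            }
        }
        // Increment the odometer over the outer dims.
        for d in (0..outer).rev() {
            index[d] += 1;
            if index[d] < shape[d] {
                break;
            }
            index[d] = 0;
        }
    }
}

fn main() {
    // A 2x3 view over an 8-element buffer with row stride 4: the last
    // element of each row is skipped, so the view is non-contiguous.
    let mut data: Vec<f32> = (0..8).map(|x| x as f32).collect();
    let mut view = StridedView {
        data: &mut data,
        shape: vec![2, 3],
        strides: vec![4, 1],
    };
    apply_unary(&mut view, |x| x * 2.0);
    println!("{:?}", data);
}
```

For a fully contiguous input the chunk covers the whole buffer and the outer loop runs exactly once, so the fast path degenerates to the existing contiguous kernel.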
Once this is done, copying activations in RNN operators (eg. GRU, LSTM) can be replaced with their in-place versions to reduce copying.
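As a toy illustration of that difference (function names are invented for this sketch, not rten's API): the out-of-place form allocates a fresh buffer per step, while the in-place form overwrites the gate activations directly:

```rust
// Out-of-place: allocates and returns a new buffer on every call.
fn tanh_copy(xs: &[f32]) -> Vec<f32> {
    xs.iter().map(|x| x.tanh()).collect()
}

// In-place: overwrites the input with no allocation. This only pays off
// once apply() is fast on the possibly non-contiguous slices that GRU/LSTM
// steps operate on.
fn tanh_in_place(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = x.tanh();
    }
}

fn main() {
    let gate = vec![0.0_f32, 1.0];
    let copied = tanh_copy(&gate);
    let mut in_place = gate.clone();
    tanh_in_place(&mut in_place);
    assert_eq!(copied, in_place);
    println!("{:?}", in_place);
}
```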
- [x] Change `TensorBase::apply` to avoid using an iterator (https://github.com/robertknight/rten/pull/223)
- [ ] Add a better slow path for SIMD-vectorized unary ops