xsimd
Implement transpose
We should implement a transpose operation that transposes N×N matrix blocks, where N is the batch width (the number of lanes in a batch). The interface should probably look like the one for haddp, i.e. take a pointer to an array of N rows:
template <class T, std::size_t N>
void transpose(batch<T, N>* rows)
{
    // ... in-place transpose of the N batches pointed to by rows ...
}
Note that MIPP has very readable implementations of this: https://github.com/aff3ct/MIPP
Implementations are also available in Eigen3's PacketMath.
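To pin down the intended semantics, here is a minimal scalar sketch of the proposed interface. It is not xsimd code: `toy_batch` is a hypothetical stand-in for `batch<T, N>` that stores its lanes in a `std::array`, and the element-wise swap loop only illustrates what the result should be. A real implementation would presumably use per-architecture shuffle/unpack intrinsics, as MIPP and Eigen do.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for xsimd's batch<T, N>: N lanes of type T.
// (Assumption for illustration only; the real batch wraps a SIMD register.)
template <class T, std::size_t N>
struct toy_batch
{
    std::array<T, N> lanes;
};

// Scalar reference: treats rows[0..N-1] as an N x N block and transposes it
// in place by swapping each element above the diagonal with its mirror.
template <class T, std::size_t N>
void transpose(toy_batch<T, N>* rows)
{
    assert(rows != nullptr);
    for (std::size_t i = 0; i < N; ++i)
    {
        for (std::size_t j = i + 1; j < N; ++j)
        {
            std::swap(rows[i].lanes[j], rows[j].lanes[i]);
        }
    }
}
```

Called on an array of N such rows, e.g. `toy_batch<float, 4> rows[4]; transpose(rows);`, lane j of row i ends up holding what was lane i of row j. A scalar version like this could also serve as the reference that vectorized implementations are tested against.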