hlslpp
Optimize NEON shuffles
The NEON shuffles are currently too generic and inefficient. We can probably specialize most swizzle combinations using constructs such as
vcombine_f32(vget_high_f32(x), vget_low_f32(y)), vrev64q_f32(x),
etc.
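As a rough sketch of what such specializations could look like, the snippet below pairs the two constructs mentioned above with a generic per-lane reference. The names `Float4`, `shuffle_generic`, `combine_high_low`, `swap_pairs`, and `check_specializations` are hypothetical and not part of hlslpp; the non-NEON fallback exists only so the sketch builds on other targets.

```cpp
#include <cassert>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1
#else
#define SKETCH_HAS_NEON 0
#endif

// Hypothetical 4-float container used only for this sketch.
struct Float4 { float v[4]; };

// Generic reference: picks lanes X, Y from a and Z, W from b, one at a time.
// This models the "too generic" path that the specializations would replace.
template <int X, int Y, int Z, int W>
Float4 shuffle_generic(const Float4& a, const Float4& b)
{
    return Float4{{ a.v[X], a.v[Y], b.v[Z], b.v[W] }};
}

// Specialization for the pattern (a.z, a.w, b.x, b.y).
Float4 combine_high_low(const Float4& a, const Float4& b)
{
#if SKETCH_HAS_NEON
    float32x4_t x = vld1q_f32(a.v);
    float32x4_t y = vld1q_f32(b.v);
    // High half of x becomes the low half of the result, low half of y the high half.
    float32x4_t r = vcombine_f32(vget_high_f32(x), vget_low_f32(y));
    Float4 out;
    vst1q_f32(out.v, r);
    return out;
#else
    return shuffle_generic<2, 3, 0, 1>(a, b);
#endif
}

// Specialization for the pattern (a.y, a.x, a.w, a.z).
Float4 swap_pairs(const Float4& a)
{
#if SKETCH_HAS_NEON
    // vrev64q_f32 reverses the two 32-bit lanes inside each 64-bit half.
    float32x4_t r = vrev64q_f32(vld1q_f32(a.v));
    Float4 out;
    vst1q_f32(out.v, r);
    return out;
#else
    return shuffle_generic<1, 0, 3, 2>(a, a);
#endif
}

// Checks that each specialization matches the generic reference lane by lane.
void check_specializations()
{
    const Float4 a = {{ 1.0f, 2.0f, 3.0f, 4.0f }};
    const Float4 b = {{ 5.0f, 6.0f, 7.0f, 8.0f }};
    const Float4 c0 = combine_high_low(a, b);
    const Float4 r0 = shuffle_generic<2, 3, 0, 1>(a, b);
    const Float4 c1 = swap_pairs(a);
    const Float4 r1 = shuffle_generic<1, 0, 3, 2>(a, a);
    for (int i = 0; i < 4; ++i)
    {
        assert(c0.v[i] == r0.v[i]);
        assert(c1.v[i] == r1.v[i]);
    }
}
```

In the library itself the dispatch would presumably happen at compile time on the swizzle indices, so each pattern maps to one or two instructions instead of the generic path; how that fits into the existing shuffle code is left open here.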