minimp3
Replace 8*PEXTRW with 1*MOVDQU in f32_to_s16
The existing code writes each block of eight samples with eight sequential, unrolled PEXTRW instructions, a pattern that compilers generally do not recognize and merge into a single MOVDQU.
Replacing them by hand with the unaligned store intrinsic (_mm_storeu_si128) produces identical output and is a large performance win on the SSE path.
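
For illustration, here is a minimal sketch of the two store variants side by side. It is not the actual minimp3 source: the helper names (convert8, store8_pextrw, store8_movdqu, stores_match) are hypothetical, and the conversion step is only an assumed scale-clamp-pack sequence so the example has something to store. Only the intrinsics themselves (_mm_extract_epi16, _mm_storeu_si128) are real SSE2 calls.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: scale 8 floats to the int16 range, clamp, and pack
 * them into one 128-bit register holding 8 saturated 16-bit samples. */
static __m128i convert8(const float *in)
{
    const __m128 scale = _mm_set1_ps(32768.0f);
    const __m128 vmax  = _mm_set1_ps(32767.0f);
    const __m128 vmin  = _mm_set1_ps(-32768.0f);
    __m128 a = _mm_mul_ps(_mm_loadu_ps(in), scale);
    __m128 b = _mm_mul_ps(_mm_loadu_ps(in + 4), scale);
    return _mm_packs_epi32(
        _mm_cvtps_epi32(_mm_max_ps(_mm_min_ps(a, vmax), vmin)),
        _mm_cvtps_epi32(_mm_max_ps(_mm_min_ps(b, vmax), vmin)));
}

/* Old pattern: eight element extractions (PEXTRW), one 16-bit store each. */
static void store8_pextrw(int16_t *out, __m128i pcm8)
{
    out[0] = (int16_t)_mm_extract_epi16(pcm8, 0);
    out[1] = (int16_t)_mm_extract_epi16(pcm8, 1);
    out[2] = (int16_t)_mm_extract_epi16(pcm8, 2);
    out[3] = (int16_t)_mm_extract_epi16(pcm8, 3);
    out[4] = (int16_t)_mm_extract_epi16(pcm8, 4);
    out[5] = (int16_t)_mm_extract_epi16(pcm8, 5);
    out[6] = (int16_t)_mm_extract_epi16(pcm8, 6);
    out[7] = (int16_t)_mm_extract_epi16(pcm8, 7);
}

/* New pattern: one unaligned 128-bit store (MOVDQU) of all eight samples. */
static void store8_movdqu(int16_t *out, __m128i pcm8)
{
    _mm_storeu_si128((__m128i *)out, pcm8);
}

/* Returns 1 when both variants write the same eight samples. */
static int stores_match(const float *in)
{
    int16_t a[8], b[8];
    __m128i pcm8 = convert8(in);
    store8_pextrw(a, pcm8);
    store8_movdqu(b, pcm8);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both functions write the same 16 bytes to out, which is why the change does not alter the decoded samples. The unaligned store is used because nothing in this sketch guarantees that out is 16-byte aligned.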