Identify and implement SSE/SSE2 optimized sections

Open jarikomppa opened this issue 9 years ago • 15 comments

There are some pieces that could benefit from SSE optimizations. For example, the roundoff clipper was sped up 3x in an experiment.

Other potential places may be the float->16bit converter, cubic resampler, 3d audio calculations, FFT maybe?

Based on profiling, the vast majority of the time goes to audio sources, though. Some filters may also be potential targets.

jarikomppa avatar May 26 '15 17:05 jarikomppa

Added SSE optimized clippers. A lot of changes were needed to make sure that the buffers are 16-byte aligned (basically buffers coming from backends & scratch buffers have to be).
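As an illustration of the alignment requirement (a minimal sketch, not SoLoud's actual buffer class): one common way to get a 16-byte aligned scratch buffer is to over-allocate by 15 bytes and round the start address up. The class name `AlignedFloatScratch` and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: owns storage for `floats` floats whose first element
// is 16-byte aligned, as aligned SSE loads and stores require.
class AlignedFloatScratch {
public:
    explicit AlignedFloatScratch(std::size_t floats)
        : mStorage(floats * sizeof(float) + 15), mFloats(floats) {
        // Over-allocate by 15 bytes, then round the address up to a multiple of 16.
        std::uintptr_t base = reinterpret_cast<std::uintptr_t>(mStorage.data());
        std::uintptr_t aligned = (base + 15) & ~static_cast<std::uintptr_t>(15);
        mData = reinterpret_cast<float*>(aligned);
        assert(is_aligned16(mData));
    }
    // Copying would leave mData pointing into the other object's storage.
    AlignedFloatScratch(const AlignedFloatScratch&) = delete;
    AlignedFloatScratch& operator=(const AlignedFloatScratch&) = delete;

    float* data() { return mData; }
    std::size_t size() const { return mFloats; }

    // True if p sits on a 16-byte boundary.
    static bool is_aligned16(const void* p) {
        return (reinterpret_cast<std::uintptr_t>(p) & 15) == 0;
    }

private:
    std::vector<unsigned char> mStorage;  // raw bytes, possibly unaligned
    float* mData;                         // aligned view into mStorage
    std::size_t mFloats;
};
```

The same `is_aligned16` check could be used to decide whether a buffer coming from a backend is safe for aligned loads.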

Didn't update all backends yet, so some builds may be broken.

jarikomppa avatar May 27 '15 20:05 jarikomppa

Worked on the aligned buffers a bit more; the mix() interface is back to using unaligned buffers.

jarikomppa avatar May 28 '15 06:05 jarikomppa

Could always impose the restriction that buffers passed to mix() must be 16-byte aligned? Optionally have it just fall back to scalar processing if unaligned memory is passed in? Would be nice to catch or warn on this behaviour of course.

vk2gpu avatar May 28 '15 06:05 vk2gpu

The unaligned buffers thing isn't an issue anymore, I just shuffled the way buffer-to-buffer processing was done in mix(), and got rid of a buffer copy for 16bit samples at the same time.

jarikomppa avatar May 28 '15 06:05 jarikomppa

Also, requiring aligned buffers over foreign interfaces would not have worked.

jarikomppa avatar May 28 '15 06:05 jarikomppa

One thing I did consider (and may end up doing in other SSE optimizations should I do more) is to handle the start and end as scalars and the aligned middle as SSE.
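A minimal sketch of that head/middle/tail idea, assuming a plain hard clip to [-1, 1] as a stand-in for the roundoff clipper (the function names are hypothetical, and this is not code from SoLoud):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Scalar hard clip of one sample to [-1, 1].
static inline float clip_one(float v) {
    return v < -1.0f ? -1.0f : (v > 1.0f ? 1.0f : v);
}

// Hypothetical function: clips n floats in place, whatever the buffer alignment.
void clip_buffer(float* buf, std::size_t n) {
    assert(buf != nullptr || n == 0);
    std::size_t i = 0;
    // Head: scalar until the address is 16-byte aligned (or samples run out).
    while (i < n && (reinterpret_cast<std::uintptr_t>(buf + i) & 15) != 0) {
        buf[i] = clip_one(buf[i]);
        ++i;
    }
    // Middle: four samples per iteration with aligned loads and stores.
    const __m128 lo = _mm_set1_ps(-1.0f);
    const __m128 hi = _mm_set1_ps(1.0f);
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_load_ps(buf + i);
        v = _mm_min_ps(_mm_max_ps(v, lo), hi);
        _mm_store_ps(buf + i, v);
    }
    // Tail: the remaining 0..3 samples.
    for (; i < n; ++i) {
        buf[i] = clip_one(buf[i]);
    }
}
```

With this layout the caller never has to guarantee alignment or a sample count divisible by 4; the cost is at most three scalar iterations at each end.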

jarikomppa avatar May 28 '15 06:05 jarikomppa

Thanks to the email notifications, SoLoud has been brought to my attention again...

One thing I have been thinking about for when I get round to adding sound back into my home engine is looking at writing some ISPC kernels for performing mixing, clipping, filters, etc. Curious how you'd feel about that? I wouldn't say it needs to be embedded into SoLoud itself, but the ability to provide your own functions to perform that processing could be nice (perhaps just on a #define rather than adding to the interfaces)

vk2gpu avatar Feb 21 '18 09:02 vk2gpu

There is another issue with the SSE clipping if the number of samples requested is not a multiple of 4. I fixed this by making sure that the scratch buffer size is always rounded up and by changing clip to process a rounded-up number of samples. This will clip up to 3 additional samples at the end that contain undefined data, but that shouldn't be an issue since they won't be read.

Changed code:
postinit: mScratchSize = (aBufferSize+3)&~0x3;
clip: for (i = 0; i < (aSamples+3) / 4; i++)
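To make the rounding concrete, here is a small sketch of the two expressions above. The helper names are hypothetical, and the inner loop is a scalar stand-in for one 4-wide SSE operation. A count such as 441 samples (10 ms at 44.1 kHz) rounds up to 444 and needs 111 groups of four.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round a sample count up to the next multiple of 4 (one SSE register of floats).
constexpr std::size_t round_up4(std::size_t n) {
    return (n + 3) & ~static_cast<std::size_t>(3);
}

// Number of 4-float groups needed to cover n samples.
constexpr std::size_t quad_count(std::size_t n) {
    return (n + 3) / 4;
}

// Hypothetical clip over a padded buffer: processes whole groups of four,
// so up to 3 padding samples past `samples` are clipped as well.
void clip_padded(float* buf, std::size_t samples, std::size_t capacity) {
    assert(capacity >= round_up4(samples));  // padding must actually exist
    for (std::size_t q = 0; q < quad_count(samples); ++q) {
        for (std::size_t k = 0; k < 4; ++k) {  // stand-in for one SSE op
            float& v = buf[q * 4 + k];
            v = v < -1.0f ? -1.0f : (v > 1.0f ? 1.0f : v);
        }
    }
}
```

The padding samples are written but never consumed, which is why the rounded-up capacity is the part that matters.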

stegei avatar May 15 '18 13:05 stegei

Wait, what? Sample buffer that's not divisible by 4? What back end?

jarikomppa avatar May 15 '18 15:05 jarikomppa

I'm using a custom WASAPI backend which uses a buffer size that is a multiple of the minimum device period (as reported by audioClient->GetDevicePeriod).

stegei avatar May 16 '18 09:05 stegei

Curious. I wouldn't have expected that. Thanks for catching this.

jarikomppa avatar May 16 '18 10:05 jarikomppa

#235 adds support for float to short interlacing for mono and stereo streams (it processes 16 floats at a time in either case). I ended up using 2 SSE2 instructions, which made it easier to implement. Since SSE2 was first introduced in 2001, I don't think that will be much of a problem. Maybe we could add another macro to control SSE2 blocks instead of SOLOUD_SSE_INTRINSICS, because the current macro name is misleading. Another option might be to replace it with the following macros:

SOLOUD_ASM_INTRINSICS // Controls all assembly intrinsics at once
SOLOUD_ASM_INTRINSICS_SSE // Controls only SSE intrinsics
SOLOUD_ASM_INTRINSICS_SSE2 // Controls only SSE/SSE2 intrinsics
// ...
SOLOUD_ASM_INTRINSICS_NEON // Controls only Neon intrinsics
// etc.
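For readers unfamiliar with the float to short conversion mentioned at the start of this comment, a minimal sketch follows. It uses two SSE2 intrinsics that are a natural fit (`_mm_cvtps_epi32` and `_mm_packs_epi32`); whether these are the two used in #235 is an assumption. The function name is hypothetical, and the sketch handles 8 floats per iteration without any channel interlacing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical sketch: convert n floats (nominally in [-1, 1]) to signed
// 16-bit samples. Assumes n is a multiple of 8; pointers may be unaligned.
void float_to_s16(const float* src, std::int16_t* dst, std::size_t n) {
    assert(n % 8 == 0);
    const __m128 scale = _mm_set1_ps(32767.0f);
    for (std::size_t i = 0; i < n; i += 8) {
        __m128 a = _mm_mul_ps(_mm_loadu_ps(src + i), scale);
        __m128 b = _mm_mul_ps(_mm_loadu_ps(src + i + 4), scale);
        // SSE2: float -> int32 using the current rounding mode.
        __m128i ia = _mm_cvtps_epi32(a);
        __m128i ib = _mm_cvtps_epi32(b);
        // SSE2: pack 2 x 4 int32 into 8 int16 with signed saturation,
        // so out-of-range input clamps instead of wrapping around.
        __m128i packed = _mm_packs_epi32(ia, ib);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), packed);
    }
}
```

The saturating pack gives clamping for free, which is one reason the SSE2 route is simpler than doing the narrowing with SSE alone.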

osman-turan avatar Oct 30 '19 15:10 osman-turan

According to the Steam Hardware Survey, SSE3 is now safe to use. SSE2 has been the default ISA in Visual Studio for a few years now, so SSE2 is definitely not a problem.

panAndExpand shows up as a major player in performance metrics, so it looks like a good candidate for SIMD.

jarikomppa avatar Feb 07 '20 13:02 jarikomppa

The recent release sets FPU flags on the audio thread to treat denormals as zero, which gave a clear performance boost in a simple synthetic test.
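On x86 with SSE, one way to get this behaviour is to set the flush-to-zero and denormals-are-zero bits in the MXCSR register. The sketch below shows that approach; which flags SoLoud itself sets is not stated here, and the function names are hypothetical. MXCSR is per-thread state, so the call has to happen on the audio thread itself.

```cpp
#include <cassert>
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

// True if both denormal-related bits are set for the calling thread.
bool denormal_flush_enabled() {
    return _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON &&
           _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON;
}

// Hypothetical helper: call once at the start of the audio thread.
void enable_denormal_flush() {
    // FTZ: denormal results are flushed to zero.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    // DAZ: denormal inputs are treated as zero.
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    assert(denormal_flush_enabled());
}
```

Decaying filter and reverb tails are the typical source of denormals in audio code, which is why this tends to show up as a measurable win.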

jarikomppa avatar Feb 22 '20 08:02 jarikomppa

Added SIMD optimizations for the panAndExpand cases 1->2 and 2->2, which are (probably) the most common ones. The speedup was massive: a test that took 0.150 seconds now takes 0.030.
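To show roughly what the 1->2 case involves, here is a simplified sketch, not the actual panAndExpand code. It assumes constant per-channel gains, separate (non-interleaved) left and right output buffers, and mixing by accumulation; any volume ramping the real function does is left out. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical sketch: mix a mono source into separate left/right buffers,
// scaling by one constant gain per output channel.
void pan_mono_to_stereo(const float* in, float* outL, float* outR,
                        std::size_t n, float gainL, float gainR) {
    assert(in != nullptr && outL != nullptr && outR != nullptr);
    const __m128 gl = _mm_set1_ps(gainL);
    const __m128 gr = _mm_set1_ps(gainR);
    std::size_t i = 0;
    // Four input samples produce four left and four right samples per pass.
    for (; i + 4 <= n; i += 4) {
        __m128 s = _mm_loadu_ps(in + i);
        __m128 l = _mm_add_ps(_mm_loadu_ps(outL + i), _mm_mul_ps(s, gl));
        __m128 r = _mm_add_ps(_mm_loadu_ps(outR + i), _mm_mul_ps(s, gr));
        _mm_storeu_ps(outL + i, l);
        _mm_storeu_ps(outR + i, r);
    }
    // Scalar tail for the remaining 0..3 samples.
    for (; i < n; ++i) {
        outL[i] += in[i] * gainL;
        outR[i] += in[i] * gainR;
    }
}
```

Each input load is reused for both output channels, so the work per sample is two multiply-adds and no branching.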

jarikomppa avatar Feb 22 '20 14:02 jarikomppa