
Potential optimizations using System.Span, System.Memory, and stackalloc

Open lostromb opened this issue 6 years ago • 8 comments

Concentus interpreted almost all C pointers as a combination of an array and an integer offset, e.g. `int[] x, int x_ptr`. These could potentially all be replaced by the new `Span<T>` construct in .NET Core 2.0 for better optimization. Furthermore, most of the C code's stack array allocations are done on the heap in C#, which also hurts performance (except where they were hand-inlined inside the FFT). See https://msdn.microsoft.com/en-us/magazine/mt814808.aspx
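As a rough sketch of the pattern (the function and variable names here are illustrative, not actual Concentus code):

```csharp
// Old pattern: every C pointer becomes an array plus an integer offset.
static int Sum(int[] x, int x_ptr, int len)
{
    int acc = 0;
    for (int i = 0; i < len; i++)
        acc += x[x_ptr + i];
    return acc;
}

// Span<T> version: the offset is folded into the span itself,
// and the JIT can often elide redundant bounds checks.
static int Sum(ReadOnlySpan<int> x)
{
    int acc = 0;
    for (int i = 0; i < x.Length; i++)
        acc += x[i];
    return acc;
}

// Call sites change from Sum(buffer, 16, 64)
// to Sum(buffer.AsSpan(16, 64)).
```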

lostromb avatar May 31 '18 20:05 lostromb

Hello, sir. How can I help? I'm not really good at low-level stuff, but I'm ready to try where I can.

prepconcede avatar Sep 15 '21 13:09 prepconcede

Uh, yeah, sure. You can start with some of the tight-loop kernels such as in here. Replace the `byte[] array, int array_offset, int array_length` function signatures with `Span<byte>` and then work upwards until the build passes again. Run the parity test console for a while to ensure there are no regressions. There are also places that allocate temporary heap arrays which could be switched to `stackalloc`, such as here in the NLSF encoder. You will probably have to reference the original C code to determine safe fixed-length array sizes, since you can't do variable-length `stackalloc`.
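A sketch of the second change (the constant, sizes, and names are illustrative, not the actual NLSF encoder code):

```csharp
// Fixed upper bound taken from the original C sources
// (e.g. the maximum LPC order), so the stack usage is bounded.
const int MAX_ORDER = 16;

static void Process(ReadOnlySpan<short> nlsf)
{
    // Before: short[] tmp = new short[nlsf.Length];  (heap allocation per call)
    // After: a stack buffer sized to the compile-time maximum,
    // sliced down to the length actually needed.
    Span<short> tmp = stackalloc short[MAX_ORDER];
    Span<short> used = tmp.Slice(0, nlsf.Length);
    nlsf.CopyTo(used);
    // ... work on 'used' without touching the heap ...
}
```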

lostromb avatar Sep 15 '21 18:09 lostromb

I took a quick shot at this because I was curious whether .NET 6 carries any measurable performance benefit for this application. Porting the obvious places in the kernels, xcorr, VQ, and NLSF code yielded a slight (1-2%) benefit. Interestingly, when I ported over the MDCT code to use spans, performance decreased by about 2%. My hunch is that the pointer increments in the C code (such as in `kf_bfly5()`) run faster when they are interpreted as incrementing an array index variable (`Fout_ptr++`) rather than doing `Fout = Fout.Slice(1)`. So it seems likely that the hand-inlined code is already quite optimized in some places.
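The two styles being compared look roughly like this (the loop bodies are illustrative stand-ins, not the actual MDCT kernels):

```csharp
static void IndexStyle(Span<float> Fout)
{
    // The span stays fixed and only an integer index moves,
    // which mirrors how the C code increments raw pointers.
    for (int Fout_ptr = 0; Fout_ptr < Fout.Length; Fout_ptr++)
        Fout[Fout_ptr] *= 2f;
}

static void SliceStyle(Span<float> Fout)
{
    // Each iteration constructs a new span header (pointer + length);
    // the JIT does not always fold this back into simple pointer math.
    while (!Fout.IsEmpty)
    {
        Fout[0] *= 2f;
        Fout = Fout.Slice(1);
    }
}
```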

lostromb avatar Oct 01 '21 00:10 lostromb

> Performance-wise, the current build runs about 40-50% as fast as its equivalent libopus build, mostly due to the lack of stack arrays and vectorized intrinsics in managed languages.

How ironic, I was going to file this issue too. As for C#, this is no longer the case. Also, I think Tanner Gooding knows of some intrinsics to increase perf on this as well. cc @tannergooding (sorry for the ping, Tanner).

AraHaan avatar Mar 15 '22 20:03 AraHaan

> You will probably have to reference the original C code to determine safe fixed-length array sizes since you can't do variable-length stackalloc.

For variable-length stackalloc I would rather rent a `Memory<T>`, if that is currently possible.
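A rental-based sketch of that idea using `ArrayPool<T>` from `System.Buffers` (whether it actually beats `stackalloc` here is untested; the method and parameter names are illustrative):

```csharp
using System;
using System.Buffers;

static void ProcessVariableLength(int order)
{
    // Rent a buffer at least 'order' elements long instead of
    // allocating; the returned array may be larger than requested.
    short[] rented = ArrayPool<short>.Shared.Rent(order);
    try
    {
        Span<short> buf = rented.AsSpan(0, order);
        buf.Clear(); // rented buffers are not zeroed
        // ... use 'buf' as the temporary workspace ...
    }
    finally
    {
        ArrayPool<short>.Shared.Return(rented);
    }
}
```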

AraHaan avatar Mar 15 '22 20:03 AraHaan