
Audio: Use generic saturation logic for improved efficiency

Open ShriramShastry opened this issue 1 year ago • 4 comments

Introduce branch-free, bitwise implementations of the saturation functions for the different integer sizes (8, 16, 24, 32 bits) in the processing block.

  • Replaced if-else checks with bitwise masks to handle overflow and underflow for int32, int24, int16, and int8 saturation functions.

  • For each bit width, created masks by shifting the results of comparisons (x - MIN_VAL) and (MAX_VAL - x) to the respective overflow/underflow detection bits.

  • Mask >> 31 or >> 63 results in either 0 (no overflow/underflow) or -1 (overflow/underflow occurred).

  • Applied bitwise operations to conditionally replace original values with boundary values (MIN or MAX) based on the computed masks.

  • This approach avoids branching, improves efficiency, and ensures accurate saturation within specified bit-width limits.
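
For illustration, here is a minimal sketch of the mask-based idea for the 16-bit case. The function name sat_int16_sketch and the use of a 64-bit intermediate are assumptions for this example, not necessarily the exact code in the PR:

#include <stdint.h>

/* Sketch only: mask-based saturation of a 32-bit value into the int16 range.
 * Relies on arithmetic right shift of negative values, which the gcc and
 * xt-clang toolchains discussed here provide.
 */
static inline int16_t sat_int16_sketch(int32_t x)
{
	int64_t v = x;
	/* each difference is negative only on overflow/underflow; >> 63
	 * smears the sign bit into an all-ones (-1) or all-zeros (0) mask
	 */
	int64_t mask_overflow = (INT16_MAX - v) >> 63;
	int64_t mask_underflow = (v - INT16_MIN) >> 63;

	/* pick the boundary value where a mask is -1, keep v otherwise */
	v = (v & ~mask_overflow) | (INT16_MAX & mask_overflow);
	v = (v & ~mask_underflow) | (INT16_MIN & mask_underflow);
	return (int16_t)v;
}

A comparison-based version (x > INT16_MAX ? INT16_MAX : (x < INT16_MIN ? INT16_MIN : x)) computes the same result without masks.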

ShriramShastry avatar May 27 '24 12:05 ShriramShastry

There are two commits from your previous PRs included in this; only the third commit is relevant. I don't think there's any dependency, so you can leave them out of this.

singalsu avatar May 27 '24 15:05 singalsu

optimisation attempt seems to have failed

The performance gap has been addressed with the most recent modification. The Cycles (Original) Before and Cycles (Modified) After columns show the performance cycle counts, and the overall result is that there is no performance difference across 32, 24, 16 and 8-bit saturation.

Here's a summary of the cycles before and after the modification:

sat_int32: No change in cycle count before and after changes, resulting in 19 cycles.
sat_int24: No change in cycle count before and after changes, resulting in 16 cycles.
sat_int16: No change in cycle count before and after changes, resulting in 13 cycles.
sat_int8: No change in cycle count before and after changes, resulting in 13 cycles.

ACE-3

There's a gain of 1 cycle, i.e. sat_int32 decreased from 21 to 20 cycles.

[image: cycle count results]

Generic HiFi5

[image: cycle count results]

Generic HiFi4

[image: cycle count results]

Generic HiFi3

[image: cycle count results]

ShriramShastry avatar Jul 05 '24 15:07 ShriramShastry

There's a gain of 1 cycle, i.e. sat_int32 decreased from 21 to 20 cycles.

@ShriramShastry this is the sort of data we need for every change, as it's not really clear from all the tables (there is too much data). We just need to know the before and after in a simplified form, like your comment above.

Here's a summary of the cycles before and after the modification:

sat_int32: No change (20 cycles both before and after)
sat_int24: Increased from 16 to 21 cycles
sat_int16: Increased from 13 to 21 cycles
sat_int8: Increased from 13 to 20 cycles

Ok, so this looks like a small gain for int32 but a bigger loss for 24, 16 and 8, which means the PR should not be merged until we can show no loss for any format. Do you have a plan to rework?

lgirdwood avatar Jul 05 '24 15:07 lgirdwood

There's a gain of 1 cycle, i.e. sat_int32 decreased from 21 to 20 cycles.

@ShriramShastry this is the sort of data we need for every change, as it's not really clear from all the tables (there is too much data). We just need to know the before and after in a simplified form, like your comment above.

Here's a summary of the cycles before and after the modification:

sat_int32: No change (20 cycles both before and after)
sat_int24: Increased from 16 to 21 cycles
sat_int16: Increased from 13 to 21 cycles
sat_int8: Increased from 13 to 20 cycles

Ok, so this looks like a small gain for int32 but a bigger loss for 24, 16 and 8, which means the PR should not be merged until we can show no loss for any format. Do you have a plan to rework?

I have reworked the sat-24, 16 and 8 code and addressed the performance loss! With the latest check-in there is no performance difference between Before and After my modification.

I've added the latest results here: https://github.com/thesofproject/sof/pull/9170#issuecomment-2211083463. Please take a look.

Thank you.

ShriramShastry avatar Jul 05 '24 16:07 ShriramShastry

I have reworked the sat-24, 16 and 8 code and addressed the performance loss! With the latest check-in there is no performance difference between Before and After my modification.

so, there's no need for this PR then. We don't make changes just because they don't make things worse. We only make changes if they make something better.

lyakh avatar Jul 08 '24 06:07 lyakh

I have reworked the sat-24, 16 and 8 code and addressed the performance loss! With the latest check-in there is no performance difference between Before and After my modification.

so, there's no need for this PR then. We don't make changes just because they don't make things worse. We only make changes if they make something better.

Thank you for your feedback on the recent changes. I would like to address your concerns regarding performance.

Summary of Changes

The performance evaluation indicates that there is no observable performance loss due to the recent changes. Below is a summary of results from the performance evaluation:

[image: performance evaluation results]

Benefits of the Optimization

The aim of this optimization was not solely to enhance performance but also to improve code reliability and maintainability. Here are the key benefits of the bitwise approach:

Consistent Performance:

There is no performance degradation; the cycle counts before and after the optimization remain the same across all tested values.

Branch Prediction:

The use of bitwise operations eliminates branches, which can reduce the risk of branch mispredictions and enhance predictability in pipeline performance.

Code Clarity and Maintenance:

The bitwise approach makes the code more compact and potentially less error-prone, improving readability and future maintainability.

Future-proofing:

Branch-free code is often favored in modern compiler and processor optimizations, which might yield further advantages on future architectures.

Conclusion

To summarize, the optimization has no negative impact on performance, maintains the same cycle counts as the original implementation, and provides additional long-term benefits in terms of code quality and predictability. Therefore, this change does not degrade performance and is a worthwhile improvement.

ShriramShastry avatar Jul 08 '24 09:07 ShriramShastry

To summarize, the optimization has no negative impact on performance, maintains the same cycle counts as the original implementation

Then you need to update the commit title and commit message. As they stand:

This approach avoids branching, improves efficiency, and ensures accurate saturation within specified bit-width limits.

Was there some accuracy issue that this is fixing?

This seems missing from the commit message too:

which might yield further advantages on future architectures.

If this is just a code clean-up then say "more readable code" and don't mention performance at all. If it's more than a clean-up then you need some data or a reference. EDIT: it's not a clean-up at all, the new code is much harder to follow.

marc-hb avatar Jul 08 '24 21:07 marc-hb

I did a performance comparison of git main vs. main + #9170 with a TGL build using gcc and xcc. The gcc build currently can't run with DRC enabled (reason unknown, a regression or just too much load), so I used different configurations for hw:0,0 playback on sof-hda-generic-4ch.tplg.

GCC (gain 1.1 44, gain 2.1 44, EQIIR eq_iir_flat.txt, EQFIR eq_fir_flat.txt, DRC passthrough.txt off)

  • main 69.8 MCPS
  • 9170 74.2 MCPS

XCC (gain 1.1 44, gain 2.1 44, EQIIR eq_iir_spk.txt, EQFIR eq_fir_spk.txt, DRC threshold_-25_knee_15_ratio_10.txt on)

  • main 96.8 MCPS
  • 9170 96.8 MCPS

I'll see if I can repeat this with some remote MTL device. But this doesn't look good for TGL.

singalsu avatar Jul 18 '24 13:07 singalsu

Similar result with MTL: no improvement with the xcc build, and about the same 4 MCPS slower with the gcc build of this PR.

singalsu avatar Jul 18 '24 17:07 singalsu

So... I hate to say it but I'm pretty sure this change is just a big noop to the optimizer. Here's some test code that extracts the two variants in individually compilable form, and a just-non-trivial-enough outer loop to let the compiler try to use the code in a natural setting. And the generated code from xt-clang (RI-2022.10, at -Os) is essentially identical. Both variants reduce to SALTU instructions. There is no branching in the current code nor masking in the new one; the compiler is smart enough to recognize what's happening.

#define INT_MAX 0x7fffffff
#define INT_MIN (-0x7fffffff - 1)	/* avoid the unsigned 0x80000000 literal */

static inline int sat32_cmp(long long x)
{
	return x > INT_MAX ? INT_MAX : (x < INT_MIN ? INT_MIN : x);
}

static inline int sat32_mask(long long x)
{
	/* sign bit of each difference, smeared across the word: 0 or -1 */
	long long mask_overflow = (INT_MAX - x) >> 63;
	long long mask_underflow = (x - INT_MIN) >> 63;

	x = (x & ~mask_overflow) | (INT_MAX & mask_overflow);
	x = (x & ~mask_underflow) | (INT_MIN & mask_underflow);
	return (int) x;
}

void sat_array_add_cmp(int *dst, int *a, int *b, int n)
{
	for (int i = 0; i < n; i++) {
		dst[i] = sat32_cmp(a[i] + (long long) b[i]);
	}
}

void sat_array_add_mask(int *dst, int *a, int *b, int n)
{
	for (int i = 0; i < n; i++) {
		dst[i] = sat32_mask(a[i] + (long long) b[i]);
	}
}
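
For reference, one way to repeat this kind of check (the file names here are placeholders) is to emit assembly at -Os from each toolchain, with the appropriate Xtensa core configuration selected, and compare the code generated for sat_array_add_cmp and sat_array_add_mask:

xt-clang -Os -S sat_test.c -o sat_test_xtclang.s
xtensa-intel_tgl_adsp_zephyr-elf-gcc -Os -S sat_test.c -o sat_test_gcc.s

If the two functions reduce to the same instruction sequence in a given .s file, that compiler has already removed the branches and the mask rewrite is a no-op for it.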

andyross avatar Jul 18 '24 23:07 andyross

And indeed, gcc[1] is not smart enough to figure it out, and emits the code more or less as written. But... honestly my read is that the branch-based code is better and not worse. It's a lot fewer instructions, and Xtensa pipelines are really short in practice. Certainly I've never bumped up against branch stalls as a performance case myself; most branches are very cheap.

[1] Also a clang I built from the Espressif LLVM tree that I've been playing with. It actually appears to build Zephyr for SOF targets pretty well, producing if anything slightly better code than gcc, and all the variants can go in one toolchain instead of N. I need to get that cleaned up and submitted somewhere, with a writeup for Zephyr...

andyross avatar Jul 18 '24 23:07 andyross

I used these compiler versions for the test:

  • TGL gcc: xtensa-intel_tgl_adsp_zephyr-elf-gcc (Zephyr SDK 0.16.4) 12.2.0
  • MTL gcc: xtensa-intel_ace15_mtpm_zephyr-elf-gcc (Zephyr SDK 0.16.4) 12.2.0
  • TGL xcc: xt-xcc version 12.0.8, RG-2017.8-linux, cavs2x_LX6HiFi3_2017_8
  • MTL xcc: XtensaTools-14.10 clang version 10.0.1, RI-2022.10-linux, ace10_LX7HiFi4_2022_10

Detailed results:

[image: Screenshot from 2024-07-19 10-41-25]

singalsu avatar Jul 19 '24 07:07 singalsu

Algorithm settings used for xcc

amixer cset name='Post Mixer Analog Playback Volume' 44
amixer cset name='Pre Mixer Analog Playback Volume' 44
sof-ctl -n 34 -s ~/ctl4/eq_iir_spk.txt 
sof-ctl -n 35 -s ~/ctl4/eq_fir_spk.txt 
sof-ctl -n 36 -s ~/ctl4/drc/threshold_-25_knee_15_ratio_10.txt
amixer cset name='Post Mixer Analog Playback DRC switch' on

Settings for gcc

amixer cset name='Post Mixer Analog Playback Volume' 44
amixer cset name='Pre Mixer Analog Playback Volume' 44
sof-ctl -n 34 -s ~/ctl4/eq_iir_flat.txt 
sof-ctl -n 35 -s ~/ctl4/eq_fir_flat.txt 
sof-ctl -n 36 -s ~/ctl4/drc/passthrough.txt
amixer cset name='Post Mixer Analog Playback DRC switch' off

(sorry, I should have used those long names instead of numids with sof-ctl; in TGL and MTL the numids are not the same, while the long string names are)

singalsu avatar Jul 19 '24 07:07 singalsu

  • TGL gcc: xtensa-intel_tgl_adsp_zephyr-elf-gcc (Zephyr SDK 0.16.4) 12.2.0
  • MTL gcc: xtensa-intel_ace15_mtpm_zephyr-elf-gcc (Zephyr SDK 0.16.4) 12.2.0
  • TGL xcc: xt-xcc version 12.0.8, RG-2017.8-linux, cavs2x_LX6HiFi3_2017_8
  • MTL xcc: XtensaTools-14.10 clang version 10.0.1, RI-2022.10-linux, ace10_LX7HiFi4_2022_10

You can get the front-end version with -dumpversion

RG-2017.8-linux/XtensaTools/bin/xt-xcc  -dumpversion (gcc frontend)
4.2.0

RI-2022.10-linux/XtensaTools/bin/xt-clang -dumpversion (clang frontend)
10.0.1

marc-hb avatar Jul 20 '24 02:07 marc-hb

@ShriramShastry I assume this PR has been split up and re-created into smaller PRs? If so, we should close this and focus on the new PRs.

lgirdwood avatar Aug 14 '24 15:08 lgirdwood