Audio: Use generic saturation logic for improved efficiency
Introduce bitwise arithmetic saturation functions for the different integer sizes (8, 16, 24 and 32 bits) used in the processing block.
- Replaced if-else checks with bitwise masks to handle overflow and underflow in the int32, int24, int16 and int8 saturation functions.
- For each bit width, created masks by shifting the results of the comparisons (x - MIN_VAL) and (MAX_VAL - x) down to the respective overflow/underflow detection bits.
- Mask >> 31 or >> 63 results in either 0 (no overflow/underflow) or -1 (overflow/underflow occurred).
- Applied bitwise operations to conditionally replace original values with the boundary values (MIN or MAX) based on the computed masks.
- This approach avoids branching, improves efficiency, and ensures accurate saturation within the specified bit-width limits.
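To illustrate the mask construction the bullets above describe, here is a minimal sketch of the 16-bit case. The function name and constants are my own for illustration, not the code in the commit, and it assumes the input is well within int32 range (e.g. a sum of two 16-bit samples), so the subtractions cannot overflow:

```c
#include <stdint.h>

#define SAT16_MAX 32767
#define SAT16_MIN (-32768)

/* Branch-free 16-bit saturation: each comparison result is shifted
 * down to the sign bit, yielding 0 (in range) or -1 (out of range),
 * which then blends in the boundary value. Assumes arithmetic right
 * shift of negative values, as the commit's code does. */
static inline int16_t sat_int16_sketch(int32_t x)
{
	int32_t mask_over = (SAT16_MAX - x) >> 31;  /* -1 if x > 32767  */
	int32_t mask_under = (x - SAT16_MIN) >> 31; /* -1 if x < -32768 */

	x = (x & ~mask_over) | (SAT16_MAX & mask_over);
	x = (x & ~mask_under) | (SAT16_MIN & mask_under);
	return (int16_t)x;
}
```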
There are two commits from your previous PRs included in this one; only the third commit is relevant. I don't think there's any dependency, so you can leave them out of this PR.
The optimisation attempt seems to have failed.
The performance gap has been addressed with the most recent modification. Using the Cycles (Original) Before and Cycles (Modified) After columns for the performance cycle counts, the overall result is that there is no performance difference across 32-, 24-, 16- and 8-bit saturation.
Here's a summary of the cycles before and after the modification:
sat_int32: No change in cycle count (19 cycles before and after)
sat_int24: No change in cycle count (16 cycles before and after)
sat_int16: No change in cycle count (13 cycles before and after)
sat_int8: No change in cycle count (13 cycles before and after)
ACE-3
There's a gain of 1 cycle: sat_int32 decreased from 21 to 20 cycles.
Generic HiFi5
Generic HiFi4
Generic HiFi3
There's a gain of 1 cycle: sat_int32 decreased from 21 to 20 cycles.
@ShriramShastry this is the sort of data we need for every change, as it's not really clear from all the tables (there is too much data). We just need to know the before and after in a simplified form like your comment above.
Here's a summary of the cycles before and after the modification:
sat_int32: No change (20 cycles both before and after)
sat_int24: Increased from 16 to 21 cycles
sat_int16: Increased from 13 to 21 cycles
sat_int8: Increased from 13 to 20 cycles
Ok, so this looks like a small gain for int32 but a bigger loss for 24, 16 and 8, which means the PR should not be merged until we can show no loss for any format. Do you have a plan to rework?
I have reworked the sat_int24, sat_int16 and sat_int8 code and addressed the performance loss. With the latest check-in there is no performance difference between Before and After my modification.
I've added the latest results here: https://github.com/thesofproject/sof/pull/9170#issuecomment-2211083463. Please take a look.
Thank you.
so, there's no need for this PR then. We don't make changes just because they don't make things worse. We only make changes if they make something better.
Thank you for your feedback on the recent changes. I would like to address your concerns regarding performance.
Summary of Changes
The performance evaluation indicates that there is no observable performance loss due to the recent changes. Below is a summary of results from the performance evaluation:
Benefits of the Optimization
The aim of this optimization was not solely to enhance performance but also to improve code reliability and maintainability. Here are the key benefits of the bitwise approach:
Consistent Performance:
There is no performance degradation; the cycle counts before and after the optimization remain the same across all tested values.
Branch Prediction:
The use of bitwise operations eliminates branches, which can reduce the risk of branch mispredictions and enhance predictability in pipeline performance.
Code Clarity and Maintenance:
The bitwise approach makes the code more compact and potentially less error-prone, improving readability and future maintainability.
Future-proofing:
Branch-free code is often favored in modern compiler and processor optimizations, which might yield further advantages on future architectures.
Conclusion
To summarize, the optimization has no negative impact on performance, maintains the same cycle counts as the original implementation, and provides additional long-term benefits in terms of code quality and predictability. Therefore, this change does not degrade performance and is a worthwhile improvement.
To summarize, the optimization has no negative impact on performance, maintains the same cycle counts as the original implementation
Then you need to update the commit title and commit message. As they stand:
This approach avoids branching, improves efficiency, and ensures accurate saturation within specified bit-width limits.
Was there some accuracy issue that this is fixing?
This seems missing from the commit message too:
which might yield further advantages on future architectures.
If this is just a code clean-up then say "more readable code" and don't mention performance at all. If it's more than a clean-up then you need some data or a reference. EDIT: it's not a clean-up at all, the new code is much harder to follow.
I did a performance comparison of git main vs. main + #9170 with a TGL build with gcc and xcc. The gcc build currently can't run with DRC enabled (reason unknown, a regression or just too much load), so I used different configurations for hw:0,0 playback on sof-hda-generic-4ch.tplg.
GCC (gain 1.1 44, gain 2.1 44, EQIIR eq_iir_flat.txt, EQFIR eq_fir_flat.txt, DRC passthrough.txt off)
- main 69.8 MCPS
- 9170 74.2 MCPS
XCC (gain 1.1 44, gain 2.1 44, EQIIR eq_iir_spk.txt, EQFIR eq_fir_spk.txt, DRC threshold_-25_knee_15_ratio_10.txt on)
- main 96.8 MCPS
- 9170 96.8 MCPS
I'll see if I can repeat this with some remote MTL device. But this doesn't look good for TGL.
Similar result with MTL: no improvement with the xcc build, and likewise about 4 MCPS slower with the gcc build of this PR.
So... I hate to say it but I'm pretty sure this change is just a big noop to the optimizer. Here's some test code that extracts the two variants in individually compilable form, and a just-non-trivial-enough outer loop to let the compiler try to use the code in a natural setting. And the generated code from xt-clang (RI-2022.10, at -Os) is essentially identical. Both variants reduce to SALTU instructions. There is no branching in the current code nor masking in the new one; the compiler is smart enough to recognize what's happening.
#define INT_MAX 0x7fffffff
#define INT_MIN (-INT_MAX - 1) /* a bare 0x80000000 is an unsigned constant in C */
static inline int sat32_cmp(long long x)
{
return x > INT_MAX ? INT_MAX : (x < INT_MIN ? INT_MIN : x);
}
static inline int sat32_mask(long long x)
{
long long mask_overflow = (INT_MAX - x) >> 63;
long long mask_underflow = (x - INT_MIN) >> 63;
x = (x & ~mask_overflow) | (INT_MAX & mask_overflow);
x = (x & ~mask_underflow) | (INT_MIN & mask_underflow);
return (int) x;
}
void sat_array_add_cmp(int *dst, int *a, int *b, int n)
{
for (int i = 0; i < n; i++) {
dst[i] = sat32_cmp(a[i] + (long long) b[i]);
}
}
void sat_array_add_mask(int *dst, int *a, int *b, int n)
{
for (int i = 0; i < n; i++) {
dst[i] = sat32_mask(a[i] + (long long) b[i]);
}
}
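Not part of the original test code, but a quick self-contained agreement check for the two int32 variants may be useful. The `_chk` names and `SAT_MAX`/`SAT_MIN` macros here are my own (the minimum is written as a signed expression because a bare 0x80000000 is an unsigned constant in C), and the inputs stay within the int32 + int32 sum range the outer loops above exercise:

```c
#include <assert.h>

#define SAT_MAX 0x7fffffff
#define SAT_MIN (-SAT_MAX - 1) /* signed form of -2147483648 */

/* Branch-based reference, same shape as sat32_cmp above */
static inline int sat32_cmp_chk(long long x)
{
	return x > SAT_MAX ? SAT_MAX : (x < SAT_MIN ? SAT_MIN : x);
}

/* Mask-based variant, same shape as sat32_mask above; assumes
 * arithmetic right shift of negative values */
static inline int sat32_mask_chk(long long x)
{
	long long mask_overflow = (SAT_MAX - x) >> 63;
	long long mask_underflow = (x - SAT_MIN) >> 63;

	x = (x & ~mask_overflow) | (SAT_MAX & mask_overflow);
	x = (x & ~mask_underflow) | (SAT_MIN & mask_underflow);
	return (int)x;
}

/* Check that both variants agree on in-range, overflow and
 * underflow inputs, including the exact boundary values. */
static void sat32_check_agreement(void)
{
	const long long in[] = { 0, 1, -1, SAT_MAX, SAT_MIN,
				 (long long)SAT_MAX + 1, (long long)SAT_MIN - 1,
				 3000000000LL, -3000000000LL };

	for (unsigned i = 0; i < sizeof(in) / sizeof(in[0]); i++)
		assert(sat32_cmp_chk(in[i]) == sat32_mask_chk(in[i]));
}
```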
And indeed, gcc[1] is not smart enough to figure it out, and emits the code more or less as written. But... honestly my read is that the branch-based code is better and not worse. It's a lot fewer instructions, and Xtensa pipelines are really short in practice. Certainly I've never bumped up against branch stalls as a performance case myself; most branches are very cheap.
[1] Also a clang I have built from the Espressif LLVM tree that I've been playing with. It actually appears to build Zephyr for SOF targets pretty well, producing if anything slightly better code than gcc, and all the variants can go in one toolchain instead of N. I need to get that cleaned up and submitted somewhere, with a writeup for Zephyr...
I used these compiler versions for the test:
- TGL gcc: xtensa-intel_tgl_adsp_zephyr-elf-gcc (Zephyr SDK 0.16.4) 12.2.0
- MTL gcc: xtensa-intel_ace15_mtpm_zephyr-elf-gcc (Zephyr SDK 0.16.4) 12.2.0
- TGL xcc: xt-xcc version 12.0.8, RG-2017.8-linux, cavs2x_LX6HiFi3_2017_8
- MTL xcc: XtensaTools-14.10 clang version 10.0.1, RI-2022.10-linux, ace10_LX7HiFi4_2022_10
Detailed results:
Algorithm settings used for xcc
amixer cset name='Post Mixer Analog Playback Volume' 44
amixer cset name='Pre Mixer Analog Playback Volume' 44
sof-ctl -n 34 -s ~/ctl4/eq_iir_spk.txt
sof-ctl -n 35 -s ~/ctl4/eq_fir_spk.txt
sof-ctl -n 36 -s ~/ctl4/drc/threshold_-25_knee_15_ratio_10.txt
amixer cset name='Post Mixer Analog Playback DRC switch' on
Settings for gcc
amixer cset name='Post Mixer Analog Playback Volume' 44
amixer cset name='Pre Mixer Analog Playback Volume' 44
sof-ctl -n 34 -s ~/ctl4/eq_iir_flat.txt
sof-ctl -n 35 -s ~/ctl4/eq_fir_flat.txt
sof-ctl -n 36 -s ~/ctl4/drc/passthrough.txt
amixer cset name='Post Mixer Analog Playback DRC switch' off
(sorry, I should have used those long names instead of numids with sof-ctl; in TGL and MTL the numids are not the same, while the long string names are)
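For anyone reproducing this on another platform, the numid-to-name mapping can be listed with plain ALSA tooling (the card index 0 here is an assumption; adjust for your device):

```shell
# List every control on card 0 with its numid and name; the name
# strings are stable across TGL and MTL while the numids are not.
amixer -c 0 controls

# Example: read a control by its long name instead of a numid
amixer -c 0 cget name='Post Mixer Analog Playback DRC switch'
```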
You can get the front-end version with -dumpversion
RG-2017.8-linux/XtensaTools/bin/xt-xcc -dumpversion (gcc frontend)
4.2.0
RI-2022.10-linux/XtensaTools/bin/xt-clang -dumpversion (clang frontend)
10.0.1
@ShriramShastry I assume this PR has been split up and re-created as smaller PRs? If so, we should close this one and focus on the new PRs.