libsamplerate
src_float_to_short_array: Use division instead of shift
@janstary I wonder if this fixes your issue on PowerPC.
No, it won't fix the issue, and it is slower due to the integer division (and harder to understand than the solution proposed in https://github.com/erikd/libsamplerate/issues/84#issuecomment-518963188).
The PowerPC problem is a broken lrint which produces wrong results for large values, as shown in https://github.com/erikd/libsamplerate/issues/65#issuecomment-516454507. Since your change operates on the already wrong value, it cannot become correct.
An example: lrint(INT_MAX) = lrint(2147483647.000000) = -1
Even smaller values run into the same problem: lrint(2^30 - 2) = lrint(1073741822.000000) = -2
slower due to the integer division
On modern CPUs with deep caches, branches are usually considered more expensive than division. If you can prove to me that my solution using division is slower than your solution, I will take your solution.
The PowerPC problem is a broken lrint which
If PowerPC has a broken lrint function then I will accept a PowerPC #if solution, but I will not take a performance hit on this code for x86_64 just because PowerPC's lrint is busted.
Meanwhile, I am still interested if this fixes @janstary's issue.
Haven't we already proved that the output of lrint on his machine is wrong for values definitely used here? How could a change working on that wrong value fix anything?
On modern CPUs with deep caches, branches are usually considered more expensive than division.
Speculative execution mitigates some of that. AND:
If you can prove to me that my solution using division is slower than your solution, I will take your solution.
The proof is simple: `long` on most 64-bit systems is 64 bit: https://godbolt.org/z/HMv8iu
This disables the "clipping optimization" (which just heuristically checks implementation-defined behavior). Hence my code has exactly the same branches and float multiplications, but one fewer shift/division, and is therefore faster.
@janstary I wonder if this fixes your issue on PowerPC.
tests/float_short_test
float_to_short_test .............................
Line 64 : out [1] == 0
I also tested on an arm machine (current OpenBSD/armv7 on a BeagleBone Black). Without this patch, it fails with
float_to_short_test .............................
Line 64 : out [1] == -1
With the patch, it fails with
float_to_short_test .............................
Line 64 : out [1] == 0
The proof is simple:
That's not proof, that's opinion. Proof requires a proper benchmark.
This disables the "clipping optimization" (which just heuristically checks implementation-defined behavior). Hence my code has exactly the same branches
You do realize that for CPUs where it works (e.g. at least x86 and x86_64) the "clipping optimization" results in zero branches, don't you?
That's not proof, that's opinion. Proof requires a proper benchmark.
No, you are missing my point. For `CPU_CLIPS_POSITIVE == 0 && CPU_CLIPS_NEGATIVE == 0` (what I called "disabled clipping optimization"), your code and my code are exactly the same (except for the constants used) and both have (at least) two branches. BUT my code has one operation fewer (either a shift or a division), and hence it is faster. Do you agree so far?
Now examine what the clipping optimization is and when it is enabled:
1. It assumes that for `x >= LONG_MAX`: `lrint(x) == LONG_MAX`, and for `x <= LONG_MIN`: `lrint(x) == LONG_MIN`, which is implementation defined. I know of one implementation (can't find it again right now) which does this by if-checks on the float.
2. It assumes that `long` is 32 bit.
3. Given 1 & 2, it holds that `lrint(x * 8.0 * 0x10000000) == INT32_MAX` for `x >= 1.0` (and similarly for negative numbers).
4. The behavior of 3 is checked at configure time.
My claim: On most architectures the clipping optimization is disabled.
If this is true then on most architectures my code is faster.
Quick check: do you have a "common" 64-bit Linux on an x86_64 architecture? Then please run configure and report the values of CPU_CLIPS_POSITIVE, CPU_CLIPS_NEGATIVE and SIZEOF_LONG. For me those are 0, 0, 8.
Proof for "most architectures":

- The optimization does not work for `sizeof(long) == 8`. One example: for `x == 1` the expression `lrint(x * 8.0 * 0x10000000)` yields `INT32_MAX + 1`. This shifted by 16 results in `INT32_MIN`. Proof: https://godbolt.org/z/zS1K-q
- "Most" 64-bit architectures use the LP64 model where `long` is 64 bit (as opposed to LLP64 where it is 32 bit): https://en.wikipedia.org/wiki/64-bit_computing
  > Many 64-bit platforms today use an LP64 model (including Solaris, AIX, HP-UX, Linux, macOS, BSD, and IBM z/OS). Microsoft Windows uses an LLP64 model.
- Due to `configure` being used, I assume Linux is the main target system, not Windows.
Summary:
- Linux is the target system
- Linux uses the LP64 model with `sizeof(long) == 8`
- `sizeof(long) == 8` disables the clipping optimization
- With the clipping optimization disabled, your code executes one more instruction (hence is slower) and is less readable