
src_float_to_short_array: Use division instead of shift

Open erikd opened this issue 6 years ago • 7 comments

@janstary I wonder if this fixes your issue on PowerPC.

erikd avatar Aug 07 '19 22:08 erikd

No, it won't fix the issue, and it is slower due to the integer division (and harder to understand than the solution proposed in https://github.com/erikd/libsamplerate/issues/84#issuecomment-518963188).

The PowerPC problem is a broken lrint which produces wrong results for large values, as shown in https://github.com/erikd/libsamplerate/issues/65#issuecomment-516454507. Since your change operates on the already-wrong value, it cannot produce a correct result.

An example: lrint(INT_MAX) = lrint(2147483647.000000) = -1
Even smaller values run into the same problem: lrint(2^30 - 2) = lrint(1073741822.000000) = -2

Flamefire avatar Aug 09 '19 10:08 Flamefire

slower due to the integer division

On modern CPUs with deep caches, branches are usually considered more expensive than division. If you can prove to me that my solution using division is slower than your solution, I will take your solution.

The PowerPC problem is a broken lrint which

If PowerPC has a broken lrint function then I will accept a PowerPC #if solution, but I will not take a performance hit in this code on x86_64 just because PowerPC's lrint is busted.

Meanwhile, I am still interested if this fixes @janstary's issue.

erikd avatar Aug 09 '19 11:08 erikd

Meanwhile, I am still interested if this fixes @janstary's issue.

Haven't we already proved that the output of lrint on his machine is wrong for values definitely used here? How could a change working on that wrong value fix anything?

On modern CPUs with deep caches, branches are usually considered more expensive than division.

Speculative execution mitigates some of that. And:

If you can prove to me that my solution using division is slower than your solution, I will take your solution.

The proof is simple: long on most 64-bit systems is 64 bits: https://godbolt.org/z/HMv8iu
This disables the "clipping optimization" (which merely checks implementation-defined behavior heuristically). Hence my code has exactly the same branches and float multiplications, but one less shift/division, and is therefore faster.

Flamefire avatar Aug 09 '19 11:08 Flamefire

@janstary I wonder if this fixes your issue on PowerPC.

tests/float_short_test

        float_to_short_test ............................. 

        Line 64 : out [1] == 0

janstary avatar Aug 09 '19 15:08 janstary

I also tested on an arm machine (current OpenBSD/armv7 on a BeagleBone Black). Without this patch, it fails with

        float_to_short_test ............................. 
        Line 64 : out [1] == -1

With the patch, it fails as

        float_to_short_test ............................. 
        Line 64 : out [1] == 0

janstary avatar Aug 09 '19 16:08 janstary

The proof is simple:

That's not proof, that's opinion. Proof requires a proper benchmark.

This disables the "clipping optimization" (which merely checks implementation-defined behavior heuristically). Hence my code has exactly the same branches

You do realize that for CPUs where it works (e.g. at least x86 and x86_64), the "clipping optimization" results in zero branches, don't you?

erikd avatar Aug 09 '19 22:08 erikd

That's not proof, that's opinion. Proof requires a proper benchmark.

No, you are missing my point. For CPU_CLIPS_POSITIVE == 0 && CPU_CLIPS_NEGATIVE == 0 (what I called "disabled clipping optimization"), your code and my code are exactly the same (except for the constants used) and both have (at least) 2 branches. But my code has one less operation (either a shift or a division), and hence it is faster. Do you agree up to this point?

Now examine what the clipping optimization is and when it is enabled:

  1. It assumes that for x >= LONG_MAX: lrint(x) == LONG_MAX and for x <= LONG_MIN: lrint(x) == LONG_MIN, which is implementation-defined. I know of one implementation (can't find it again right now) that does this via if-checks on the float
  2. It assumes that long is 32 bit
  3. Given 1 & 2, it holds that lrint(x * 8.0 * 0x10000000) == INT32_MAX for x >= 1.0 (and similarly for negative values)
  4. The behavior of 3 is checked at configure time

My claim: On most architectures the clipping optimization is disabled.
If this is true then on most architectures my code is faster.

Quick check: do you have a "common" 64-bit Linux on an x86_64 machine? Then please run configure and report the values of CPU_CLIPS_POSITIVE, CPU_CLIPS_NEGATIVE and SIZEOF_LONG. For me those are 0, 0 and 8.

Proof for "most architectures":

  • The optimization does not work for sizeof(long) == 8. One example: for x == 1 the expression lrint(x * 8.0 * 0x10000000) yields INT32_MAX + 1 instead of saturating at INT32_MAX, so the subsequent shift by 16 produces a wrong, negative result. Proof: https://godbolt.org/z/zS1K-q
  • "Most" 64-bit architectures use the LP64 model, where long is 64 bits (as opposed to LLP64, where it is 32 bits): https://en.wikipedia.org/wiki/64-bit_computing

Many 64-bit platforms today use an LP64 model (including Solaris, AIX, HP-UX, Linux, macOS, BSD, and IBM z/OS). Microsoft Windows uses an LLP64 model.

  • Since configure is used, I assume Linux is the main target system, not Windows.

Summary:

  1. Linux is the target system
  2. Linux uses the LP64 model, where long is 64 bits
  3. A 64-bit long disables the clipping optimization
  4. With the clipping optimization disabled, your code executes one more instruction (hence is slower) and is less readable

Flamefire avatar Aug 10 '19 10:08 Flamefire