druntime icon indicating copy to clipboard operation
druntime copied to clipboard

Rewrite `core.checkedint` functions.

Open denis-sh opened this issue 10 years ago • 29 comments

Use fast and simple sign comparison instead of long, slow and complicated value comparisons.

Also remove reference as it doesn't apply anymore.

P.S.

I planned to rewrite this module implementation since its creation because it's clearly written in a hurry (e.g. see r < x || r < y check in addu), sorry this took that long to get to it. IMHO this is one of the modules that have to show source code readers how simple, robust and efficient code should be written in D.

denis-sh avatar Sep 22 '15 16:09 denis-sh

Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.

MartinNowak avatar Sep 24 '15 21:09 MartinNowak

Why all the style changes? And because you specifically mention the module as a potential style showcase: I don't think we should encourage the use of e.g. bitwise not for boolean values, nor relying on operator precedence wherever possible.

Also, you should add your name after Walter's. He is still the primary author of the API.

dnadlinger avatar Sep 24 '15 21:09 dnadlinger

Looks good, maybe add a few microbenchmarks to

OK, I will able to do it probably tomorrow.

Why all the style changes?

Do you mean const additions, spaces after cast(X), and ulong(x) * ulong(y) to cast(ulong) x * y changes? If so, what exactly is the problem here?

And because you specifically mention the module as a potential style showcase: I don't think we should encourage the use of e.g. bitwise not for boolean values,

No boolean values are used. All types are size_t.

nor relying on operator precedence wherever possible.

Hm, I don't see any complicated cases in my code, the only thing used are a & b || c && d (it's known logical operators has lower priority compared to arithmetic/bitwise) and a >>> b + c (C++'s cin/cout taught us about these cases). Just a note: In contrast, previous file version had this: && ((r == y)? r : (r / x) != y), where the only probably needed for clarity parantheses pair is around r / x != y and is missed but three useless pairs are added just to complicate expression reading as (r == y) tells the reader ?: evaluates before == and then the reader may think the last != y compares the result of the whole previous text with y.

Also, you should add your name after Walter's. He is still the primary author of the API.

Sorry, I supposed it was alphabetically sorted for some reason. Fixed.

denis-sh avatar Sep 25 '15 08:09 denis-sh

As noted in the reference that was elided by this PR (and should be reinstated), doing these checks is tricky because of the ways some compilers generate code. The way I wrote the code was to avoid various code generators invalidating the results.

I'm not saying your method is wrong, but I do suggest a careful review of the unittests to ensure that it really does work correctly and is tested thoroughly.

WalterBright avatar Sep 25 '15 09:09 WalterBright

...doing these checks is tricky because of the ways some compilers generate code. The way I wrote the code was to avoid various code generators invalidating the results.

Do you mean there are code generation bugs and these bugs should be considered in every module we write? Because I don't see when this pull rely on some concrete implementation of UB case.

denis-sh avatar Sep 27 '15 08:09 denis-sh

return cast(size_t) (n >>> T.sizeof * 8 - 1);

Many compilers generate very good code for bool b = a < b;. I suggest that instead of the shift.

size_t _differentSign(T)(in T a, in T b)

should return a bool

Do you suppose _signBit should be n => n < 0 instead of n => cast(size_t) (n >>> T.sizeof * 8 - 1) and _differentSign should be (a, b) => a._signBit != b._signBit instead of (a, b) => a._signBit ^ b._signBit?

denis-sh avatar Sep 27 '15 08:09 denis-sh

  • Improved the original version by using (a ^ b)._signBit instead of a._signBit ^ b._signBit to save one shift operation.
  • Restored reference to silly blog post (still against this change).

Do you suppose _signBit should be n => n < 0 instead of n => cast(size_t) (n >>> T.sizeof * 8 - 1) and _differentSign should be (a, b) => a._signBit != b._signBit instead of (a, b) => a._signBit ^ b._signBit?

  • Added commit which rewrites implementation this way. The generated code still looks pretty good on GDC (just uses two conditional jumps instead of one) and very good on Clang ((a < 0) != (b < 0) is fully optimized but still there are two conditional jumps for a && b construction) (and sorry, didn't test LDC, only Clang).

denis-sh avatar Sep 27 '15 11:09 denis-sh

The generated code still looks pretty good on GDC (just uses two conditional jumps instead of one)

Such a shame that GDC ignores these function bodies then. ;-)

ibuclaw avatar Sep 27 '15 11:09 ibuclaw

LDC won't use those function bodies either, except on platforms where there is no native instruction (e.g. long on x86).

dnadlinger avatar Sep 28 '15 09:09 dnadlinger

This version of core.checkedint still passes my preciseops test suite, which is considerably more thorough than the druntime unittests.

Performance-wise, I find it a bit slower with DMD than the old version was.

tsbockman avatar Sep 29 '15 14:09 tsbockman

Performance-wise, I find it a bit slower with DMD than the old version was.

With DMD, could you check the first commit (the one without the use of logical operations) please?

denis-sh avatar Sep 29 '15 14:09 denis-sh

(I should note that my tests assume no one has gotten around to implementing intrinsics for core.checkedint on DMD yet; I need to retest if this is incorrect.)

The bitwise version was slightly slower than the logical version, even after I optimized it by combining some unnecessary shift operations.

tsbockman avatar Sep 29 '15 15:09 tsbockman

The bitwise version was slightly slower than the logical version, even after I optimized it by combining some unnecessary shift operations.

What functions exactly are you measuring? Is adds for long slower? I.e. is ~cast(size_t)((x ^ y) >>> 63) & cast(size_t)((x ^ r) >>> 63) slower than x < 0 && y < 0 && r >= 0 || x >= 0 && y >= 0 && r < 0?

denis-sh avatar Sep 29 '15 16:09 denis-sh

I am running the benchmark found in main.d of my CheckedInt project. It uses checked arithmetic to do various make-work calculations like prime finding, integer square root, the collatz sequence, and quicksort.

It was designed to test the performance of my checked integer type, but in the process it exercises all the contents of core.checkedint. On DMD, most of the runtime is taken up by other things. But, when holding other factors constant, the performance differences between various core.checkedint implementations are discernible.

I am not really a big fan of micro-benchmarks, because it is so hard to get them right. If you really want, I can take a stab at breaking things down in more detail later. However, I think it's probably a waste of time; the old core.checkedint implementation is already fast enough that, with a good optimizer (by which I mean GDC), the performance boost from enabling intrinsics is barely noticeable.

Compared to bitwise equivalents, the kind of test logic used in the old implementation looks slow, but in reality it is highly amenable to optimization and superscalar execution.

If this pull request is accepted, it should be for some reason other than runtime performance.

tsbockman avatar Sep 29 '15 16:09 tsbockman

If this pull request is accepted, it should be for some reason other than runtime performance.

Agree. I tried to improve the whole module implementation style.

denis-sh avatar Sep 29 '15 17:09 denis-sh

This is a && b / c != d, parenthesis really does help make it clear at a glance just exactly what it is doing.

Readded.

I would really rather put the && at the end of the first line

Changed as I'm probably the only one here using such style and ignoring Code Complete.

denis-sh avatar Sep 29 '15 17:09 denis-sh

I am not really a big fan of micro-benchmarks

Let's please add them anyhow. Number of instructions and latency is a good proxy, though you can hardly measure the impact on the branch cache.

MartinNowak avatar Sep 29 '15 17:09 MartinNowak

Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.

Not sure about benchmarks. For now I only can describe resulting machine code.

This is what Clang does (one can see full (a < 0) != (b < 0) optimization applied):

adds(int, int, bool*):
    lea eax, [rsi + rdi]
    xor esi, edi
    js  .LBB0_3
    xor edi, eax
    jns .LBB0_3
    mov byte ptr [rdx], 1
.LBB0_3:
    ret

and this is what GCC does (just direct shifting and comparison, looks rather slow):

adds(int, int, bool*):
    leal    (%rdi,%rsi), %eax
    shrl    $31, %esi
    shrl    $31, %edi
    cmpb    %dil, %sil
    jne .L4
    movl    %eax, %ecx
    shrl    $31, %ecx
    cmpb    %cl, %sil
    je  .L4
    movb    $1, (%rdx)
.L4:
    rep ret

So yes, one have to use (a ^ b)._signBit or GCC will not optimize it (but if one does use ^, GCC will produce the same assembly as Clang's shown above).

denis-sh avatar Sep 29 '15 17:09 denis-sh

@MartinNowak

Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.

Is there an equivalent standard benchmark for Phobos somewhere?

tsbockman avatar Sep 29 '15 17:09 tsbockman

~diffSign(x, y) && diffSign(x, r) is a wrong code if diffSign returns 1.

~~Further looking at resulting assembly is even more interesting:~~

~~~diffSign(x, y) && diffSign(x, r) gives us this with with Clang:~~

adds(int, int, bool*):
    add esi, edi               # looks unoptimal, could use EAX via lea
    xor edi, esi
    jns .LBB0_2
    mov byte ptr [rdx], 1
.LBB0_2:
    mov eax, esi               # as EAX wasn't used
    ret

~~and this with GCC (if (a ^ b)._signBit is used of course):~~

adds(int, int, bool*):
    lea eax, [rsi+rdi]
    xor edi, eax
    jns .L4
    mov BYTE PTR [rdx], 1
.L4:
    rep ret

~~GCC is the "winner" here. The instruction count winner who can't optimize (a < 0) != (b < 0).~~

~~Usage of ! instead of ~ or & instead of && increases instruction count, sometimes a lot.~~

denis-sh avatar Sep 29 '15 17:09 denis-sh

~~> Further looking at resulting assembly is even more interesting:~~

~~Good short assembly. Good except it's incorrect. Looks like it didn't take much time too find codegen bug.~~

denis-sh avatar Sep 29 '15 17:09 denis-sh

Further looking at resulting assembly is even more interesting:

Good short assembly. Good except it's incorrect. Looks like it didn't take much time too find codegen bug.

Sorry for the noise, just my example code was incorrect. ~1 isn't zero.

denis-sh avatar Sep 29 '15 18:09 denis-sh

Changed the first commit to use sign-extending shift to make ~diffSign(x, y) && diffSign(x, r) construction valid too because otherwise there should be an warning comment and I don't want to add one just for the first commit.

denis-sh avatar Sep 29 '15 18:09 denis-sh

Not sure about benchmarks.

Will try to add one in a couple of days.

denis-sh avatar Sep 29 '15 18:09 denis-sh

Is there an equivalent standard benchmark for Phobos somewhere?

Not that I'm aware of, but this PR is druntime code anyhow.

MartinNowak avatar Sep 29 '15 19:09 MartinNowak

@MartinNowak

Not that I'm aware of, but this PR is druntime code anyhow.

Sure. I was just wondering because it would be relevant to something else I'm working on.

tsbockman avatar Sep 29 '15 20:09 tsbockman

Ping @denis-sh

DmitryOlshansky avatar Mar 21 '16 09:03 DmitryOlshansky

Ping @denis-sh

version block styling issue or request for benchmark?

denis-sh avatar Mar 21 '16 18:03 denis-sh

version block styling issue or request for benchmark?

I think benchmarking is more important and was your last intention.

DmitryOlshansky avatar Mar 21 '16 18:03 DmitryOlshansky