druntime Rewrite `core.checkedint` functions.

Use fast and simple sign comparison instead of long, slow and complicated value comparisons.

Also remove reference as it doesn't apply anymore.

P.S.

I planned to rewrite this module implementation since its creation because it's clearly written in a hurry (e.g. see r < x || r < y check in addu), sorry this took that long to get to it. IMHO this is one of the modules that have to show source code readers how simple, robust and efficient code should be written in D.

Sep 22 '15 16:09 denis-sh

Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.

Sep 24 '15 21:09 MartinNowak

Why all the style changes? And because you specifically mention the module as a potential style showcase: I don't think we should encourage the use of e.g. bitwise not for boolean values, nor relying on operator precedence wherever possible.

Also, you should add your name after Walter's. He is still the primary author of the API.

Sep 24 '15 21:09 dnadlinger

Looks good, maybe add a few microbenchmarks to

OK, I will able to do it probably tomorrow.

Why all the style changes?

Do you mean const additions, spaces after cast(X), and ulong(x) * ulong(y) to cast(ulong) x * y changes? If so, what exactly is the problem here?

And because you specifically mention the module as a potential style showcase: I don't think we should encourage the use of e.g. bitwise not for boolean values,

No boolean values are used. All types are size_t.

nor relying on operator precedence wherever possible.

Hm, I don't see any complicated cases in my code, the only thing used are a & b || c && d (it's known logical operators has lower priority compared to arithmetic/bitwise) and a >>> b + c (C++'s cin/cout taught us about these cases). Just a note: In contrast, previous file version had this: && ((r == y)? r : (r / x) != y), where the only probably needed for clarity parantheses pair is around r / x != y and is missed but three useless pairs are added just to complicate expression reading as (r == y) tells the reader ?: evaluates before == and then the reader may think the last != y compares the result of the whole previous text with y.

Also, you should add your name after Walter's. He is still the primary author of the API.

Sorry, I supposed it was alphabetically sorted for some reason. Fixed.

Sep 25 '15 08:09 denis-sh

As noted in the reference that was elided by this PR (and should be reinstated), doing these checks is tricky because of the ways some compilers generate code. The way I wrote the code was to avoid various code generators invalidating the results.

I'm not saying your method is wrong, but I do suggest a careful review of the unittests to ensure that it really does work correctly and is tested thoroughly.

Sep 25 '15 09:09 WalterBright

...doing these checks is tricky because of the ways some compilers generate code. The way I wrote the code was to avoid various code generators invalidating the results.

Do you mean there are code generation bugs and these bugs should be considered in every module we write? Because I don't see when this pull rely on some concrete implementation of UB case.

Sep 27 '15 08:09 denis-sh

return cast(size_t) (n >>> T.sizeof * 8 - 1);

Many compilers generate very good code for bool b = a < b;. I suggest that instead of the shift.

size_t _differentSign(T)(in T a, in T b)

should return a bool

Do you suppose _signBit should be n => n < 0 instead of n => cast(size_t) (n >>> T.sizeof * 8 - 1) and _differentSign should be (a, b) => a._signBit != b._signBit instead of (a, b) => a._signBit ^ b._signBit?

Sep 27 '15 08:09 denis-sh

Improved the original version by using (a ^ b)._signBit instead of a._signBit ^ b._signBit to save one shift operation.
Restored reference to silly blog post (still against this change).

Do you suppose _signBit should be n => n < 0 instead of n => cast(size_t) (n >>> T.sizeof * 8 - 1) and _differentSign should be (a, b) => a._signBit != b._signBit instead of (a, b) => a._signBit ^ b._signBit?

Added commit which rewrites implementation this way. The generated code still looks pretty good on GDC (just uses two conditional jumps instead of one) and very good on Clang ((a < 0) != (b < 0) is fully optimized but still there are two conditional jumps for a && b construction) (and sorry, didn't test LDC, only Clang).

Sep 27 '15 11:09 denis-sh

The generated code still looks pretty good on GDC (just uses two conditional jumps instead of one)

Such a shame that GDC ignores these function bodies then. ;-)

Sep 27 '15 11:09 ibuclaw

LDC won't use those function bodies either, except on platforms where there is no native instruction (e.g. long on x86).

Sep 28 '15 09:09 dnadlinger

This version of core.checkedint still passes my preciseops test suite, which is considerably more thorough than the druntime unittests.

Performance-wise, I find it a bit slower with DMD than the old version was.

Sep 29 '15 14:09 tsbockman

Performance-wise, I find it a bit slower with DMD than the old version was.

With DMD, could you check the first commit (the one without the use of logical operations) please?

Sep 29 '15 14:09 denis-sh

(I should note that my tests assume no one has gotten around to implementing intrinsics for core.checkedint on DMD yet; I need to retest if this is incorrect.)

The bitwise version was slightly slower than the logical version, even after I optimized it by combining some unnecessary shift operations.

Sep 29 '15 15:09 tsbockman

The bitwise version was slightly slower than the logical version, even after I optimized it by combining some unnecessary shift operations.

What functions exactly are you measuring? Is adds for long slower? I.e. is ~cast(size_t)((x ^ y) >>> 63) & cast(size_t)((x ^ r) >>> 63) slower than x < 0 && y < 0 && r >= 0 || x >= 0 && y >= 0 && r < 0?

Sep 29 '15 16:09 denis-sh

I am running the benchmark found in main.d of my CheckedInt project. It uses checked arithmetic to do various make-work calculations like prime finding, integer square root, the collatz sequence, and quicksort.

It was designed to test the performance of my checked integer type, but in the process it exercises all the contents of core.checkedint. On DMD, most of the runtime is taken up by other things. But, when holding other factors constant, the performance differences between various core.checkedint implementations are discernible.

I am not really a big fan of micro-benchmarks, because it is so hard to get them right. If you really want, I can take a stab at breaking things down in more detail later. However, I think it's probably a waste of time; the old core.checkedint implementation is already fast enough that, with a good optimizer (by which I mean GDC), the performance boost from enabling intrinsics is barely noticeable.

Compared to bitwise equivalents, the kind of test logic used in the old implementation looks slow, but in reality it is highly amenable to optimization and superscalar execution.

If this pull request is accepted, it should be for some reason other than runtime performance.

Sep 29 '15 16:09 tsbockman

If this pull request is accepted, it should be for some reason other than runtime performance.

Agree. I tried to improve the whole module implementation style.

Sep 29 '15 17:09 denis-sh

This is a && b / c != d, parenthesis really does help make it clear at a glance just exactly what it is doing.

Readded.

I would really rather put the && at the end of the first line

Changed as I'm probably the only one here using such style and ignoring Code Complete.

Sep 29 '15 17:09 denis-sh

I am not really a big fan of micro-benchmarks

Let's please add them anyhow. Number of instructions and latency is a good proxy, though you can hardly measure the impact on the branch cache.

Sep 29 '15 17:09 MartinNowak

Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.

Not sure about benchmarks. For now I only can describe resulting machine code.

This is what Clang does (one can see full (a < 0) != (b < 0) optimization applied):

adds(int, int, bool*):
    lea eax, [rsi + rdi]
    xor esi, edi
    js  .LBB0_3
    xor edi, eax
    jns .LBB0_3
    mov byte ptr [rdx], 1
.LBB0_3:
    ret

and this is what GCC does (just direct shifting and comparison, looks rather slow):

adds(int, int, bool*):
    leal    (%rdi,%rsi), %eax
    shrl    $31, %esi
    shrl    $31, %edi
    cmpb    %dil, %sil
    jne .L4
    movl    %eax, %ecx
    shrl    $31, %ecx
    cmpb    %cl, %sil
    je  .L4
    movb    $1, (%rdx)
.L4:
    rep ret

So yes, one have to use (a ^ b)._signBit or GCC will not optimize it (but if one does use ^, GCC will produce the same assembly as Clang's shown above).

Sep 29 '15 17:09 denis-sh

@MartinNowak

Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.

Is there an equivalent standard benchmark for Phobos somewhere?

Sep 29 '15 17:09 tsbockman

~diffSign(x, y) && diffSign(x, r) is a wrong code if diffSign returns 1.

~~Further looking at resulting assembly is even more interesting:~~

~~~diffSign(x, y) && diffSign(x, r) gives us this with with Clang:~~

adds(int, int, bool*):
    add esi, edi               # looks unoptimal, could use EAX via lea
    xor edi, esi
    jns .LBB0_2
    mov byte ptr [rdx], 1
.LBB0_2:
    mov eax, esi               # as EAX wasn't used
    ret

~~and this with GCC (if (a ^ b)._signBit is used of course):~~

adds(int, int, bool*):
    lea eax, [rsi+rdi]
    xor edi, eax
    jns .L4
    mov BYTE PTR [rdx], 1
.L4:
    rep ret

~~GCC is the "winner" here. The instruction count winner who can't optimize (a < 0) != (b < 0).~~

~~Usage of ! instead of ~ or & instead of && increases instruction count, sometimes a lot.~~

Sep 29 '15 17:09 denis-sh

~~> Further looking at resulting assembly is even more interesting:~~

~~Good short assembly. Good except it's incorrect. Looks like it didn't take much time too find codegen bug.~~

Sep 29 '15 17:09 denis-sh

Further looking at resulting assembly is even more interesting:

Good short assembly. Good except it's incorrect. Looks like it didn't take much time too find codegen bug.

Sorry for the noise, just my example code was incorrect. ~1 isn't zero.

Sep 29 '15 18:09 denis-sh

Changed the first commit to use sign-extending shift to make ~diffSign(x, y) && diffSign(x, r) construction valid too because otherwise there should be an warning comment and I don't want to add one just for the first commit.

Sep 29 '15 18:09 denis-sh

Not sure about benchmarks.

Will try to add one in a couple of days.

Sep 29 '15 18:09 denis-sh

Is there an equivalent standard benchmark for Phobos somewhere?

Not that I'm aware of, but this PR is druntime code anyhow.

Sep 29 '15 19:09 MartinNowak

@MartinNowak

Not that I'm aware of, but this PR is druntime code anyhow.

Sure. I was just wondering because it would be relevant to something else I'm working on.

Sep 29 '15 20:09 tsbockman

Ping @denis-sh

Mar 21 '16 09:03 DmitryOlshansky

Ping @denis-sh

version block styling issue or request for benchmark?

Mar 21 '16 18:03 denis-sh

version block styling issue or request for benchmark?

I think benchmarking is more important and was your last intention.

Mar 21 '16 18:03 DmitryOlshansky

druntime druntime copied to clipboard

Rewrite `core.checkedint` functions.

P.S.

druntime
druntime copied to clipboard