druntime
druntime copied to clipboard
Rewrite `core.checkedint` functions.
Use fast and simple sign comparison instead of long, slow and complicated value comparisons.
Also remove reference as it doesn't apply anymore.
P.S.
I planned to rewrite this module implementation since its creation because it's clearly written in a hurry (e.g. see r < x || r < y check in addu), sorry this took that long to get to it. IMHO this is one of the modules that have to show source code readers how simple, robust and efficient code should be written in D.
Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.
Why all the style changes? And because you specifically mention the module as a potential style showcase: I don't think we should encourage the use of e.g. bitwise not for boolean values, nor relying on operator precedence wherever possible.
Also, you should add your name after Walter's. He is still the primary author of the API.
Looks good, maybe add a few microbenchmarks to
OK, I will able to do it probably tomorrow.
Why all the style changes?
Do you mean const additions, spaces after cast(X), and ulong(x) * ulong(y) to cast(ulong) x * y changes? If so, what exactly is the problem here?
And because you specifically mention the module as a potential style showcase: I don't think we should encourage the use of e.g. bitwise not for boolean values,
No boolean values are used. All types are size_t.
nor relying on operator precedence wherever possible.
Hm, I don't see any complicated cases in my code, the only thing used are a & b || c && d (it's known logical operators has lower priority compared to arithmetic/bitwise) and a >>> b + c (C++'s cin/cout taught us about these cases).
Just a note: In contrast, previous file version had this: && ((r == y)? r : (r / x) != y), where the only probably needed for clarity parantheses pair is around r / x != y and is missed but three useless pairs are added just to complicate expression reading as (r == y) tells the reader ?: evaluates before == and then the reader may think the last != y compares the result of the whole previous text with y.
Also, you should add your name after Walter's. He is still the primary author of the API.
Sorry, I supposed it was alphabetically sorted for some reason. Fixed.
As noted in the reference that was elided by this PR (and should be reinstated), doing these checks is tricky because of the ways some compilers generate code. The way I wrote the code was to avoid various code generators invalidating the results.
I'm not saying your method is wrong, but I do suggest a careful review of the unittests to ensure that it really does work correctly and is tested thoroughly.
...doing these checks is tricky because of the ways some compilers generate code. The way I wrote the code was to avoid various code generators invalidating the results.
Do you mean there are code generation bugs and these bugs should be considered in every module we write? Because I don't see when this pull rely on some concrete implementation of UB case.
return cast(size_t) (n >>> T.sizeof * 8 - 1);
Many compilers generate very good code for
bool b = a < b;. I suggest that instead of the shift.
size_t _differentSign(T)(in T a, in T b)
should return a
bool
Do you suppose _signBit should be n => n < 0 instead of n => cast(size_t) (n >>> T.sizeof * 8 - 1) and _differentSign should be (a, b) => a._signBit != b._signBit instead of (a, b) => a._signBit ^ b._signBit?
- Improved the original version by using
(a ^ b)._signBitinstead ofa._signBit ^ b._signBitto save one shift operation. - Restored reference to silly blog post (still against this change).
Do you suppose
_signBitshould ben => n < 0instead ofn => cast(size_t) (n >>> T.sizeof * 8 - 1)and_differentSignshould be(a, b) => a._signBit != b._signBitinstead of(a, b) => a._signBit ^ b._signBit?
- Added commit which rewrites implementation this way. The generated code still looks pretty good on GDC (just uses two conditional jumps instead of one) and very good on Clang (
(a < 0) != (b < 0)is fully optimized but still there are two conditional jumps fora && bconstruction) (and sorry, didn't test LDC, only Clang).
The generated code still looks pretty good on GDC (just uses two conditional jumps instead of one)
Such a shame that GDC ignores these function bodies then. ;-)
LDC won't use those function bodies either, except on platforms where there is no native instruction (e.g. long on x86).
This version of core.checkedint still passes my preciseops test suite, which is considerably more thorough than the druntime unittests.
Performance-wise, I find it a bit slower with DMD than the old version was.
Performance-wise, I find it a bit slower with DMD than the old version was.
With DMD, could you check the first commit (the one without the use of logical operations) please?
(I should note that my tests assume no one has gotten around to implementing intrinsics for core.checkedint on DMD yet; I need to retest if this is incorrect.)
The bitwise version was slightly slower than the logical version, even after I optimized it by combining some unnecessary shift operations.
The bitwise version was slightly slower than the logical version, even after I optimized it by combining some unnecessary shift operations.
What functions exactly are you measuring? Is adds for long slower? I.e. is ~cast(size_t)((x ^ y) >>> 63) & cast(size_t)((x ^ r) >>> 63) slower than x < 0 && y < 0 && r >= 0 || x >= 0 && y >= 0 && r < 0?
I am running the benchmark found in main.d of my CheckedInt project. It uses checked arithmetic to do various make-work calculations like prime finding, integer square root, the collatz sequence, and quicksort.
It was designed to test the performance of my checked integer type, but in the process it exercises all the contents of core.checkedint. On DMD, most of the runtime is taken up by other things. But, when holding other factors constant, the performance differences between various core.checkedint implementations are discernible.
I am not really a big fan of micro-benchmarks, because it is so hard to get them right. If you really want, I can take a stab at breaking things down in more detail later. However, I think it's probably a waste of time; the old core.checkedint implementation is already fast enough that, with a good optimizer (by which I mean GDC), the performance boost from enabling intrinsics is barely noticeable.
Compared to bitwise equivalents, the kind of test logic used in the old implementation looks slow, but in reality it is highly amenable to optimization and superscalar execution.
If this pull request is accepted, it should be for some reason other than runtime performance.
If this pull request is accepted, it should be for some reason other than runtime performance.
Agree. I tried to improve the whole module implementation style.
This is
a && b / c != d, parenthesis really does help make it clear at a glance just exactly what it is doing.
Readded.
I would really rather put the && at the end of the first line
Changed as I'm probably the only one here using such style and ignoring Code Complete.
I am not really a big fan of micro-benchmarks
Let's please add them anyhow. Number of instructions and latency is a good proxy, though you can hardly measure the impact on the branch cache.
Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.
Not sure about benchmarks. For now I only can describe resulting machine code.
This is what Clang does (one can see full (a < 0) != (b < 0) optimization applied):
adds(int, int, bool*):
lea eax, [rsi + rdi]
xor esi, edi
js .LBB0_3
xor edi, eax
jns .LBB0_3
mov byte ptr [rdx], 1
.LBB0_3:
ret
and this is what GCC does (just direct shifting and comparison, looks rather slow):
adds(int, int, bool*):
leal (%rdi,%rsi), %eax
shrl $31, %esi
shrl $31, %edi
cmpb %dil, %sil
jne .L4
movl %eax, %ecx
shrl $31, %ecx
cmpb %cl, %sil
je .L4
movb $1, (%rdx)
.L4:
rep ret
So yes, one have to use (a ^ b)._signBit or GCC will not optimize it (but if one does use ^, GCC will produce the same assembly as Clang's shown above).
@MartinNowak
Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.
Is there an equivalent standard benchmark for Phobos somewhere?
~diffSign(x, y) && diffSign(x, r) is a wrong code if diffSign returns 1.
~~Further looking at resulting assembly is even more interesting:~~
~~~diffSign(x, y) && diffSign(x, r) gives us this with with Clang:~~
adds(int, int, bool*):
add esi, edi # looks unoptimal, could use EAX via lea
xor edi, esi
jns .LBB0_2
mov byte ptr [rdx], 1
.LBB0_2:
mov eax, esi # as EAX wasn't used
ret
~~and this with GCC (if (a ^ b)._signBit is used of course):~~
adds(int, int, bool*):
lea eax, [rsi+rdi]
xor edi, eax
jns .L4
mov BYTE PTR [rdx], 1
.L4:
rep ret
~~GCC is the "winner" here. The instruction count winner who can't optimize (a < 0) != (b < 0).~~
~~Usage of ! instead of ~ or & instead of && increases instruction count, sometimes a lot.~~
~~> Further looking at resulting assembly is even more interesting:~~
~~Good short assembly. Good except it's incorrect. Looks like it didn't take much time too find codegen bug.~~
Further looking at resulting assembly is even more interesting:
Good short assembly. Good except it's incorrect. Looks like it didn't take much time too find codegen bug.
Sorry for the noise, just my example code was incorrect. ~1 isn't zero.
Changed the first commit to use sign-extending shift to make ~diffSign(x, y) && diffSign(x, r) construction valid too because otherwise there should be an warning comment and I don't want to add one just for the first commit.
Not sure about benchmarks.
Will try to add one in a couple of days.
Is there an equivalent standard benchmark for Phobos somewhere?
Not that I'm aware of, but this PR is druntime code anyhow.
@MartinNowak
Not that I'm aware of, but this PR is druntime code anyhow.
Sure. I was just wondering because it would be relevant to something else I'm working on.
Ping @denis-sh
Ping @denis-sh
version block styling issue or request for benchmark?
version block styling issue or request for benchmark?
I think benchmarking is more important and was your last intention.