simdcsv
[+] auto-unroll
@geofflangdale @lemire
great stuff, guys! too bad the project went stale.
here's what i got:
- fix manually unrolled loops in flatten_bits() which generate suboptimal code (clang -O3 auto-unrolls better, tested on both intel and arm, a noticeable improvement)
- buffering rounds bumped to 8 (seems a dash better than 4 suggested by Daniel)
- some cosmetics (un-hardcoded separator char etc)
for more discussion regarding #1, see my C port https://github.com/kelas/ld with disassembly listings:
$ cd ld && make unroll nounroll
or just: https://github.com/kelas/ld/blob/master/zip.UNROLL https://github.com/kelas/ld/blob/master/zip.NOUNROLL
PS. I just discovered that @lemire also attempted to make flatten_bits() a bit prettier, but as it turns out the compiler does the right thing without much heroics.
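for reference, roughly the de-unrolled shape this PR moves toward - a sketch rather than the literal diff, only the first block shown, helper names as in the existing source:
really_inline void flatten_bits(uint32_t *base_ptr, uint32_t &base,
                                uint32_t idx, uint64_t bits) {
  if (bits == 0u) return;
  uint32_t cnt = hamming(bits);
  uint32_t next_base = base + cnt;
  // fixed-count loop: clang/gcc at -O3 unroll this on their own
  for (int i = 0; i < 8; i++) {
    base_ptr[base + i] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
    bits = bits & (bits - 1);
  }
  // (the cnt > 8 and cnt > 16 tails follow the same pattern)
  base = next_base;
}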
fix manually unrolled loops in flatten_bits() which generate suboptimal code
I think you are saying that the compiler under some setting unrolls better than manually unrolled code.
I do not think that you are saying that this new code with C macros...
#define z(i) base_ptr[base + i] = static_cast<uint32_t>(idx) + trailingzeroes(bits); bits = bits & (bits - 1);
really_inline void flatten_bits(uint32_t *base_ptr, uint32_t &base,
uint32_t idx, uint64_t bits) {
if (bits != 0u) {
uint32_t cnt = hamming(bits);
uint32_t next_base = base + cnt;
z(0)z(1)z(2)z(3)z(4)z(5)z(6)z(7)
... will be significantly better than the existing code...
really_inline void flatten_bits(uint32_t *base_ptr, uint32_t &base,
uint32_t idx, uint64_t bits) {
if (bits != 0u) {
uint32_t cnt = hamming(bits);
uint32_t next_base = base + cnt;
base_ptr[base + 0] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);
base_ptr[base + 1] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);
base_ptr[base + 2] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);
base_ptr[base + 3] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);
base_ptr[base + 4] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);
base_ptr[base + 5] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);
base_ptr[base + 6] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);
base_ptr[base + 7] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);
If not, why introduce a macro ?
Macros tend to harm readability and maintainability as you know.
Effectively, you are saying that you are getting better results without the manual unrolling. That's interesting.
@kelas
Assuming that you have access to Linux or an ARM-based Apple device, can you do the following and report back on the output? (The sudo may not be necessary, depending on your configuration.)
git clone git@github.com:lemire/Code-used-on-Daniel-Lemire-s-blog.git
cd Code-used-on-Daniel-Lemire-s-blog/2019/05/03
make && sudo ./bitmapdecoding
The simdjson_decoder_rolled function is my adaptation of your idea (let us not unroll manually) and simdjson_decoder is our function.
Here is what I get... (Apple ARM, clang 12)
simdjson_decoder:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
11.58 instructions/match (+/- 0.0 %) 2.82 cycles/match (+/- 0.1 %)
simdjson_decoder_rolled:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
11.58 instructions/match (+/- 0.0 %) 2.82 cycles/match (+/- 0.1 %)
And with GCC 8 under Zen 2...
simdjson_decoder:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.88, cycles per value set: 4.216, instructions per value set: 7.946, cycles per word: 31.200, instructions per word: 58.800
cycles per input byte 0.49 instructions per input byte 0.92
min: 312 cycles, 588 instructions, 2 branch mis., 0 cache ref., 0 cache mis.
avg: 388.7 cycles, 588.0 instructions, 4.1 branch mis., 0.0 cache ref., 0.0 cache mis.
simdjson_decoder_rolled:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.88, cycles per value set: 4.230, instructions per value set: 7.946, cycles per word: 31.300, instructions per word: 58.800
cycles per input byte 0.49 instructions per input byte 0.92
min: 313 cycles, 588 instructions, 2 branch mis., 0 cache ref., 0 cache mis.
avg: 471.0 cycles, 588.1 instructions, 4.4 branch mis., 0.0 cache ref., 0.0 cache mis.
As you can see, I get nothing close to a 30% performance gain when avoiding the manual unrolling. In fact, I get exactly the same results down to the instruction count.
BTW here is the current code in simdjson (the basis for simdcsv):
if (bits == 0)
return;
int cnt = static_cast<int>(count_ones(bits));
// Do the first 8 all together
for (int i=0; i<8; i++) {
this->tail[i] = idx + trailing_zeroes(bits);
bits = clear_lowest_bit(bits);
}
// Do the next 8 all together (we hope in most cases it won't happen at all
// and the branch is easily predicted).
if (simdjson_unlikely(cnt > 8)) {
for (int i=8; i<16; i++) {
this->tail[i] = idx + trailing_zeroes(bits);
bits = clear_lowest_bit(bits);
}
// Most files don't have 16+ structurals per block, so we take several basically guaranteed
// branch mispredictions here. 16+ structurals per block means either punctuation ({} [] , :)
// or the start of a value ("abc" true 123) every four characters.
if (simdjson_unlikely(cnt > 16)) {
int i = 16;
do {
this->tail[i] = idx + trailing_zeroes(bits);
bits = clear_lowest_bit(bits);
i++;
} while (i < cnt);
}
}
this->tail += cnt;
}
};
Macros tend to harm readability and maintainability as you know.
depends on your choice of macros :) this particular style indeed doesn't match the neighbourhood; it's a lazy copypaste from the C port, which is 20 lines of code total. i have no problem rewriting those two loops the loopy way, will do that later.
Effectively, you are saying that you are getting better results without the manual unrolling. That's interesting.
correct, i believe that's what i'm saying.
gcc8
sorry, i don't have access to legacy compiler tech. compilers do get better!
30% performance gain when avoiding the manual unrolling
affirmative. if you saw the disasm listings included in PR, the gist of it is this:
- clang12 and gcc10 auto-unroll each 8-step loop to this:
1000044c4: 41 89 c2 movl %eax, %r10d
1000044c7: 31 f6 xorl %esi, %esi
1000044c9: f3 48 0f bc f1 tzcntq %rcx, %rsi
1000044ce: 01 d6 addl %edx, %esi
1000044d0: 42 89 34 97 movl %esi, (%rdi,%r10,4)
1000044d4: c4 e2 f0 f3 c9 blsrq %rcx, %rcx
1000044d9: 31 f6 xorl %esi, %esi
1000044db: f3 48 0f bc f1 tzcntq %rcx, %rsi
1000044e0: 01 d6 addl %edx, %esi
1000044e2: 42 89 74 97 04 movl %esi, 4(%rdi,%r10,4)
1000044e7: c4 e2 f0 f3 c9 blsrq %rcx, %rcx
1000044ec: 31 f6 xorl %esi, %esi
1000044ee: f3 48 0f bc f1 tzcntq %rcx, %rsi
1000044f3: 01 d6 addl %edx, %esi
1000044f5: 42 89 74 97 08 movl %esi, 8(%rdi,%r10,4)
1000044fa: c4 e2 f0 f3 c9 blsrq %rcx, %rcx
1000044ff: 31 f6 xorl %esi, %esi
100004501: f3 48 0f bc f1 tzcntq %rcx, %rsi
100004506: 01 d6 addl %edx, %esi
100004508: 42 89 74 97 0c movl %esi, 12(%rdi,%r10,4)
- whereas manual unroll heroics compile to this:
1000044c2: 44 89 14 b7 movl %r10d, (%rdi,%rsi,4)
1000044c6: c4 e2 a8 f3 c9 blsrq %rcx, %r10
1000044cb: 31 f6 xorl %esi, %esi
1000044cd: f3 49 0f bc f2 tzcntq %r10, %rsi
1000044d2: 01 d6 addl %edx, %esi
1000044d4: 8d 48 01 leal 1(%rax), %ecx //!< here be dragons
1000044d7: 89 34 8f movl %esi, (%rdi,%rcx,4)
1000044da: c4 c2 a8 f3 ca blsrq %r10, %r10
1000044df: 31 f6 xorl %esi, %esi
1000044e1: f3 49 0f bc f2 tzcntq %r10, %rsi
1000044e6: 01 d6 addl %edx, %esi
1000044e8: 8d 48 02 leal 2(%rax), %ecx //!< more here
1000044eb: 89 34 8f movl %esi, (%rdi,%rcx,4)
1000044ee: c4 c2 a8 f3 ca blsrq %r10, %r10
1000044f3: 31 f6 xorl %esi, %esi
1000044f5: f3 49 0f bc f2 tzcntq %r10, %rsi
1000044fa: 01 d6 addl %edx, %esi
1000044fc: 8d 48 03 leal 3(%rax), %ecx //!< ad infinitum
i'm not sure where exactly the useless lea comes from, and tbh i don't want to know - sometimes it's ok to let the compiler do the right thing.
hope this helps k.
Sent from my Imperial Star Destroyer
yup, from the looks of it the simdjson version would get auto-unrolled like a boss. blue skies.
but yeah - i think it is the first time i've hit a major performance degradation from an otherwise perfectly braindead manually unrolled loop.
them compiler writers know their thing, eh.
i don't have access to legacy compiler tech.
If you follow... https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2019/05/03
You will notice that I offer Docker configurations and scripts allowing you to run it with the compilers I had used back in 2019.
Numbers with GCC 10 and Zen 2:
simdjson_decoder:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.82, cycles per value set: 4.392, instructions per value set: 8.000, cycles per word: 32.500, instructions per word: 59.200
cycles per input byte 0.51 instructions per input byte 0.93
min: 325 cycles, 592 instructions, 2 branch mis., 0 cache ref., 0 cache mis.
avg: 370.2 cycles, 592.0 instructions, 3.8 branch mis., 0.0 cache ref., 0.0 cache mis.
simdjson_decoder_rolled:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.84, cycles per value set: 4.338, instructions per value set: 8.000, cycles per word: 32.100, instructions per word: 59.200
cycles per input byte 0.50 instructions per input byte 0.93
min: 321 cycles, 592 instructions, 2 branch mis., 0 cache ref., 0 cache mis.
avg: 364.8 cycles, 592.0 instructions, 3.8 branch mis., 0.0 cache ref., 0.0 cache mis.
Numbers with Zen 2 and LLVM clang 9:
simdjson_decoder:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.72, cycles per value set: 4.811, instructions per value set: 8.297, cycles per word: 35.600, instructions per word: 61.400
cycles per input byte 0.56 instructions per input byte 0.96
min: 356 cycles, 614 instructions, 3 branch mis., 0 cache ref., 0 cache mis.
avg: 414.6 cycles, 614.0 instructions, 4.2 branch mis., 0.0 cache ref., 0.0 cache mis.
simdjson_decoder_rolled:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.69, cycles per value set: 4.905, instructions per value set: 8.297, cycles per word: 36.300, instructions per word: 61.400
cycles per input byte 0.57 instructions per input byte 0.96
min: 363 cycles, 614 instructions, 3 branch mis., 0 cache ref., 0 cache mis.
avg: 425.0 cycles, 614.0 instructions, 4.1 branch mis., 0.0 cache ref., 0.0 cache mis.
GCC 10 is current. GCC 8 came out in 2018.
30% performance gain when avoiding the manual unrolling (...) affirmative.
I submit to you that if the running time is 30% higher then you have discovered a bug with LLVM and you should report it to LLVM.
Here my results with LLVM clang 13 on Zen 2:
just scanning:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 0.47, cycles per value set: 1.622, instructions per value set: 0.770, cycles per word: 12.000, instructions per word: 5.700
cycles per input byte 0.19 instructions per input byte 0.09
min: 120 cycles, 57 instructions, 3 branch mis., 0 cache ref., 0 cache mis.
avg: 142.1 cycles, 57.0 instructions, 3.0 branch mis., 0.0 cache ref., 0.0 cache mis.
simdjson_decoder:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.57, cycles per value set: 5.068, instructions per value set: 7.932, cycles per word: 37.500, instructions per word: 58.700
cycles per input byte 0.59 instructions per input byte 0.92
min: 375 cycles, 587 instructions, 3 branch mis., 0 cache ref., 0 cache mis.
avg: 428.4 cycles, 587.0 instructions, 5.2 branch mis., 0.0 cache ref., 0.0 cache mis.
simdjson_decoder_rolled:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.62, cycles per value set: 4.892, instructions per value set: 7.932, cycles per word: 36.200, instructions per word: 58.700
cycles per input byte 0.57 instructions per input byte 0.92
min: 382 cycles, 587 instructions, 3 branch mis., 0 cache ref., 0 cache mis.
avg: 425.8 cycles, 587.0 instructions, 5.0 branch mis., 0.0 cache ref., 0.0 cache mis.
So I get the same results (exactly same number of instructions) with both functions.
You can use Dockerfile.llvmlatest to reproduce the results.
You can use Dockerfile.llvmlatest to reproduce the results.
@lemire - do you mean this Dockerfile.llvmlatest?
You can use Dockerfile.llvmlatest
do you mean this Dockerfile.llvmlatest
Yes.
It appears that @kelas rediscovered an optimization that has been applied to simdjson a few years ago. I still find it puzzling that an auto-unrolled loop results in better code than its repeated body.
time 30% higher
it actually seems to fluctuate between 10% and 25%, i guess depending on the cost of the lea instruction on a given microarch.
then you have discovered a bug with LLVM and you should report it to LLVM.
this is a bold statement, Daniel. both gcc10 and clang12 (-O3) auto-unroll these two loops without extra crud, and work faster. this is a double-blind validated result. sorry.
i therefore submit to you that in this day and age, manual loop wizardry from the language userland needs to be carefully profiled and validated. In this particular case, two pages of copy-pasted static_cast didn't do much good.
(also results in style which looks a bit watery to some, but tastes differ. some people hate macros.)
it's also reassuring to hear that simdjson was relieved of this optimisation at some point. great project by the way, keep up the good work.
k.
According to @jkeiser, de-unrolling did not affect performance and was applied for purely aesthetic reasons. See the original Simplify and optimize flatten_bits patch.
purely aesthetic reasons.
thanks for digging that out, alex! i second that; i originally de-unrolled both loops in the C port just because i didn't want to have zzzzzzzz in my code, but then i profiled it, and i was in for a little surprise.
@abalkin
Most compilers at the highest level of optimization will unroll short simple loops... short is about 16 iterations. Note that this is sensitive to the optimization flags.
If you compile with -Os or -O2, then you will not get unrolling (GCC will not unroll at all at -O2). Both the -Os or -O2 flags are fairly popular and conventional.
If you really want unrolling, you better do it yourself in the code.
Note here that if you look at the code, even with the short loops from 0 to 8, it is still effectively unrolled. The no-unrolling would be like so...
do {
this->tail[i] = idx + trailing_zeroes(bits);
bits = clear_lowest_bit(bits);
i++;
} while (i < cnt);
Empirically, it proves beneficial within simdjson to do the song and dance that you see.
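(An aside, not from this thread or from simdjson itself: if you really want unrolling without writing the loop body out by hand, loop pragmas are another option. A sketch, assuming clang or GCC 8 or newer, meant to replace the first loop quoted above:)
#if defined(__clang__)
  #pragma clang loop unroll_count(8)   // clang-specific loop pragma
#elif defined(__GNUC__)
  #pragma GCC unroll 8                 // available since GCC 8
#endif
for (int i = 0; i < 8; i++) {
  this->tail[i] = idx + trailing_zeroes(bits);
  bits = clear_lowest_bit(bits);
}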
It appears that @kelas rediscovered an optimization that has been applied to simdjson a few years ago
I do not recall it this way. It was not an optimization. Rather the author of this contribution noticed that when building release builds with CMake, we got the needed unrolling. There was never any claim that it improved the performance.
If you really want unrolling, you better do it yourself in the code.
this little hiccup at hand provides ample evidence that it just ain't so. a human getting in the way of a modern optimising C/C++ compiler has every chance to end up being the bottleneck.
I do not recall it this way. It was not an optimization
still the jumbo commit description claims 15% improvement of stage 1 code, which i trust is some kind of early tokenization pass. hard to know what transpired, and what compilers were at play.
There was never any claim that it improved the performance.
there is one now.
@kelas
this is a bold statement, Daniel
I disagree. I carefully provided multiple reproducible tests, using multiple compilers and more than one processor, using CPU performance counters, showing no difference.
I am getting perfectly comparable binaries (as measured by performance counters) and the assembly outputs are what I am expecting.
You are reporting that both GNU GCC and LLVM clang produce anomalous code that harms the performance. I submit to you that this should be reported.
Here is the clang 12 assembly produced with the unrolled code:
flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long): # @flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long)
test rcx, rcx
je .LBB0_6
popcnt r9, rcx
mov r10d, dword ptr [rsi]
lea r8d, [r10 + r9]
tzcnt rax, rcx
add eax, edx
mov dword ptr [rdi + 4*r10], eax
blsr r10, rcx
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
inc eax
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 2
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 3
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 4
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 5
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 6
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor eax, eax
tzcnt rax, r10
add eax, edx
mov ecx, dword ptr [rsi]
add ecx, 7
mov dword ptr [rdi + 4*rcx], eax
cmp r9d, 9
jb .LBB0_5
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 8
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 9
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 10
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 11
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 12
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 13
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 14
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor eax, eax
tzcnt rax, r10
add eax, edx
mov ecx, dword ptr [rsi]
add ecx, 15
mov dword ptr [rdi + 4*rcx], eax
cmp r9d, 17
jb .LBB0_5
blsr rax, r10
mov ecx, dword ptr [rsi]
add ecx, 16
mov dword ptr [rsi], ecx
.LBB0_4: # =>This Inner Loop Header: Depth=1
xor r9d, r9d
tzcnt r9, rax
add r9d, edx
mov ecx, ecx
mov dword ptr [rdi + 4*rcx], r9d
mov ecx, dword ptr [rsi]
inc ecx
blsr rax, rax
mov dword ptr [rsi], ecx
jne .LBB0_4
.LBB0_5:
mov dword ptr [rsi], r8d
.LBB0_6:
ret
The core sequence is something like this...
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 8
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
Here is the GCC 10 code:
flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long):
test rcx, rcx
je .L12
xor r10d, r10d
mov eax, DWORD PTR [rsi]
xor r8d, r8d
tzcnt r10, rcx
popcnt r8, rcx
add r10d, edx
lea r9d, [rax+r8]
mov DWORD PTR [rdi+rax*4], r10d
blsr rax, rcx
mov r10d, DWORD PTR [rsi]
xor ecx, ecx
tzcnt rcx, rax
blsr rax, rax
inc r10d
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 2
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 3
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 4
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 5
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 6
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
add r10d, 7
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
cmp r8d, 8
jle .L4
mov r10d, DWORD PTR [rsi]
xor ecx, ecx
blsr rax, rax
tzcnt rcx, rax
blsr rax, rax
add r10d, 8
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 9
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 10
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 11
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 12
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 13
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 14
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 15
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
cmp r8d, 16
jle .L4
mov ecx, DWORD PTR [rsi]
add ecx, 16
mov DWORD PTR [rsi], ecx
.L5:
xor r8d, r8d
mov ecx, ecx
tzcnt r8, rax
add r8d, edx
mov DWORD PTR [rdi+rcx*4], r8d
mov ecx, DWORD PTR [rsi]
inc ecx
blsr rax, rax
mov DWORD PTR [rsi], ecx
jne .L5
.L4:
mov DWORD PTR [rsi], r9d
ret
.L12:
ret
Both compiled with -O3.
Link: https://godbolt.org/z/5sn1WEMen
Now here is the 'rolled version' (same compiler setting). Clang.
flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long): # @flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long)
test rcx, rcx
je .LBB0_6
popcnt r8, rcx
mov r9d, dword ptr [rsi]
tzcnt rax, rcx
add eax, edx
mov dword ptr [rdi + 4*r9], eax
blsr r10, rcx
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
inc eax
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 2
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 3
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 4
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 5
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 6
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor eax, eax
tzcnt rax, r10
add eax, edx
mov ecx, dword ptr [rsi]
add ecx, 7
mov dword ptr [rdi + 4*rcx], eax
add r9d, r8d
cmp r8d, 9
jb .LBB0_5
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 8
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 9
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 10
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 11
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 12
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 13
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 14
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 15
mov dword ptr [rdi + 4*rax], ecx
cmp r8d, 17
jb .LBB0_5
blsr rax, r10
mov ecx, dword ptr [rsi]
add ecx, 16
mov dword ptr [rsi], ecx
.LBB0_4: # =>This Inner Loop Header: Depth=1
xor r8d, r8d
tzcnt r8, rax
add r8d, edx
mov ecx, ecx
mov dword ptr [rdi + 4*rcx], r8d
mov ecx, dword ptr [rsi]
inc ecx
blsr rax, rax
mov dword ptr [rsi], ecx
jne .LBB0_4
.LBB0_5:
mov dword ptr [rsi], r9d
.LBB0_6:
ret
The core sequence is
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
inc eax
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
It is distinct from what we got in the unrolled case, but barely; recall what we had...
tzcnt rcx, r10
add ecx, edx
mov eax, dword ptr [rsi]
add eax, 8
mov dword ptr [rdi + 4*rax], ecx
blsr r10, r10
xor ecx, ecx
Importantly, the instruction count is the same.
And then GCC
flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long):
test rcx, rcx
je .L12
xor r10d, r10d
mov eax, DWORD PTR [rsi]
xor r8d, r8d
tzcnt r10, rcx
popcnt r8, rcx
add r10d, edx
lea r9d, [rax+r8]
mov DWORD PTR [rdi+rax*4], r10d
blsr rax, rcx
mov r10d, DWORD PTR [rsi]
xor ecx, ecx
tzcnt rcx, rax
blsr rax, rax
inc r10d
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 2
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 3
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 4
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 5
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 6
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 7
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
cmp r8d, 8
jle .L4
mov r10d, DWORD PTR [rsi]
xor ecx, ecx
tzcnt rcx, rax
blsr rax, rax
add r10d, 8
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 9
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 10
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 11
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 12
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 13
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
add r10d, 14
add ecx, edx
blsr rax, rax
mov DWORD PTR [rdi+r10*4], ecx
xor ecx, ecx
mov r10d, DWORD PTR [rsi]
tzcnt rcx, rax
blsr rax, rax
add r10d, 15
add ecx, edx
mov DWORD PTR [rdi+r10*4], ecx
cmp r8d, 16
jle .L4
mov ecx, DWORD PTR [rsi]
add ecx, 16
mov DWORD PTR [rsi], ecx
.L5:
xor r8d, r8d
mov ecx, ecx
tzcnt r8, rax
add r8d, edx
mov DWORD PTR [rdi+rcx*4], r8d
mov ecx, DWORD PTR [rsi]
inc ecx
blsr rax, rax
mov DWORD PTR [rsi], ecx
jne .L5
.L4:
mov DWORD PTR [rsi], r9d
ret
.L12:
ret
If you switch to -O2 with GCC and the rolled code, you get this:
flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long):
mov r8d, edx
test rcx, rcx
je .L16
mov r9d, DWORD PTR [rsi]
xor r11d, r11d
xor eax, eax
popcnt r11, rcx
lea r10d, [r9+r11]
jmp .L4
.L18:
mov r9d, DWORD PTR [rsi]
.L4:
xor edx, edx
add r9d, eax
inc eax
tzcnt rdx, rcx
blsr rcx, rcx
add edx, r8d
mov DWORD PTR [rdi+r9*4], edx
cmp eax, 8
jne .L18
cmp r11d, 8
jle .L6
.L5:
xor edx, edx
mov r9d, eax
inc eax
add r9d, DWORD PTR [rsi]
tzcnt rdx, rcx
blsr rcx, rcx
add edx, r8d
mov DWORD PTR [rdi+r9*4], edx
cmp eax, 16
jne .L5
cmp r11d, 16
jle .L6
mov eax, DWORD PTR [rsi]
add eax, 16
mov DWORD PTR [rsi], eax
.L7:
xor edx, edx
mov eax, eax
tzcnt rdx, rcx
add edx, r8d
mov DWORD PTR [rdi+rax*4], edx
mov eax, DWORD PTR [rsi]
inc eax
blsr rcx, rcx
mov DWORD PTR [rsi], eax
jne .L7
.L6:
mov DWORD PTR [rsi], r10d
ret
.L16:
ret
@kelas
there is one now.
Yes. You are making this claim.
I find the evidence lacking.
It does not mean that your claim is not interesting. I'd love to document such a case myself. I love surprising things.
Can you provide me with a reproducible test case, instrumented with instruction counts? Because here you are claiming that one function generates far more instructions than the other. These things are easy to quantify very precisely.
The only source of noise in these cases has to do with the branch predictor. But the function is written to have relatively few branches. Just make sure that your input has a predictable number of 1s per word and you will tame the predictions.
Other than that, the instruction count can be known exactly with no room for variations.
We can just count.
There is no room for debate here. Provide the reproducible test case. You get to pick your compiler: anything that runs under Docker is fair game.
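(For concreteness, here is one way to build the "predictable number of 1s per word" input mentioned above. A minimal sketch with a hypothetical helper, not taken from the benchmark code:)
#include <cstddef>
#include <cstdint>
#include <vector>

// Build n words, each with exactly ones_per_word set bits (1 <= ones_per_word <= 64),
// evenly spaced so that branch outcomes in the decoder stay constant across words.
std::vector<uint64_t> make_words(std::size_t n, int ones_per_word) {
  std::vector<uint64_t> words(n, 0);
  for (std::size_t w = 0; w < n; w++)
    for (int k = 0; k < ones_per_word; k++)
      words[w] |= uint64_t(1) << (k * (64 / ones_per_word));
  return words;
}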
Here is a demonstration based on @kelas's code of clang seemingly inserting an extra instruction in manually unrolled code:
https://godbolt.org/z/r75P1GzM8
Unroll:
bsf r10, rcx
add r10d, edx
mov esi, eax ; <------ here
mov dword ptr [rdi + 4*rsi], r10d
lea rsi, [rcx - 1]
and rsi, rcx
Loop:
bsf rsi, rcx
add esi, edx
mov dword ptr [rdi + 4*r9], esi
lea rsi, [rcx - 1]
and rsi, rcx
Note that I see an extra mov rather than lea in my demo.
Here is an even simpler demo with an extra lea: https://godbolt.org/z/qbd8nxEYh
@abalkin The simdcsv library will not build on anything but NEON or AVX2 targets. Please compile either for NEON or AVX2 targets (e.g., add -march=haswell or use an ARM 64 target).
Even if you do so (change the target), you will still see the extra LEA.
But here is what the current simdcsv code compiles to...
https://godbolt.org/z/hjf8cTqGb
There is no LEA as you can see.
haswell
ok
evidence lacking
here's the bare-minimum example of the exact lea condition on the haswell microarch, which manifests in the hand-unrolled version and doesn't when unrolling is left to the compiler, all other things being equal (@lemire my apologies for the somewhat opinionated coding style, but it seems appropriate to convey the idea more succinctly):
https://godbolt.org/z/GfEb4536r
Adding -march=haswell does not seem to make a difference.
https://godbolt.org/z/PEshGKGes
But as far as I know, @kelas did not see the extra lea generated by your code, only by what he thought was equivalent code that he wrote. I can now demonstrate that clang indeed seems to insert extra mov or lea instructions in manually unrolled code compared to an auto-unrolled loop.
@kelas - this is an interesting discussion, but it doesn't seem to be pertinent to this PR.
I have compiled simdcsv with and without your changes and cannot see any difference in the size of the binary:
$ ls -l simdcsv*
-rwxr-xr-x 1 a staff 54320 Apr 16 19:36 simdcsv
-rwxr-xr-x 1 a staff 54320 Apr 16 19:35 simdcsv.a9494a
Nevertheless, I do see some increase in throughput:
$ ./simdcsv -i1000 examples/nfl.csv
GB/s: 6.38095
$ ./simdcsv.a9494a -i1000 examples/nfl.csv
GB/s: 5.93843
I have not tried to instrument these binaries as @lemire suggested or investigate the phenomenon any further, but apparently what you see in your code is not what makes this PR improve throughput in my environment.
same obj size better performance
coincidence? :) this is getting a bit into C++ voodoo; I'm not an expert, not my area, not very fascinating tbh. what i know for sure is that my C port is faster than this implementation; objsize you're welcome to check on your own, but don't hold your breath - C is C.
@kelas @abalkin
What happens if you change SIMDCSV_BUFFERSIZE ? Note how it was doubled in this PR. I remember setting SIMDCSV_BUFFERSIZE "approximately".
Regarding the LEA, we can sum it up as follows. If you look at the code, we are comparing...
mov DWORD PTR [rdi+12+r8*4], ecx
with
lea ecx, [rsi + 3]
mov dword ptr [rdi + 4*rcx], r9d
It is interesting.
I am querying twitter to see if someone has an idea https://twitter.com/lemire/status/1383210384321036289
What happens if you change SIMDCSV_BUFFERSIZE ?
Does not seem to make a difference, but I now noticed that even with -i1000 the throughput measure is somewhat unstable:
$ for i in `seq 10`; do ./simdcsv -i1000 examples/nfl.csv; done
GB/s: 6.26033
GB/s: 6.62941
GB/s: 5.83218
GB/s: 5.93164
GB/s: 6.43911
GB/s: 6.43891
GB/s: 6.13735
GB/s: 6.32976
GB/s: 6.25318
GB/s: 6.60594
and the improvement (if any) is not as evident as I thought:
$ for i in `seq 10`; do ./simdcsv.a9494a -i1000 examples/nfl.csv; done
GB/s: 6.8027
GB/s: 6.81799
GB/s: 6.03031
GB/s: 6.51562
GB/s: 6.20322
GB/s: 6.39817
GB/s: 5.96236
GB/s: 6.84384
GB/s: 6.88059
GB/s: 6.86554
Here are my results on a Zen 2 machine with GNU GCC 10.
Main branch:
$ for i in `seq 10`; do ./build/simdcsv -i1000 examples/nfl.csv |grep GB; done
GB/s: 4.73389
GB/s: 4.71476
GB/s: 4.71867
GB/s: 4.77128
GB/s: 4.7712
GB/s: 4.73851
GB/s: 4.69975
GB/s: 4.75183
GB/s: 4.77325
GB/s: 4.74215
This PR
$ for i in `seq 10`; do ./tmp/simdcsv/build/simdcsv -i1000 examples/nfl.csv |grep GB; done
GB/s: 4.66962
GB/s: 4.67889
GB/s: 4.61657
GB/s: 4.66997
GB/s: 4.67997
GB/s: 4.68536
GB/s: 4.687
GB/s: 4.68151
GB/s: 4.6803
GB/s: 4.67374
So at least in my test, this PR is not beneficial.
Note that I am using a machine configured for testing (not a laptop).
@kelas @abalkin
What happens if you change SIMDCSV_BUFFERSIZE ? Note how it was doubled in this PR. I remember setting SIMDCSV_BUFFERSIZE "approximately".
yes, back in 2019 you identified the sweet spot to be four buffering iterations for better instr/sec; i bumped it to 8 (also empirically, it seems to perform marginally better on the iron i have)
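(a hedged paraphrase of what the buffering rounds mean, not the actual simdcsv source, just the shape: gather separator bitmasks for SIMDCSV_BUFFERSIZE 64-byte blocks in one pass, then flatten them in a second pass. find_separator_mask is a hypothetical stand-in for the SIMD scan:)
#include <cstddef>
#include <cstdint>

uint64_t find_separator_mask(const uint8_t *block64);  // hypothetical: SIMD scan of one 64-byte block
void flatten_bits(uint32_t *base_ptr, uint32_t &base, uint32_t idx, uint64_t bits);

constexpr std::size_t SIMDCSV_BUFFERSIZE = 8;  // this PR: 8 rounds instead of 4

void parse_rounds(const uint8_t *buf, std::size_t offset,
                  uint32_t *base_ptr, uint32_t &base) {
  uint64_t masks[SIMDCSV_BUFFERSIZE];
  for (std::size_t b = 0; b < SIMDCSV_BUFFERSIZE; b++)   // pass 1: gather masks
    masks[b] = find_separator_mask(buf + offset + b * 64);
  for (std::size_t b = 0; b < SIMDCSV_BUFFERSIZE; b++)   // pass 2: flatten to indexes
    flatten_bits(base_ptr, base, static_cast<uint32_t>(offset + b * 64), masks[b]);
}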
Regarding the LEA, we can sum it up as follows. If you look at the code, we are comparing... It is interesting.
true! i think i see what it is. in my last snippet, the loopy version can be degraded to lea pollution by re-declaring the loop counter as unsigned int (UI in my parlance) instead of U (unsigned long long). i assume the busy-looking static_cast code escapes this exact caveat in the C++ version, but i find leaving it to the compiler to be a more defensive and fool-proof tactic.
that said, profiling these kinds of critical codepaths is essential in any event - the less code there is to shovel around, the easier it gets to identify a potential imperfection.
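to illustrate the point (a hypothetical stand-alone example, not from the PR): with a 32-bit counter the unsigned 32-bit sum base + i must be materialized and zero-extended before each store, which is where the extra lea/mov tends to come from, while a 64-bit counter lets the constant offsets fold straight into the addressing mode.
#include <cstdint>

void fill_u32_index(uint32_t *out, uint32_t base, uint32_t v) {
  for (uint32_t i = 0; i < 8; i++)    // 32-bit index: expect an extra lea/mov per store
    out[base + i] = v + i;
}

void fill_u64_index(uint32_t *out, uint64_t base, uint32_t v) {
  for (uint64_t i = 0; i < 8; i++)    // 64-bit index: offsets fold into [out + 4*base + 4*i]
    out[base + i] = v + static_cast<uint32_t>(i);
}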
@kelas
i assume the busy-looking static_cast code escapes this exact caveat
I am not sure that the static_cast<uint32_t>(idx) does anything at all. It looks like a junk cast to me.