
[+] auto-unroll

Open kelas opened this pull request 3 years ago • 47 comments

@geofflangdale @lemire

great stuff, guys! too bad the project went stale.

here's what i got:

  1. fix manually unrolled loops in flatten_bits() which generate suboptimal code (clang -O3 auto-unrolls better, tested on both intel and arm, a noticeable improvement)
  2. buffering rounds bumped to 8 (seems a dash better than 4 suggested by Daniel)
  3. some cosmetics (un-hardcoded separator char etc)

for more discussion regarding #1, see my C port https://github.com/kelas/ld with disassembly listings:

$ cd ld && make unroll nounroll

or just: https://github.com/kelas/ld/blob/master/zip.UNROLL https://github.com/kelas/ld/blob/master/zip.NOUNROLL

PS. I just discovered that @lemire also attempted to make flatten_bits() a bit prettier, but as it turns out, the compiler does the right thing without much heroics.
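
for the impatient, the rolled shape being argued for looks roughly like this - a self-contained sketch, where hamming/trailingzeroes are stand-ins for simdcsv's helpers, not the exact diff:

#include <cstdint>

// stand-ins for simdcsv's helpers (assumed for illustration; the guard keeps
// trailingzeroes well-defined once bits is exhausted)
static inline uint32_t hamming(uint64_t x) { return (uint32_t)__builtin_popcountll(x); }
static inline uint32_t trailingzeroes(uint64_t x) { return x ? (uint32_t)__builtin_ctzll(x) : 64; }

// rolled flatten_bits: fixed-trip-count loops that clang/gcc auto-unroll at -O3.
// like the original, it may write up to 8/16 slots but only advances base by cnt.
static inline void flatten_bits(uint32_t *base_ptr, uint32_t &base,
                                uint32_t idx, uint64_t bits) {
  if (bits == 0u) return;
  uint32_t cnt = hamming(bits);
  uint32_t next_base = base + cnt;
  for (int i = 0; i < 8; i++) {                     // first 8 together
    base_ptr[base + i] = idx + trailingzeroes(bits);
    bits = bits & (bits - 1);                       // clear lowest set bit
  }
  if (cnt > 8) {
    for (int i = 8; i < 16; i++) {                  // next 8, rarely needed
      base_ptr[base + i] = idx + trailingzeroes(bits);
      bits = bits & (bits - 1);
    }
    for (int i = 16; i < (int)cnt; i++) {           // dense tail, very rare
      base_ptr[base + i] = idx + trailingzeroes(bits);
      bits = bits & (bits - 1);
    }
  }
  base = next_base;
}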

kelas avatar Apr 16 '21 07:04 kelas

fix manually unrolled loops in flatten_bits() which generate suboptimal code

I think you are saying that the compiler under some setting unrolls better than manually unrolled code.

I do not think that you are saying that this new code with C macros...

#define z(i) base_ptr[base + i] = static_cast<uint32_t>(idx) + trailingzeroes(bits); bits = bits & (bits - 1);
 really_inline void flatten_bits(uint32_t *base_ptr, uint32_t &base,
                                 uint32_t idx, uint64_t bits) {
   if (bits != 0u) {
     uint32_t cnt = hamming(bits);
   uint32_t next_base = base + cnt;
      z(0)z(1)z(2)z(3)z(4)z(5)z(6)z(7)

... will be significantly better than the existing code...

really_inline void flatten_bits(uint32_t *base_ptr, uint32_t &base,
                                 uint32_t idx, uint64_t bits) {
   if (bits != 0u) {
     uint32_t cnt = hamming(bits);
     uint32_t next_base = base + cnt;
     base_ptr[base + 0] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
     bits = bits & (bits - 1);
     base_ptr[base + 1] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
     bits = bits & (bits - 1);
     base_ptr[base + 2] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
     bits = bits & (bits - 1);
     base_ptr[base + 3] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
     bits = bits & (bits - 1);
     base_ptr[base + 4] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
     bits = bits & (bits - 1);
     base_ptr[base + 5] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
     bits = bits & (bits - 1);
     base_ptr[base + 6] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
     bits = bits & (bits - 1);
     base_ptr[base + 7] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
     bits = bits & (bits - 1);
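
For the record, z(0) expands to exactly the first two statements above, so the two versions are token-for-token the same:

base_ptr[base + 0] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);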

If not, why introduce a macro ?

Macros tend to harm readability and maintainability as you know.

Effectively, you are saying that you are getting better results without the manual unrolling. That's interesting.

lemire avatar Apr 16 '21 13:04 lemire

@kelas Assuming that you have access to Linux or an ARM-based Apple device, can you do the following and report back on the output ? (The sudo may not be necessary, depending on your configuration.)

git clone git@github.com:lemire/Code-used-on-Daniel-Lemire-s-blog.git
cd Code-used-on-Daniel-Lemire-s-blog/2019/05/03
make && sudo ./bitmapdecoding

The simdjson_decoder_rolled function is my adaptation of your idea (let us not unroll manually) and simdjson_decoder is our function.

Here is what I get... (Apple ARM, clang 12)

simdjson_decoder:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
    11.58 instructions/match (+/- 0.0 %)      2.82 cycles/match (+/- 0.1 %)
simdjson_decoder_rolled:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
    11.58 instructions/match (+/- 0.0 %)      2.82 cycles/match (+/- 0.1 %)

And with GCC 8 under Zen 2...

simdjson_decoder:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.88, cycles per value set:  4.216, instructions per value set: 7.946, cycles per word: 31.200, instructions per word: 58.800
 cycles per input byte 0.49 instructions per input byte 0.92
min:      312 cycles,      588 instructions, 	       2 branch mis.,        0 cache ref.,        0 cache mis.
avg:    388.7 cycles,    588.0 instructions, 	     4.1 branch mis.,      0.0 cache ref.,      0.0 cache mis.

simdjson_decoder_rolled:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.88, cycles per value set:  4.230, instructions per value set: 7.946, cycles per word: 31.300, instructions per word: 58.800
 cycles per input byte 0.49 instructions per input byte 0.92
min:      313 cycles,      588 instructions, 	       2 branch mis.,        0 cache ref.,        0 cache mis.
avg:    471.0 cycles,    588.1 instructions, 	     4.4 branch mis.,      0.0 cache ref.,      0.0 cache mis.

As you can see, I get nothing close to a 30% performance gain when avoiding the manual unrolling. In fact, I get exactly the same results down to the instruction count.

lemire avatar Apr 16 '21 15:04 lemire

BTW here is the current code in simdjson (the basis for simdcsv):

    if (bits == 0)
        return;
    int cnt = static_cast<int>(count_ones(bits));

    // Do the first 8 all together
    for (int i=0; i<8; i++) {
      this->tail[i] = idx + trailing_zeroes(bits);
      bits = clear_lowest_bit(bits);
    }

    // Do the next 8 all together (we hope in most cases it won't happen at all
    // and the branch is easily predicted).
    if (simdjson_unlikely(cnt > 8)) {
      for (int i=8; i<16; i++) {
        this->tail[i] = idx + trailing_zeroes(bits);
        bits = clear_lowest_bit(bits);
      }

      // Most files don't have 16+ structurals per block, so we take several basically guaranteed
      // branch mispredictions here. 16+ structurals per block means either punctuation ({} [] , :)
      // or the start of a value ("abc" true 123) every four characters.
      if (simdjson_unlikely(cnt > 16)) {
        int i = 16;
        do {
          this->tail[i] = idx + trailing_zeroes(bits);
          bits = clear_lowest_bit(bits);
          i++;
        } while (i < cnt);
      }
    }

    this->tail += cnt;
  }
};

lemire avatar Apr 16 '21 15:04 lemire

Macros tend to harm readability and maintainability as you know.

depends on your choice of macros :) this particular style indeed doesn't match the neighbourhood, it's a lazy copypaste from the C port, which is 20 lines of code total. i have no problem rewriting those two loops the loopy way, will do that later.

Effectively, you are saying that you are getting better results without the manual unrolling. That's interesting.

correct, i believe that's what i'm saying.

gcc8

sorry, i don't have access to legacy compiler tech. compilers do get better!

30% performance gain when avoiding the manual unrolling

affirmative. if you saw the disasm listings included in the PR, the gist of it is this:

  1. clang12 and gcc10 auto-unroll each 8-step loop to this:
1000044c4: 41 89 c2                     movl %eax, %r10d
1000044c7: 31 f6                        xorl %esi, %esi
1000044c9: f3 48 0f bc f1               tzcntq %rcx, %rsi
1000044ce: 01 d6                        addl %edx, %esi
1000044d0: 42 89 34 97                  movl %esi, (%rdi,%r10,4)
1000044d4: c4 e2 f0 f3 c9               blsrq %rcx, %rcx
1000044d9: 31 f6                        xorl %esi, %esi
1000044db: f3 48 0f bc f1               tzcntq %rcx, %rsi
1000044e0: 01 d6                        addl %edx, %esi
1000044e2: 42 89 74 97 04               movl %esi, 4(%rdi,%r10,4)
1000044e7: c4 e2 f0 f3 c9               blsrq %rcx, %rcx
1000044ec: 31 f6                        xorl %esi, %esi
1000044ee: f3 48 0f bc f1               tzcntq %rcx, %rsi
1000044f3: 01 d6                        addl %edx, %esi
1000044f5: 42 89 74 97 08               movl %esi, 8(%rdi,%r10,4)
1000044fa: c4 e2 f0 f3 c9               blsrq %rcx, %rcx
1000044ff: 31 f6                        xorl %esi, %esi
100004501: f3 48 0f bc f1               tzcntq %rcx, %rsi
100004506: 01 d6                        addl %edx, %esi
100004508: 42 89 74 97 0c               movl %esi, 12(%rdi,%r10,4)
  2. whereas manual unroll heroics compile to this:
1000044c2: 44 89 14 b7                  movl %r10d, (%rdi,%rsi,4)
1000044c6: c4 e2 a8 f3 c9               blsrq %rcx, %r10
1000044cb: 31 f6                        xorl %esi, %esi
1000044cd: f3 49 0f bc f2               tzcntq %r10, %rsi
1000044d2: 01 d6                        addl %edx, %esi
1000044d4: 8d 48 01                     leal 1(%rax), %ecx           //!< here be dragons
1000044d7: 89 34 8f                     movl %esi, (%rdi,%rcx,4)
1000044da: c4 c2 a8 f3 ca               blsrq %r10, %r10
1000044df: 31 f6                        xorl %esi, %esi
1000044e1: f3 49 0f bc f2               tzcntq %r10, %rsi
1000044e6: 01 d6                        addl %edx, %esi
1000044e8: 8d 48 02                     leal 2(%rax), %ecx           //!< more here
1000044eb: 89 34 8f                     movl %esi, (%rdi,%rcx,4)
1000044ee: c4 c2 a8 f3 ca               blsrq %r10, %r10
1000044f3: 31 f6                        xorl %esi, %esi
1000044f5: f3 49 0f bc f2               tzcntq %r10, %rsi
1000044fa: 01 d6                        addl %edx, %esi
1000044fc: 8d 48 03                     leal 3(%rax), %ecx          //!< ad infinitum

i'm not sure where exactly the useless lea comes from, and tbh i don't want to know - sometimes it's ok to let the compiler do the right thing.

hope this helps k.

Sent from my Imperial Star Destroyer

kelas avatar Apr 16 '21 15:04 kelas

yup, from the looks of it the simdjson version would get auto-unrolled like a boss. blue skies.

but yeah - i think it is the first time i've hit a major performance degradation from an otherwise perfectly braindead manually unrolled loop.

them compiler writers know their thing, eh.


-- k.

Sent from my Imperial Star Destroyer

kelas avatar Apr 16 '21 16:04 kelas

i don't have access to legacy compiler tech.

If you follow... https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2019/05/03

You will notice that I offer Docker configurations and scripts allowing you to run it with the compilers I had used back in 2019.

Numbers with GCC 10 and Zen 2:

simdjson_decoder:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.82, cycles per value set:  4.392, instructions per value set: 8.000, cycles per word: 32.500, instructions per word: 59.200
 cycles per input byte 0.51 instructions per input byte 0.93
min:      325 cycles,      592 instructions, 	       2 branch mis.,        0 cache ref.,        0 cache mis.
avg:    370.2 cycles,    592.0 instructions, 	     3.8 branch mis.,      0.0 cache ref.,      0.0 cache mis.

simdjson_decoder_rolled:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.84, cycles per value set:  4.338, instructions per value set: 8.000, cycles per word: 32.100, instructions per word: 59.200
 cycles per input byte 0.50 instructions per input byte 0.93
min:      321 cycles,      592 instructions, 	       2 branch mis.,        0 cache ref.,        0 cache mis.
avg:    364.8 cycles,    592.0 instructions, 	     3.8 branch mis.,      0.0 cache ref.,      0.0 cache mis.

Numbers with Zen 2 and LLVM clang 9:

simdjson_decoder:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.72, cycles per value set:  4.811, instructions per value set: 8.297, cycles per word: 35.600, instructions per word: 61.400
 cycles per input byte 0.56 instructions per input byte 0.96
min:      356 cycles,      614 instructions, 	       3 branch mis.,        0 cache ref.,        0 cache mis.
avg:    414.6 cycles,    614.0 instructions, 	     4.2 branch mis.,      0.0 cache ref.,      0.0 cache mis.

simdjson_decoder_rolled:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.69, cycles per value set:  4.905, instructions per value set: 8.297, cycles per word: 36.300, instructions per word: 61.400
 cycles per input byte 0.57 instructions per input byte 0.96
min:      363 cycles,      614 instructions, 	       3 branch mis.,        0 cache ref.,        0 cache mis.
avg:    425.0 cycles,    614.0 instructions, 	     4.1 branch mis.,      0.0 cache ref.,      0.0 cache mis.

GCC 10 is current. GCC 8 came out in 2018.

30% performance gain when avoiding the manual unrolling (...) affirmative.

I submit to you that if the running time is 30% higher then you have discovered a bug with LLVM and you should report it to LLVM.

lemire avatar Apr 16 '21 17:04 lemire

Here my results with LLVM clang 13 on Zen 2:

just scanning:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 0.47, cycles per value set:  1.622, instructions per value set: 0.770, cycles per word: 12.000, instructions per word: 5.700
 cycles per input byte 0.19 instructions per input byte 0.09
min:      120 cycles,       57 instructions, 	       3 branch mis.,        0 cache ref.,        0 cache mis.
avg:    142.1 cycles,     57.0 instructions, 	     3.0 branch mis.,      0.0 cache ref.,      0.0 cache mis.

simdjson_decoder:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.57, cycles per value set:  5.068, instructions per value set: 7.932, cycles per word: 37.500, instructions per word: 58.700
 cycles per input byte 0.59 instructions per input byte 0.92
min:      375 cycles,      587 instructions, 	       3 branch mis.,        0 cache ref.,        0 cache mis.
avg:    428.4 cycles,    587.0 instructions, 	     5.2 branch mis.,      0.0 cache ref.,      0.0 cache mis.

simdjson_decoder_rolled:
matches = 74 words = 10 1-bit density 11.562 % 1-bit per word 7.400 bytes per index = 8.649
instructions per cycle 1.62, cycles per value set:  4.892, instructions per value set: 7.932, cycles per word: 36.200, instructions per word: 58.700
 cycles per input byte 0.57 instructions per input byte 0.92
min:      382 cycles,      587 instructions, 	       3 branch mis.,        0 cache ref.,        0 cache mis.
avg:    425.8 cycles,    587.0 instructions, 	     5.0 branch mis.,      0.0 cache ref.,      0.0 cache mis.

So I get the same results (exactly the same number of instructions) with both functions.

You can use Dockerfile.llvmlatest to reproduce the results.

lemire avatar Apr 16 '21 17:04 lemire

You can use Dockerfile.llvmlatest to reproduce the results.

@lemire - do you mean this Dockerfile.llvmlatest?

abalkin avatar Apr 16 '21 18:04 abalkin

You can use Dockerfile.llvmlatest

do you mean this Dockerfile.llvmlatest

Yes.

lemire avatar Apr 16 '21 18:04 lemire

It appears that @kelas rediscovered an optimization that was applied to simdjson a few years ago. I still find it puzzling that an auto-unrolled loop results in better code than its repeated body.

abalkin avatar Apr 16 '21 20:04 abalkin

time 30% higher

it actually seems to fluctuate between 10% and 25%, i guess depending on the cost of the lea instruction on a given microarch.

then you have discovered a bug with LLVM and you should report it to LLVM.

this is a bold statement, Daniel. both gcc10 and clang12 (-O3) auto-unroll these two loops without extra crud, and work faster. this is a double-blind validated result. sorry.

i therefore submit to you that in this day and age, manual loop wizardry from the language userland needs to be carefully profiled and validated. in this particular case, two pages of copy-pasted static_cast didn't do much good.

(it also results in a style that looks a bit watery to some, but tastes differ. some people hate macros.)

it's also reassuring to hear that simdjson was relieved of this optimisation at some point. great project by the way, keep up the good work.

k.

kelas avatar Apr 16 '21 20:04 kelas

According to @jkeiser, de-unrolling did not affect performance and was applied for purely aesthetic reasons. See the original Simplify and optimize flatten_bits patch.

abalkin avatar Apr 16 '21 20:04 abalkin

purely aesthetic reasons.

thanks for digging that out, alex! i second that; i originally de-unrolled both loops in the C port just because i didn't want to have zzzzzzzz in my code, but then i profiled it, and i was in for a little surprise.

kelas avatar Apr 16 '21 20:04 kelas

@abalkin

Most compilers at the highest level of optimization will unroll short simple loops... short is about 16 iterations. Note that this is sensitive to the optimization flags.

If you compile with -Os or -O2, then you will not get unrolling (GCC will not unroll at all at -O2). Both the -Os and -O2 flags are fairly popular and conventional.

If you really want unrolling, you better do it yourself in the code.

Note here that if you look at the code, even with the short loops from 0 to 8, it is still effectively unrolled. Truly not unrolling would look like this...

       int i = 0;
       do {
          this->tail[i] = idx + trailing_zeroes(bits);
          bits = clear_lowest_bit(bits);
          i++;
        } while (i < cnt);

Empirically, it proves beneficial within simdjson to do the song and dance that you see.
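
As an aside (not something this PR proposes), if you want the source to stay rolled but still get unrolling under -O2 or -Os, the compilers accept per-loop hints. A minimal sketch, assuming GCC 8+ or clang; the body is just a stand-in for the flatten_bits step:

#include <cstdint>

void flatten8(uint32_t *out, uint32_t base, uint32_t idx, uint64_t bits) {
#if defined(__clang__)
#pragma clang loop unroll_count(8)
#elif defined(__GNUC__)
#pragma GCC unroll 8
#endif
  for (int i = 0; i < 8; i++) {
    out[base + i] = idx + (bits ? (uint32_t)__builtin_ctzll(bits) : 64);
    bits &= bits - 1;   // clear lowest set bit
  }
}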

It appears that @kelas rediscovered an optimization that has been applied to simdjson a few years ago

I do not recall it this way. It was not an optimization. Rather the author of this contribution noticed that when building release builds with CMake, we got the needed unrolling. There was never any claim that it improved the performance.

lemire avatar Apr 16 '21 21:04 lemire

If you really want unrolling, you better do it yourself in the code.

this little hiccup at hand provides ample evidence that it just ain't so. a human getting in the way of a modern optimising C/C++ compiler has every chance to end up being the bottleneck.

I do not recall it this way. It was not an optimization

still, the jumbo commit description claims a 15% improvement in stage 1, which i trust is some kind of early tokenization pass. hard to know what transpired, and which compilers were at play.

There was never any claim that it improved the performance.

there is one now.

kelas avatar Apr 16 '21 21:04 kelas

@kelas

this is a bold statement, Daniel

I disagree. I carefully provided multiple reproducible tests, using multiple compilers and more than one processor, using CPU performance counters, showing no difference.

I am getting perfectly comparable binaries (as measured by performance counters) and the assembly outputs are what I am expecting.

You are reporting that both GNU GCC and LLVM clang produce anomalous code that harms the performance. I submit to you that this should be reported.

Here is the clang 12 assembly produced with the unrolled code:

flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long): # @flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long)
  test rcx, rcx
  je .LBB0_6
  popcnt r9, rcx
  mov r10d, dword ptr [rsi]
  lea r8d, [r10 + r9]
  tzcnt rax, rcx
  add eax, edx
  mov dword ptr [rdi + 4*r10], eax
  blsr r10, rcx
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  inc eax
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 2
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 3
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 4
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 5
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 6
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor eax, eax
  tzcnt rax, r10
  add eax, edx
  mov ecx, dword ptr [rsi]
  add ecx, 7
  mov dword ptr [rdi + 4*rcx], eax
  cmp r9d, 9
  jb .LBB0_5
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 8
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 9
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 10
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 11
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 12
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 13
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 14
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor eax, eax
  tzcnt rax, r10
  add eax, edx
  mov ecx, dword ptr [rsi]
  add ecx, 15
  mov dword ptr [rdi + 4*rcx], eax
  cmp r9d, 17
  jb .LBB0_5
  blsr rax, r10
  mov ecx, dword ptr [rsi]
  add ecx, 16
  mov dword ptr [rsi], ecx
.LBB0_4: # =>This Inner Loop Header: Depth=1
  xor r9d, r9d
  tzcnt r9, rax
  add r9d, edx
  mov ecx, ecx
  mov dword ptr [rdi + 4*rcx], r9d
  mov ecx, dword ptr [rsi]
  inc ecx
  blsr rax, rax
  mov dword ptr [rsi], ecx
  jne .LBB0_4
.LBB0_5:
  mov dword ptr [rsi], r8d
.LBB0_6:
  ret

The core sequence is something like this...

  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 8
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx

Here is the GCC 10 code:

flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long):
  test rcx, rcx
  je .L12
  xor r10d, r10d
  mov eax, DWORD PTR [rsi]
  xor r8d, r8d
  tzcnt r10, rcx
  popcnt r8, rcx
  add r10d, edx
  lea r9d, [rax+r8]
  mov DWORD PTR [rdi+rax*4], r10d
  blsr rax, rcx
  mov r10d, DWORD PTR [rsi]
  xor ecx, ecx
  tzcnt rcx, rax
  blsr rax, rax
  inc r10d
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 2
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 3
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 4
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 5
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 6
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  add r10d, 7
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  cmp r8d, 8
  jle .L4
  mov r10d, DWORD PTR [rsi]
  xor ecx, ecx
  blsr rax, rax
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 8
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 9
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 10
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 11
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 12
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 13
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 14
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 15
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  cmp r8d, 16
  jle .L4
  mov ecx, DWORD PTR [rsi]
  add ecx, 16
  mov DWORD PTR [rsi], ecx
.L5:
  xor r8d, r8d
  mov ecx, ecx
  tzcnt r8, rax
  add r8d, edx
  mov DWORD PTR [rdi+rcx*4], r8d
  mov ecx, DWORD PTR [rsi]
  inc ecx
  blsr rax, rax
  mov DWORD PTR [rsi], ecx
  jne .L5
.L4:
  mov DWORD PTR [rsi], r9d
  ret
.L12:
  ret

Both compiled with -O3.

Link: https://godbolt.org/z/5sn1WEMen

Now here is the 'rolled version' (same compiler setting). Clang.

flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long): # @flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long)
  test rcx, rcx
  je .LBB0_6
  popcnt r8, rcx
  mov r9d, dword ptr [rsi]
  tzcnt rax, rcx
  add eax, edx
  mov dword ptr [rdi + 4*r9], eax
  blsr r10, rcx
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  inc eax
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 2
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 3
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 4
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 5
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 6
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor eax, eax
  tzcnt rax, r10
  add eax, edx
  mov ecx, dword ptr [rsi]
  add ecx, 7
  mov dword ptr [rdi + 4*rcx], eax
  add r9d, r8d
  cmp r8d, 9
  jb .LBB0_5
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 8
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 9
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 10
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 11
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 12
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 13
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 14
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx
  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 15
  mov dword ptr [rdi + 4*rax], ecx
  cmp r8d, 17
  jb .LBB0_5
  blsr rax, r10
  mov ecx, dword ptr [rsi]
  add ecx, 16
  mov dword ptr [rsi], ecx
.LBB0_4: # =>This Inner Loop Header: Depth=1
  xor r8d, r8d
  tzcnt r8, rax
  add r8d, edx
  mov ecx, ecx
  mov dword ptr [rdi + 4*rcx], r8d
  mov ecx, dword ptr [rsi]
  inc ecx
  blsr rax, rax
  mov dword ptr [rsi], ecx
  jne .LBB0_4
.LBB0_5:
  mov dword ptr [rsi], r9d
.LBB0_6:
  ret

The core sequence is

  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  inc eax
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx

It is distinct from what we got in the unrolled case, but barely; recall what we had...

  tzcnt rcx, r10
  add ecx, edx
  mov eax, dword ptr [rsi]
  add eax, 8
  mov dword ptr [rdi + 4*rax], ecx
  blsr r10, r10
  xor ecx, ecx

Importantly, the instruction count is the same.

And then GCC

flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long):
  test rcx, rcx
  je .L12
  xor r10d, r10d
  mov eax, DWORD PTR [rsi]
  xor r8d, r8d
  tzcnt r10, rcx
  popcnt r8, rcx
  add r10d, edx
  lea r9d, [rax+r8]
  mov DWORD PTR [rdi+rax*4], r10d
  blsr rax, rcx
  mov r10d, DWORD PTR [rsi]
  xor ecx, ecx
  tzcnt rcx, rax
  blsr rax, rax
  inc r10d
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 2
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 3
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 4
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 5
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 6
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 7
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  cmp r8d, 8
  jle .L4
  mov r10d, DWORD PTR [rsi]
  xor ecx, ecx
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 8
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 9
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 10
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 11
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 12
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 13
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  add r10d, 14
  add ecx, edx
  blsr rax, rax
  mov DWORD PTR [rdi+r10*4], ecx
  xor ecx, ecx
  mov r10d, DWORD PTR [rsi]
  tzcnt rcx, rax
  blsr rax, rax
  add r10d, 15
  add ecx, edx
  mov DWORD PTR [rdi+r10*4], ecx
  cmp r8d, 16
  jle .L4
  mov ecx, DWORD PTR [rsi]
  add ecx, 16
  mov DWORD PTR [rsi], ecx
.L5:
  xor r8d, r8d
  mov ecx, ecx
  tzcnt r8, rax
  add r8d, edx
  mov DWORD PTR [rdi+rcx*4], r8d
  mov ecx, DWORD PTR [rsi]
  inc ecx
  blsr rax, rax
  mov DWORD PTR [rsi], ecx
  jne .L5
.L4:
  mov DWORD PTR [rsi], r9d
  ret
.L12:
  ret

If you switch to -O2 with GCC and the rolled code, you get this:

flatten_bits(unsigned int*, unsigned int&, unsigned int, unsigned long):
  mov r8d, edx
  test rcx, rcx
  je .L16
  mov r9d, DWORD PTR [rsi]
  xor r11d, r11d
  xor eax, eax
  popcnt r11, rcx
  lea r10d, [r9+r11]
  jmp .L4
.L18:
  mov r9d, DWORD PTR [rsi]
.L4:
  xor edx, edx
  add r9d, eax
  inc eax
  tzcnt rdx, rcx
  blsr rcx, rcx
  add edx, r8d
  mov DWORD PTR [rdi+r9*4], edx
  cmp eax, 8
  jne .L18
  cmp r11d, 8
  jle .L6
.L5:
  xor edx, edx
  mov r9d, eax
  inc eax
  add r9d, DWORD PTR [rsi]
  tzcnt rdx, rcx
  blsr rcx, rcx
  add edx, r8d
  mov DWORD PTR [rdi+r9*4], edx
  cmp eax, 16
  jne .L5
  cmp r11d, 16
  jle .L6
  mov eax, DWORD PTR [rsi]
  add eax, 16
  mov DWORD PTR [rsi], eax
.L7:
  xor edx, edx
  mov eax, eax
  tzcnt rdx, rcx
  add edx, r8d
  mov DWORD PTR [rdi+rax*4], edx
  mov eax, DWORD PTR [rsi]
  inc eax
  blsr rcx, rcx
  mov DWORD PTR [rsi], eax
  jne .L7
.L6:
  mov DWORD PTR [rsi], r10d
  ret
.L16:
  ret

Link

lemire avatar Apr 16 '21 21:04 lemire

@kelas

there is one now.

Yes. You are making this claim.

I find the evidence lacking.

It does not mean that your claim is not interesting. I'd love to document such a case myself. I love surprising things.

Can you provide me with a reproducible test case, instrumented with instruction counts? Because here you are claiming that one function generates far more instructions than the other. These things are easy to quantify very precisely.

The only source of noise in these cases has to do with the branch predictor. But the function is written to have relatively few branches. Just make sure that your input has a predictable number of 1s per word and you will tame the predictions.

Other than that, the instruction count can be known exactly with no room for variations.

We can just count.

There is no room for debate here. Provide the reproducible test case. You get to pick your compiler: anything that runs under Docker is fair game.
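
For what it is worth, here is a sketch of the kind of deterministic input I have in mind (hypothetical helper, not part of my harness): give every word the same number of evenly spaced 1s, so the cnt comparisons are perfectly predicted and the instruction counts can be compared directly.

#include <cstddef>
#include <cstdint>
#include <vector>

// every word gets ones_per_word set bits at a fixed stride (assumes 1..64),
// so a flatten_bits-style decoder sees the same cnt on every call and
// branch-miss noise disappears.
std::vector<uint64_t> make_words(std::size_t n, int ones_per_word) {
  std::vector<uint64_t> words(n, 0);
  int stride = 64 / ones_per_word;
  for (auto &w : words)
    for (int k = 0; k < ones_per_word; k++)
      w |= uint64_t(1) << (k * stride);
  return words;
}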

lemire avatar Apr 16 '21 22:04 lemire

Here is a demonstration, based on @kelas's code, of clang seemingly inserting an extra instruction in manually unrolled code:

https://godbolt.org/z/r75P1GzM8

Unroll:

        bsf     r10, rcx
        add     r10d, edx
        mov     esi, eax       ; <------ here
        mov     dword ptr [rdi + 4*rsi], r10d
        lea     rsi, [rcx - 1]
        and     rsi, rcx

Loop:

        bsf     rsi, rcx
        add     esi, edx
        mov     dword ptr [rdi + 4*r9], esi
        lea     rsi, [rcx - 1]
        and     rsi, rcx

Note that I see an extra mov rather than lea in my demo.

abalkin avatar Apr 16 '21 22:04 abalkin

Here is an even simpler demo with an extra lea: https://godbolt.org/z/qbd8nxEYh

abalkin avatar Apr 16 '21 23:04 abalkin

@abalkin The simdcsv library will not build on anything but NEON or AVX2 targets. Please compile either for NEON or AVX2 targets (e.g., add -march=haswell or use an ARM 64 target).

Even if you do so (change the target), you will still see the extra LEA.

But here is what the current simdcsv code compiles to...

https://godbolt.org/z/hjf8cTqGb

There is no LEA as you can see.

lemire avatar Apr 16 '21 23:04 lemire

haswell

ok

evidence lacking

here's a bare-minimum example of the exact lea condition on the haswell microarch: it manifests in the hand-unrolled version and doesn't when unrolling is left to the compiler, all other things being equal (@lemire my apologies for the somewhat opinionated coding style, but it seems the more succinct way to convey the point):

https://godbolt.org/z/GfEb4536r

kelas avatar Apr 16 '21 23:04 kelas

Adding -march=haswell does not seem to make a difference.

https://godbolt.org/z/PEshGKGes

But as far as I know, @kelas did not see the extra lea generated by your code, only by what he thought was equivalent code that he wrote. I can now demonstrate that clang indeed seems to insert extra mov or lea instructions in manually unrolled code compared to an auto-unrolled loop.

abalkin avatar Apr 16 '21 23:04 abalkin

@kelas - this is an interesting discussion, but it doesn't seem to be pertinent to this PR.

I have compiled simdcsv with and without your changes and cannot see any difference in the size of the binary:

$ ls -l simdcsv*
-rwxr-xr-x  1 a  staff  54320 Apr 16 19:36 simdcsv
-rwxr-xr-x  1 a  staff  54320 Apr 16 19:35 simdcsv.a9494a

Nevertheless, I do see some increase in throughput:

$ ./simdcsv -i1000 examples/nfl.csv
 GB/s: 6.38095
$ ./simdcsv.a9494a -i1000 examples/nfl.csv
 GB/s: 5.93843

I have not tried to instrument these binaries as @lemire suggested or investigate the phenomenon any further, but apparently what you see in your code is not what makes this PR improve throughput in my environment.

abalkin avatar Apr 16 '21 23:04 abalkin

same obj size, better performance

coincidence? :) this is getting a bit into C++ voodoo - i'm not an expert, not my area, not very fascinating tbh. what i know for sure is that my C port is faster than this implementation. objsize you're welcome to check on your own, but don't hold your breath - C is C.

kelas avatar Apr 16 '21 23:04 kelas

@kelas @abalkin

What happens if you change SIMDCSV_BUFFERSIZE ? Note how it was doubled in this PR. I remember setting SIMDCSV_BUFFERSIZE "approximately".


Regarding the LEA, we can sum it up as follows. If you look at the code, we are comparing...

        mov     DWORD PTR [rdi+12+r8*4], ecx

with

        lea     ecx, [rsi + 3]
        mov     dword ptr [rdi + 4*rcx], r9d

It is interesting.

lemire avatar Apr 17 '21 00:04 lemire

I am querying twitter to see if someone has an idea https://twitter.com/lemire/status/1383210384321036289

lemire avatar Apr 17 '21 00:04 lemire

What happens if you change SIMDCSV_BUFFERSIZE ?

Does not seem to make a difference, but I now noticed that even with -i1000 the throughput measure is somewhat unstable

 $ for i in `seq 10`; do ./simdcsv -i1000 examples/nfl.csv; done
 GB/s: 6.26033
 GB/s: 6.62941
 GB/s: 5.83218
 GB/s: 5.93164
 GB/s: 6.43911
 GB/s: 6.43891
 GB/s: 6.13735
 GB/s: 6.32976
 GB/s: 6.25318
 GB/s: 6.60594

and the improvement (if any) is not as evident as I thought:

$ for i in `seq 10`; do ./simdcsv.a9494a -i1000 examples/nfl.csv; done
 GB/s: 6.8027
 GB/s: 6.81799
 GB/s: 6.03031
 GB/s: 6.51562
 GB/s: 6.20322
 GB/s: 6.39817
 GB/s: 5.96236
 GB/s: 6.84384
 GB/s: 6.88059
 GB/s: 6.86554

abalkin avatar Apr 17 '21 00:04 abalkin

Here are my results on a Zen 2 machine with GNU GCC 10.

Main branch:

$ for i in `seq 10`; do ./build/simdcsv -i1000 examples/nfl.csv |grep GB; done
 GB/s: 4.73389
 GB/s: 4.71476
 GB/s: 4.71867
 GB/s: 4.77128
 GB/s: 4.7712
 GB/s: 4.73851
 GB/s: 4.69975
 GB/s: 4.75183
 GB/s: 4.77325
 GB/s: 4.74215

This PR

$ for i in `seq 10`; do ./tmp/simdcsv/build/simdcsv -i1000 examples/nfl.csv |grep GB; done
 GB/s: 4.66962
 GB/s: 4.67889
 GB/s: 4.61657
 GB/s: 4.66997
 GB/s: 4.67997
 GB/s: 4.68536
 GB/s: 4.687
 GB/s: 4.68151
 GB/s: 4.6803
 GB/s: 4.67374

So at least in my test, this PR is not beneficial.

Note that I am using a machine configured for testing (not a laptop).

lemire avatar Apr 17 '21 00:04 lemire

@kelas @abalkin

What happens if you change SIMDCSV_BUFFERSIZE ? Note how it was doubled in this PR. I remember setting SIMDCSV_BUFFERSIZE "approximately".

yes, back in 2019 you identified the sweet spot to be four buffering iterations for better instr/sec. i bumped it to 8 (also empirically, it seems to perform marginally better on the iron i have)

Regarding the LEA, we can sum it up as follows. If you look at the code, we are comparing... It is interesting.

true! i think i see what it is. in my last snippet, the loopy version can be degraded to lea pollution by re-declaring the loop counter as unsigned int (UI in my parlance) instead of U (unsigned long long). i assume the busy-looking static_cast code escapes this exact caveat in the C++ version, but i find leaving it to the compiler to be a more defensive and fool-proof tactic.

that said, profiling this kind of critical codepath is essential in any event - and the less code there is to shovel around, the easier it gets to identify a potential imperfection.
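
to make the point concrete, a minimal sketch (hypothetical functions, not the PR code): with a 32-bit counter the sum base + i has 32-bit wrap-around semantics, so the compiler typically has to form and zero-extend the index before each store - that's the extra lea/mov - while a 64-bit counter folds straight into the addressing mode.

#include <cstdint>

static inline uint32_t tz(uint64_t x) { return x ? (uint32_t)__builtin_ctzll(x) : 64; }

// 32-bit index: base + i wraps mod 2^32, so expect an extra lea/mov per store
// to materialize and zero-extend the index.
void fill32(uint32_t *out, uint32_t base, uint32_t idx, uint64_t bits) {
  for (uint32_t i = 0; i < 8; i++) {
    out[base + i] = idx + tz(bits);
    bits &= bits - 1;
  }
}

// 64-bit index: out + 4*(base + i) folds into a single addressing mode,
// e.g. mov [rdi + 4*rbase + disp], with no extra index arithmetic.
void fill64(uint32_t *out, uint64_t base, uint32_t idx, uint64_t bits) {
  for (uint64_t i = 0; i < 8; i++) {
    out[base + i] = idx + tz(bits);
    bits &= bits - 1;
  }
}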

kelas avatar Apr 17 '21 00:04 kelas

@kelas

i assume the busy-looking static_cast code escapes this exact caveat

I am not sure that the static_cast<uint32_t>(idx) does anything at all. It looks like a junk cast to me.
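
(Given the signature quoted earlier in the thread, idx is already a uint32_t, so the cast is an identity conversion; a two-line illustration, with trailingzeroes as a stand-in:)

#include <cstdint>

static inline uint32_t trailingzeroes(uint64_t x) { return x ? (uint32_t)__builtin_ctzll(x) : 64; }

// idx is already uint32_t, so both statements compile to the same thing.
void store(uint32_t *base_ptr, uint32_t base, uint32_t idx, uint64_t bits) {
  base_ptr[base] = static_cast<uint32_t>(idx) + trailingzeroes(bits); // with the cast
  base_ptr[base] = idx + trailingzeroes(bits);                        // without
}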

lemire avatar Apr 17 '21 00:04 lemire