RFC: Generic Fletcher4 vector implementation
Motivation and Context
I was nerd-sniped by the review process for #14219, which led me to attempt to improve our fletcher4 code. My initial attempt was a beautiful algorithm, but it was not performant, so I decided to reimplement Intel's algorithm in generic GNU C vector code. This allows multiple architectures to have vectorized fletcher4 code, although in practice only PPC64 is expected to see a benefit from this since it does not already have vectorized fletcher4 code. Beyond improving performance on PPC64, this serves as a proof of concept for the idea, since the AVX2 code generated for x86/x86_64 systems is functionally identical to the handwritten AVX2 code we have now. Performance is effectively identical for the native case:
# cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 1356713338 54571347430
implementation native byteswap
scalar 9807681425 8305936247
superscalar 12644986237 12713922615
superscalar4 13007328108 12526754559
sse2 24056322356 12412353845
ssse3 23744546669 20716336171
avx2 40177711703 38436973064
generic-avx2 42662563525 9563983290
fastest generic-avx2 avx2
For the byteswap case, the problem is that GCC 11.3.0 does not generate optimal assembly. However, GCC 12.2.1_p20221008 does and is expected to have identical performance. From the numbers, it appears that the generic-avx2 code is faster in the native case here, but in reality that is just normal run-to-run variation. In some runs, the existing avx2 implementation is faster. In others, the generic-avx2 implementation is faster. The two are effectively equal. This was confirmed by disassembly. The disassembly of the critical loop for the native case in generic-avx2 from GCC 12.2.1_p20221008 is:
198: c4 e2 7d 35 26 vpmovzxdq (%rsi),%ymm4
19d: c5 ed d4 d4 vpaddq %ymm4,%ymm2,%ymm2
1a1: 48 83 c6 10 add $0x10,%rsi
1a5: c5 f5 d4 ca vpaddq %ymm2,%ymm1,%ymm1
1a9: c5 fd d4 c1 vpaddq %ymm1,%ymm0,%ymm0
1ad: c5 e5 d4 d8 vpaddq %ymm0,%ymm3,%ymm3
1b1: 48 39 d6 cmp %rdx,%rsi
1b4: 72 e2 jb 198 <fletcher_4_generic_native_impl+0x28>
For the handwritten Intel assembly, we have:
328: c4 e2 7d 35 23 vpmovzxdq (%rbx),%ymm4
32d: c5 fd d4 c4 vpaddq %ymm4,%ymm0,%ymm0
331: c5 f5 d4 c8 vpaddq %ymm0,%ymm1,%ymm1
335: c5 ed d4 d1 vpaddq %ymm1,%ymm2,%ymm2
339: c5 e5 d4 da vpaddq %ymm2,%ymm3,%ymm3
33d: 48 83 c3 10 add $0x10,%rbx
341: 48 39 eb cmp %rbp,%rbx
344: 72 e2 jb 328 <fletcher_4_avx2_native+0x38>
Similarly, this is what generic-avx2 generates for the byteswap code with GCC 12.2.1_p20221008:
260: c4 e2 7d 35 06 vpmovzxdq (%rsi),%ymm0
265: c4 e2 7d 00 c5 vpshufb %ymm5,%ymm0,%ymm0
26a: 48 83 c6 10 add $0x10,%rsi
26e: c5 e5 d4 d8 vpaddq %ymm0,%ymm3,%ymm3
272: c5 ed d4 d3 vpaddq %ymm3,%ymm2,%ymm2
276: c5 f5 d4 ca vpaddq %ymm2,%ymm1,%ymm1
27a: c5 dd d4 e1 vpaddq %ymm1,%ymm4,%ymm4
27e: 48 39 d6 cmp %rdx,%rsi
281: 72 dd jb 260 <fletcher_4_generic_byteswap_impl+0x30>
And this is what the handwritten version generates:
2a0: c4 e2 7d 35 23 vpmovzxdq (%rbx),%ymm4
2a5: c4 e2 5d 00 e5 vpshufb %ymm5,%ymm4,%ymm4
2aa: c5 fd d4 c4 vpaddq %ymm4,%ymm0,%ymm0
2ae: c5 f5 d4 c8 vpaddq %ymm0,%ymm1,%ymm1
2b2: c5 ed d4 d1 vpaddq %ymm1,%ymm2,%ymm2
2b6: c5 e5 d4 da vpaddq %ymm2,%ymm3,%ymm3
2ba: 48 83 c3 10 add $0x10,%rbx
2be: 48 39 eb cmp %rbp,%rbx
2c1: 72 dd jb 2a0 <fletcher_4_avx2_byteswap+0x40>
Clang outputs the same assembly for generic-avx2 and presumably also for the existing avx2 code, although the latter was not checked.
For completeness, here is the gorgeous VSX assembly produced by Clang for the native case on ppc64:
.LBB0_2: # =>This Inner Loop Header: Depth=1
lxvw4x 1, 0, 4
addi 4, 4, 16
xxmrglw 40, 0, 1
xxmrghw 41, 0, 1
vaddudm 3, 3, 8
vaddudm 2, 2, 9
vaddudm 1, 3, 1
vaddudm 0, 2, 0
vaddudm 5, 1, 5
vaddudm 4, 0, 4
vaddudm 7, 5, 7
vaddudm 6, 4, 6
bdnz .LBB0_2
Also, here is the Clang produced assembly for the native case on ppc64le:
.LBB0_2: # =>This Inner Loop Header: Depth=1
lxvd2x 1, 0, 4
addi 4, 4, 16
xxswapd 1, 1
xxmrghw 40, 0, 1
xxmrglw 41, 0, 1
vaddudm 7, 7, 8
vaddudm 6, 6, 9
vaddudm 1, 7, 1
vaddudm 5, 6, 5
vaddudm 0, 1, 0
vaddudm 4, 5, 4
vaddudm 3, 0, 3
vaddudm 2, 4, 2
bdnz .LBB0_2
VSX is only capable of doing two 64-bit additions in parallel. I do not understand PPC64 assembly very well, but the ppc64 assembly looks optimal to me while the ppc64le assembly is near optimal. The xxswapd instruction seems unnecessary to me, but I assume that future versions of LLVM will fix that. I will refrain from posting the GCC assembly, since it is fairly bad.
ARM support was initially attempted, but the way that GCC and Clang allow ISA extensions to be turned on/off made it difficult, so it was abandoned. Specifically, on the architectures where this currently generates SIMD instructions in the kernel, we opt out of ISA extensions via compiler flags and can opt back into them on individual functions via pragmas. On ARM, there is an opt-out via -mgeneral-regs-only, but there is no corresponding way to opt back in via pragmas. One idea was to disable that flag via CFLAGS_REMOVE and then set it on all of the non-accelerated functions, but it did not work out well in practice on GCC, and at that point I decided it was not worth trying to continue.
Rewriting this code in FORTRAN was briefly considered as a way to make ARM support easier. Unfortunately, with the exception of a non-standard extension in the Sun Studio FORTRAN compiler, FORTRAN does not support unsigned integer arithmetic, so the idea was dropped.
Prior to abandoning the attempt at ARM support, I did discover that Clang will compile the code into optimal looking NEON assembly, so perhaps this could be revisited in the future. Alternatively, the assembly generated by Clang could be helpful in review of #14219.
In summary, the only real beneficiaries of this are expected to be FreeBSD's PPC64 port and anyone who builds the ZFS kernel modules for Linux on PPC64 with Clang. I am not sure whether that makes it worthwhile to merge this as is. I am putting this into a pull request to see what others think, for the sake of my own curiosity.
Description
The individual patch commit messages have greater detail. However, in summary, this adds a generic vector implementation written in the GNU C vector extension. Both GCC and Clang will compile it into SIMD instructions on x86, x86_64 and ppc64 on both Linux and FreeBSD. On other architectures, GCC and Clang will generate scalar instructions that are expected to be equivalent to superscalar4. This is not compatible with MSVC, so we preemptively disable it on Windows to make things easier for the Windows port.
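To make the description concrete, here is a minimal sketch of the approach in the GNU C vector extension. The function name, the braced-initializer load and the omission of alignment/tail handling are illustrative simplifications rather than the patch's exact code, and the final recombination of the vector lanes into the scalar checksum is left out:

```c
#include <stddef.h>
#include <stdint.h>

/* Four interleaved accumulator streams, one per 64-bit vector lane. */
typedef uint64_t v4du __attribute__((vector_size(32)));

static void
fletcher_4_vector_sketch(const uint32_t *ip, size_t nwords,
    v4du *a, v4du *b, v4du *c, v4du *d)
{
	for (size_t i = 0; i < nwords; i += 4) {
		/* Widen four 32-bit words into four 64-bit lanes; ideally
		 * this becomes a single widening load (e.g. vpmovzxdq). */
		v4du t = { ip[i], ip[i + 1], ip[i + 2], ip[i + 3] };

		*a += t;	/* element-wise adds become vpaddq/vaddudm */
		*b += *a;
		*c += *b;
		*d += *c;
	}
}
```

On x86_64 with AVX2, each statement in the loop body maps to one 256-bit instruction, which is how the generated code ends up matching the handwritten AVX2 implementation shown above.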
How Has This Been Tested?
It has been tested in an x86_64 VM running Gentoo Linux. The assembly output of fletcher_4_generic_native_impl() on various architectures was examined at gcc.godbolt.org.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain Signed-off-by.
I might be wrong here, but I believe the swap is necessary if the element ordering happens to be critical. This has to do with the oddity that with VSX, all loads into vector registers are implicitly byte swapped to be big endian. Because of the multiplication occurring in the finalization of the checksum (it addresses particular vector lanes), I suspect the swap is needed for correctness.
I might be wrong here, but I believe the swap is necessary if the element ordering happens to be critical. This has to do with the oddity that with VSX, all loads into vector registers are implicitly byte swapped to be big endian. Because of the multiplication occurring in the finalization of the checksum (it addresses particular vector lanes), I suspect the swap is needed for correctness.
Prior to posting the PR, a quick search for information on xxswapd suggested that it did a permutation at the level of 32-bit components, which is unrelated to endian-ness.
That said, documentation on it was not readily available from DuckDuckGo searches, so what I read could be wrong.
Yes it seems IBM's documentation is not being indexed very well by any of the search engines. I've routinely had this problem finding info on vmx/vsx.
I'm not sure I follow what you mean by 32-bit components. If you're loading four 32-bit values to sum, the memory ordering is little endian, but the load instruction itself implicitly byte swaps the values as they are shuffled over the bus into the vector registers. The swap is supposed to move them back into proper order, I think.
Now, for something that sums every byte, you can eliminate these swaps. But I think for anything operating over values that are large enough to have endianness you need to swap. I think there's another idiom that usually follows these loads that involves an xxperm instruction.
Note: I only speculate based on documentation I've read and pieces of code I've seen to handle the bi-endianness of the architecture. I personally have only had experience with big endian vmx capable PowerPC CPUs.
I spotted the GCC PowerPC maintainer on GitHub the other day. Rather than guessing, I am just going to ping him.
@edelsohn Would you be willing to tell us if the xxswapd instruction is necessary here?
Also, as a heads up, GCC generates fairly bad code for this in comparison to Clang:
https://gcc.godbolt.org/z/bhPo9sWsx https://gcc.godbolt.org/z/4rTEe3WMG
The code generation became slightly better from setting power9-vector, but that really should not have been necessary. Is there anything on the horizon that will fix this?
My colleagues @nemanjai and @segher have more expertise about VSX.
The GCC issue appears to be a generic vector lowering problem.
GCC issue opened https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
The GCC issue appears to be a generic vector lowering problem.
GCC issue opened https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
Awesome thanks. :)
I did a bit more experimentation with this and discovered that we can go even faster than the existing AVX2 code by doing 8 parallel accumulator streams as done in the AVX512 code:
# cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 1544703784 7905022158
implementation native byteswap
scalar 9844805719 9474656265
superscalar 14186756773 12659613432
superscalar4 13727489481 12508784940
generic-avx2 49197257633 2958823554
sse2 22567594300 12265074850
ssse3 22566380108 20582510550
avx2 42224833182 38138634892
fastest generic-avx2 avx2
That brings us to around 96% of Zen 3's 51.2GB/sec memory bandwidth. This works because Zen 3 can do 4x 256-bit SIMD additions per cycle according to Agner Fog:
https://agner.org/optimize/instruction_tables.pdf
Going to 4x is unnecessary since we are already hitting the memory bandwidth limit at 2x. The more accumulator streams we use, the more complicated the recombination becomes, so when there is no reason to go wider, going wider only penalizes us in the recombination function.
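As a rough sketch of what the 8-stream variant changes relative to the sketch in the description (type names are illustrative, not the patch's identifiers):

```c
#include <stdint.h>

/* Eight interleaved accumulator streams: eight 64-bit lanes per vector. */
typedef uint64_t v8du __attribute__((vector_size(64)));

/*
 * The loop body is the same as in the 4-stream sketch, only the types
 * change.  With just AVX2 available, GCC and Clang lower each v8du
 * addition into two 256-bit vpaddq instructions, so every iteration
 * issues twice as many independent additions, at the price of a more
 * involved recombination of the eight lane sums at the end.
 */
```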
Note that we still have bad code generation by GCC 11.3.0 ruining the byteswap case. I am building GCC 12 in my VM, which should allow me to get good numbers for the byteswap case, since GCC 12 generates good code there.
The VSX instruction lxvd2x on little endian systems loads the vector in such a way that the doubleword (8-byte) halves of the vector are placed in opposite positions in the register while the byte order within each doubleword is little endian. So in order to have the register contain the correct data in the vector, we need to swap the two doublewords. That is what the xxswapd instruction accomplishes (it swaps doublewords). So on little endian, all such vector loads are followed by the swap and all vector stores are preceded by one.
The compilers have a pass that tries to eliminate those swaps if vector data is used in a lane-insensitive way (element-wise addition is certainly one such operation). However, you will notice that the code contains permute instructions which are certainly lane-sensitive which I assume is the reason we keep the swap.
Please note that the target features the pragmas provide allow the GCC version to produce code targeting the Power9 CPU whereas the Clang version is targeting the Power8 CPU. If this program is truly expected to run only on Power9 or newer CPUs, you can add power9-vector,isa-v30-instructions to the set of target features in the Clang pragma. That will use load/store instructions introduced in version 3.0 of the ISA, which do not require the swap instructions.
Note also, although there are permute instructions, the permute masks can be adjusted to account for the position of each element in the vector, and that is something we are always working on improving in Clang. So in the future, even for Power8 code generation, that xxswapd may go away.
The VSX instruction lxvd2x on little endian systems loads the vector in such a way that the doubleword (8-byte) halves of the vector are placed in opposite positions in the register while the byte order within each doubleword is little endian. So in order to have the register contain the correct data in the vector, we need to swap the two doublewords. That is what the xxswapd instruction accomplishes (it swaps doublewords). So on little endian, all such vector loads are followed by the swap and all vector stores are preceded by one. The compilers have a pass that tries to eliminate those swaps if vector data is used in a lane-insensitive way (element-wise addition is certainly one such operation). However, you will notice that the code contains permute instructions which are certainly lane-sensitive, which I assume is the reason we keep the swap.
Thank you for the explanation. The reason for the permutation is that we need to widen 32-bit words to 64-bit words for the additions.
Please note that the target features the pragmas provide allow the GCC version to produce code targeting the Power9 CPU whereas the Clang version is targeting the Power8 CPU. If this program is truly expected to run only on Power9 or newer CPUs, you can add power9-vector,isa-v30-instructions to the set of target features in the Clang pragma. That will use load/store instructions introduced in version 3.0 of the ISA, which do not require the swap instructions.
POWER8 is the true minimum. power9-vector was specified because it improved the assembly generation on GCC. It was a bad attempt at working around this GCC bug before I understood the nature of it:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
I expect to have a better attempt in the near future.
Note also, although there are permute instructions, the permute masks can be adjusted to account for the position of each element in the vector, and that is something we are always working on improving in Clang. So in the future, even for Power8 code generation, that xxswapd may go away.
I have devised and pushed a new method for doing this via explicit masks and __builtin_shufflevector() for AVX2 on Intel hardware to work around GCC's bad byteswap generation prior to GCC 12. This makes me think that when I rewrite this to work around the GCC bug for POWER, there is a chance that will be enough to make the xxswapd go away. I will be sure to check that.
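Roughly, the explicit-mask byteswap looks like the sketch below; the names are illustrative and the actual code may apply the permute to the widened vectors instead. GCC's __builtin_shuffle takes a mask vector, while Clang's equivalent __builtin_shufflevector takes constant indices:

```c
#include <stdint.h>

typedef uint8_t  v16qu __attribute__((vector_size(16)));
typedef uint32_t v4su  __attribute__((vector_size(16)));

/*
 * Byteswap each 32-bit element by permuting bytes with an explicit mask.
 * GCC turns this into a single byte shuffle (e.g. pshufb) instead of a
 * series of scalar bswap instructions.
 */
static inline v4su
bswap32_lanes(v4su v)
{
	const v16qu mask = {
		 3,  2,  1,  0,  7,  6,  5,  4,
		11, 10,  9,  8, 15, 14, 13, 12
	};
	return ((v4su)__builtin_shuffle((v16qu)v, mask));
}
```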
Thanks for your insights. :)
Prior to GCC 12, GCC had been generating scalar byteswap code that ruined performance. I just worked around that via __builtin_shuffle(), and now older versions of GCC will generate good byteswap code as far back as GCC 4.9.0. Versions older than that are not able to compile this at all because they do not understand the inline AVX2 instruction. Keep in mind that there is performance variation between runs, but here are the current numbers from a KVM Gentoo guest on Zen 3:
# cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 770448447 7874698323
implementation native byteswap
scalar 9674935874 9342370616
superscalar 12389694710 12378469202
superscalar4 13417221678 12187713211
generic-avx2 47470237051 39416631033
sse2 23714943741 11366615899
ssse3 23692572459 20166280422
avx2 41473313254 37962243248
fastest generic-avx2 generic-avx2
The only slight boost over the existing avx2 code in the byteswap case is a disappointment. I plan to investigate it some more.
You can avoid the swap from the loop even with the current clang implementation with something like this: https://gcc.godbolt.org/z/8P7hcPTP6 But I'm not sure it's worth the additional complexity in the code. Of course, if all the additional swaps outside the loop are a concern, this algorithm can of course be completely implemented with vector builtins - again with significant added complexity.
The only slight boost over the existing avx2 code in the byteswap case is a disappointment. I plan to investigate it some more.
According to Agner Fog, the additions can use execution ports 0, 1, 2 and 3, but the shuffles used for the byteswap can only use execution ports 1 and 2. My guess is that the CPU pipeline starts the next iteration before the current iteration has finished to try to occupy all 4 execution ports. Then, since we only have 2 execution ports for the shuffle, it hits a pipeline hazard and handles that by inserting bubbles, which reduces execution port occupancy. Thus we do slightly better than the older avx2 byteswap case since we keep the execution ports occupied slightly more, but not by much due to the pipeline hazard.
You can avoid the swap from the loop even with the current clang implementation with something like this: https://gcc.godbolt.org/z/8P7hcPTP6 But I'm not sure it's worth the additional complexity in the code. Of course, if all the additional swaps outside the loop are a concern, this algorithm can of course be completely implemented with vector builtins - again with significant added complexity.
I had not considered using vector builtins. When I had tried using Intel intrinsics, the result failed to compile inside the kernel since we do not have the userspace headers, so I had fallen back on inline assembly. Afterward, I had been treating vector builtins as if they were intrinsics that need a header, but that was wrong.
I think I will do a hybrid approach. I will write a generic version that does 128-bit SIMD operations in the GNU C code to work around the GCC vector lowering bug on PPC64. Then I will add vector builtins where needed if the resulting code is unsatisfactory.
Thanks again for your help. Since you are probably going to see this comment, would you know if there is a POWER version of Agner Fog's instruction tables anywhere?
https://agner.org/optimize/instruction_tables.pdf
Those have been helpful in showing me what the capabilities of x86_64 hardware are. I feel blind without a similar reference for POWER, since I do not know how wide the hardware actually is and it is awkward asking people I know that have POWER systems (Talos II, I think) to test multiple variations of the same kernel code repeatedly.
Note that the x86 concept of "execution ports" is somewhat unique and other architectures don't have the same bottleneck.
You can see the Power scheduling characteristics in the GCC or LLVM instruction scheduling description.
You also can request an account on the CFARM, which hosts many different architectures, including Power. And you can request a Power VM through the PowerDev project at OSUOSL: https://osuosl.org/services/powerdev/request_hosting/
Note that the x86 concept of "execution ports" is somewhat unique and other architectures don't have the same bottleneck.
You can see the Power scheduling characteristics in the GCC or LLVM instruction scheduling description.
You also can request an account on the CFARM, which hosts many different architectures, including Power. And you can request a Power VM through the PowerDev project at OSUOSL: https://osuosl.org/services/powerdev/request_hosting/
Sort of, right? Ultimately there's going to be an upper limit to the instruction level parallelism that is bound to how deep the pipeline is and how many execution units are in the FPU. From what I recall, and this is from the bad old days of POWER yore (such as the PowerPC 970), you had to unroll loops a painful amount to realize the full potential of the superscalar nature, as it couldn't just automagically fill the pipeline and break data dependency chains. To some extent with checksums, this is still necessary because no matter how sophisticated the microarchitecture, it doesn't know that the algorithm is agnostic to things like the order in which things are accumulated or how many different registers can be in flight for the intermediate sums before they matter.
Now, I imagine POWER has advanced quite a bit since 2005 and it's maybe a bit better at speculating further out. And maybe things like register renaming can prevent clobbering over fixed architectural register destinations?
Power has decode groups, but most simple load, store, and vector instructions can occupy any slot and issue to any function unit. For performance, one assumes single threaded configuration with maximum number of function units available. x86 bottleneck is the execution ports, which can be filled even if there are available function units for the uops.
Power has decode groups, but most simple load, store, and vector instructions can occupy any slot and issue to any function unit. For performance, one assumes single threaded configuration with maximum number of function units available. x86 bottleneck is the execution ports, which can be filled even if there are available function units for the uops.
Ah ok, so the analogy really falls apart around the fact that only particular instructions can be dispatched to certain execution ports with x86. In that sense the POWER uarch has fewer restrictions during its instruction reordering but a similar amount of calculus has to be applied for data dependency chains vs total instruction throughput.
Note that the x86 concept of "execution ports" is somewhat unique and other architectures don't have the same bottleneck.
I assume that the individual ALUs (or VSUs in POWER terminology) are still a bottleneck even if they are not grouped into execution ports. I sort of found what I want on wikichip, although I would have liked more granular information on which instructions go to which units and what the instruction throughputs are:
https://en.wikichip.org/wiki/ibm/microarchitectures/power9#SMT4_core
Unfortunately, wikichip only had a core description for POWER9. I would have liked to know a bit more about POWER8 and POWER10 too.
You can see the Power scheduling characteristics in the GCC or LLVM instruction scheduling description.
Unfortunately, I am not sure where that is.
That said, I made progress with a generic version that uses 128-bit SIMD operations:
https://gcc.godbolt.org/z/YofE8Eej7
I no longer need to do the power9-vector hack for GCC.
I am a little surprised by the code generation of the respective compilers. GCC 12.2.0 generated near optimal code for ppc64le. The ppc64be version does not look quite as good, but after fighting with compilers for a few hours, I am more willing to accept a little imperfection in both cases. I suspect that the superscalar architecture of POWER will do the loads and permutations for the next iteration during the additions for the current iteration, such that better code might save energy, but it would not execute any faster.
Clang's output does not look so great on the byteswap case, but this was a hack meant to work around a GCC bug, so there is no need to feed Clang this version. Instead, it can get the version that does not work around GCC's vector lowering bug, on which it generates fairly good code:
https://gcc.godbolt.org/z/d4oPr7Yos
Note that since pinging edelsohn, I have switched to an implementation of the algorithm that does 8 parallel accumulator streams. The earlier godbolt links had been based on the 4 parallel accumulator stream version. The 8 parallel accumulator stream version is what gave us a boost on Zen 3 and I am inclined to give the same to POWER for the sake of reducing the number of parallel versions we need to maintain. From what I am reading, 8 parallel accumulator streams might be overkill for it, but it hopefully would somewhat future proof this against future POWER processors.
I did try to further improve the loads, but GCC did not support the Clang builtins, and it generated additional instructions when I tried inlining things like asm ("lxvd2x %0, 0, %1" : "=r" (t) : "m" (*ip));. I had been more willing to do this when I thought I could do one solution for both compilers like I did on x86_64, but having to do it 8 times to cover both compilers, both endians and both native/byteswap is just too much for now, so I will leave that for future work.
@KungFuJesus It is a moot point now, but I found the source for my earlier remark:
Prior to posting the PR, a quick search for information on xxswapd suggested that it did a permutation at the level of 32-bit components, which is unrelated to endian-ness.
That was based on this:
Calling xxswapd followed by vperm seems to be a lot like calling shuffle_epi32 followed by shuffle_epi8 on an x86 machine. It feels like the two permutes should be folded into one.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753#c0
I have just pushed a revised patch with the latest work. There is a version on godbolt:
https://gcc.godbolt.org/z/hz1xjjWv4
In GCC source code, the scheduling description is in the target directory with processor names, e.g., power8.md, power9.md, power10.md
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/rs6000/power10.md;h=80b1087c428c249a2b85e982d8f7fb760f1ab40e;hb=refs/heads/trunk
The inlined asm will not produce efficient code because the constraints are wrong.
asm ("lxvd2x %0, 0, %1" : "=r" (t) : "m" (*ip));
The processor is not a VAX with only general purpose registers. The r constraint is for GPRs and lxvd2x loads into a VSR. Also, m constraint may produce an invalid memory reference syntax for that instruction. One wants something like
asm ("lxvd2x %x0, %y1" : "=wa" (t) : "Z" (*ip));
I'm surprised that GCC chose to generate lvx instead of lxvd2x. But with recent releases of GCC and Clang, it's best to use assignment operators and allow the compiler to generate the vector load and store code instead of trying to choose the best, ever-changing vector load/store instruction for inline assembly.
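Concretely, that looks something like the following (GNU C vector types; the type and function names are illustrative):

```c
#include <stdint.h>

typedef uint32_t v4su __attribute__((vector_size(16)));

/*
 * Corrected constraints from above: a VSX register output ("wa", printed
 * with %x0) and a Z-form indexed/indirect memory input (printed with
 * %y1).  Note that on little endian the register still holds its
 * doublewords in lxvd2x order.
 */
static inline v4su
load_lxvd2x(const v4su *ip)
{
	v4su t;
	asm ("lxvd2x %x0, %y1" : "=wa" (t) : "Z" (*ip));
	return (t);
}

/*
 * Preferred: a plain assignment lets the compiler choose the best vector
 * load instruction for the target CPU and endianness.
 */
static inline v4su
load_plain(const v4su *ip)
{
	return (*ip);
}
```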
Hi!
You should never code byteswaps or load/store instructions directly: that only prevents the compiler from doing a good job. All the pointer casts in this hinder a lot as well (and make this invalid C code most likely!) You can write this as valid code as well, which as a bonus will be much easier to read?
In GCC source code, the scheduling description is in the target directory with processor names, e.g., power8.md, power9.md, power10.md
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/rs6000/power10.md;h=80b1087c428c249a2b85e982d8f7fb760f1ab40e;hb=refs/heads/trunk
That tells me what I need to know. Thanks.
The inlined asm will not produce efficient code because the constraints are wrong.
asm ("lxvd2x %0, 0, %1" : "=r" (t) : "m" (*ip));
The processor is not a VAX with only general purpose registers. The r constraint is for GPRs and lxvd2x loads into a VSR. Also, m constraint may produce an invalid memory reference syntax for that instruction. One wants something like
asm ("lxvd2x %x0, %y1" : "=wa" (t) : "Z" (*ip));
Unfortunately, the constraint modifiers do not appear to be documented for POWER:
https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Extended-Asm.html#Extended-Asm
I had been blindly guessing and hoping that the output would look sane. That did not succeed for the reasons you stated. I probably should have looked at the GCC source code, but navigating it has a fairly high learning curve, so I had refrained from doing that.
I'm surprised that GCC chose to generate lvx instead of lxvd2x. But with recent releases of GCC and Clang, it's best to use assignment operators and allow the compiler to generate the vector load and store code instead of trying to choose the best, ever-changing vector load/store instruction for inline assembly.
That is good advice. I would have liked to have taken it on all architectures, but unfortunately, GCC does not generate x86_64 code that beats the implementation Intel wrote in 2015 without an inline instruction to load things efficiently. Nearly all of the project contributors develop on Intel/AMD hardware and most use GCC, so I would have trouble getting reviewers for this without it.
I have cleaned up the PR to put it into a mergeable state. I am going to proceed with testing as soon as I either get a volunteer or get a PowerVM at the OSUOSL, although I will probably be too busy to apply for it until next week. If that goes well, this can be merged.
Hi!
You should never code byteswaps or load/store instructions directly: that only prevents the compiler from doing a good job.
I checked the code generated by several compilers and I can guarantee that the compiler does a much worse job without those tricks. Performance drops from 48GB/sec to 28GB/sec on Zen 3 when compiling with GCC merely from dropping the hard coded vpmovzxdq. The reason why becomes obvious when looking at the disassembly:
https://gcc.godbolt.org/z/8Yod7MoGx https://gcc.godbolt.org/z/5q56qs7az
Clang does not have that problem, but most of our users use GCC. The suggestion to inline the instruction (and much more) came from Intel:
https://www.intel.com/content/www/us/en/developer/articles/technical/fast-computation-of-fletcher-checksums.html https://github.com/openzfs/zfs/blob/master/module/zcommon/zfs_fletcher_intel.c
Avoiding the need to hard code assembly for everything but that one instruction while being faster is an improvement over the status quo.
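For reference, the kind of single-instruction hack in question looks roughly like this; it is a sketch assuming GNU C vector types and AVX2, with an illustrative function name rather than the exact code in the patch:

```c
#include <stdint.h>

typedef uint64_t v4du __attribute__((vector_size(32)));

/*
 * Load four 32-bit words and zero-extend them into four 64-bit lanes
 * with a single vpmovzxdq.  The array-pointer cast tells the compiler
 * that 16 bytes of memory are read.
 */
static inline v4du
load_zext_4x32(const uint32_t *ip)
{
	v4du t;
	asm ("vpmovzxdq %1, %0"
	    : "=x" (t)
	    : "m" (*(const uint32_t (*)[4])ip));
	return (t);
}
```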
As for the byteswaps, both Clang and GCC generate worse code unless I do the byteswaps via __builtin_shuffle/__builtin_shufflevector. They apparently have optimization passes that do the right thing for those two, but not for a series of scalar byteswaps (since there is no generic vector byteswap builtin) when the vector width is 512-bit.
All the pointer casts in this hinder a lot as well (and make this invalid C code most likely!) You can write this as valid code as well, which as a bonus will be much easier to read?
I had not considered that the casts could get in the way of optimization. You are also right that I can avoid that idiom to make the code easier to read, so I will do that.
Edit: Earlier versions of this had a digression about the standard labeling that idiom undefined when it did not cause miscompilation either in theory or in practice. My unsolicited commentary on the standard was a mistake. I apologize for that counterproductive digression.
@segher I rewrote this to make it more readable per your recommendation. Clang's output is unchanged on all architectures. GCC's output is the same on x86_64. However, it worsens the native case on power64le (while improving the byteswap case):
https://gcc.godbolt.org/z/ehbad9jqv - The less readable version https://gcc.godbolt.org/z/n5oo75E56 - The more readable version
On the other hand, GCC on power64be has improved its assembly output for the byteswap case, while worsening the native case:
https://gcc.godbolt.org/z/M9fEbWonP - The less readable version https://gcc.godbolt.org/z/cxosn11Gb - The more readable version
I think most end users use power64le over power64be and they definitely use the native case far more than the byteswap case, so this is objectively worse for them. However, I do like the cleaner code and I am hopeful that the superscalar architecture of POWER will mitigate the worse code, so I will adopt the cleaner version.
Edit: Earlier versions of this comment had things inverted since I had swapped the new and old versions by mistake when doing the comparison. That is now fixed.
Note that every time I change this, I check the assembly output of:
- GCC 12.2.0 for x86_64
- Clang 15.0.0 for x86_64
- GCC 12.2.0 for power64
- GCC 12.2.0 for power64le
- Clang 15.0.0 for ppc64
- Clang 15.0.0 for ppc64le
For each compiler, I check the assembly for both the native and byteswap versions. Sometimes, I also check the output from older compilers and/or check to see if recent code changes allow for some of the performance hacks I did to be removed. It is a very tedious process since there is a combinatoric explosion. :/
aarch64 support has been added. Like the PPC64 support, the aarch64 support is not expected to perform well unless either Clang or GCC 12 or later is used to build it.
I am not sure why the Debian 8 ppc64 buildbot is failing to build this. I was told by a volunteer that all that was needed to make this build on ppc64 was CFLAGS_REMOVE_zcommon/zfs_fletcher_generic.o += -msoft-float -mno-vsx. No matter what I push, the buildbot fails to build it. It might just be that the build systems for older kernels do not support forward slashes in GNU make file variables. :/
@gyakovlev sent me a link with results for this on POWER9 LE:
https://gist.github.com/gyakovlev/7ed93b90856c2e717e3e587a47e224ba
It seems that a scalar 4 accumulator stream version is outperforming the vector 8 accumulator stream version:
https://github.com/openzfs/zfs/blob/master/module/zcommon/zfs_fletcher_superscalar4.c#L89
That is even after switching to -mpower9-vector, which generates the best assembly code that I have seen for POWER so far. Unless there is an instruction that can load 2x32-bit values into 2x64-bit registers (like Intel's vpmovzxdq for loading 4x 32-bit values into 4x 64-bit registers), I do not see how the assembly output could possibly become better.
The only reason for the lower performance that I could see is that there is no additional parallelism available in the POWER9 core for SIMD to exploit. That would mean that the act of doing SIMD with 8 accumulator streams to try to exploit more parallelism just adds more overhead in the form of both the vector permutation instructions and the additional work to combine the 8 accumulator streams. @edelsohn Does this sound right to you?
To be clear, that is to say, from slowest to fastest, you have: vector8 < superscalar4 < vector4?
I wouldn't expect the 8 sum version to be significantly worse like it is in that posted benchmark, but I'd have to see what it's ultimately compiling to. I've seen wider SIMD loops get slower for various reasons (though it's quite possible the extra permutes make it worse, depending on their latency).
Other weird things that can do this (some of which might be specific to Intel):
- Instruction alignment for a loop limiting parallelism in the microarchitecture
- Adding an additional data dependency bubble
- Register spills to stack (probably doubtful in this case)
- Execution port limitations (definitely specific to Intel)