
[wip] IP checksum in AVX2 assembler (prototype rewrite)

Open lukego opened this issue 9 years ago • 7 comments

This branch rewrites the AVX2 checksum routine with DynASM assembler instead of GCC intrinsics.

My motivations and perceived benefits are a little subjective:

  • Practice writing assembler code with DynASM - a powerful secret weapon.
  • Practice using x86 SIMD instructions - untapped potential growing exponentially.
  • Reduce reliance on GCC - a boring complex external dependency.
  • Get Snabb to compile on older Redhat/CentOS - whose GCC lacks AVX2 support.
  • Get to know this checksum routine better - originally ported from another project.
  • Have a chance to try and squeeze more performance.

I will be satisfied if the assembler version is at least as short as the C version and also at least as fast.

Looks promising so far. The version here is basically working but seems to be missing a carry bit somewhere (off-by-one on some tests). The code is a little over half the size of the C version. The performance seems at least as good.

Just have simple microbenchmarks for now. I would like to get a more thorough performance test like #755 upstream to verify this change. Meanwhile, here is how it compares to the C code based on the microbenchmark included in the PR (1 million iterations on the same input). For the case of a 150-byte input:

ASM
EVENT                                             TOTAL       /call      /block       /byte
cycles                                       45,552,172      45.552       9.718       0.304
instructions                                 84,139,687      84.140      17.950       0.561
call                                          1,000,000       1.000       0.213       0.007
block                                         4,687,500       4.688       1.000       0.031
byte                                        150,000,000     150.000      32.000       1.000

C
EVENT                                             TOTAL       /call      /block       /byte
cycles                                       64,792,114      64.792      13.822       0.432
instructions                                248,176,655     248.177      52.944       1.655
call                                          1,000,000       1.000       0.213       0.007
block                                         4,687,500       4.688       1.000       0.031
byte                                        150,000,000     150.000      32.000       1.000

Here is a little table with some other values:

Code \ Bytes        31    32   100   150   500  1,500  5,000  10,000  100,000
C (cycles)         84*   85*  263*    65    96    206    498     970   10,005
ASM (cycles)        44    19    47    46    51    116    402     772    9,085

EDIT: Marked with * the small sizes where our C code actually punts to non-SIMD.

So: fun and encouraging so far but more work to be done. Could still be some big mistakes that invalidate this code and/or results for now.

cc @tonyrog and @jsnell who may also be interested.

lukego avatar Apr 27 '16 12:04 lukego

Fixed the bug with 4993c37. This code works fine up to 128KB inputs in casual testing. That limitation seems okay to me, i.e. not worth writing more code to increase it, because packets are not that big.

The next step is to integrate and test/benchmark more extensively both with synthetic benchmarks and end-to-end tests (offloading checksums from QEMU VMs).

lukego avatar Apr 28 '16 04:04 lukego

Added snabbmark checksum benchmark:

$ taskset -c 1 sudo ./snabb snabbmark checksum
VARIANT          BYTES/PACKET    BYTES/CYCLE  CYCLES/PACKET
base                  631.331          0.326       1935.081
asm                   631.331          4.244        148.770
avx2                  631.331          2.743        230.180
sse2                  631.331          2.318        272.319

This is intended to be fairly harsh and realistic. The input sizes, contents, and alignments are randomized. I draw the input sizes from a log-uniform distribution, which is my current favorite for packet sizes (mostly small but also including large and jumbo sizes). I also updated the assembler routine to have the same interface as the others (added the initial argument).

The assembler routine shows the best results by far. The older AVX routine is likely suffering from the logic that selectively falls back to the generic routine on small inputs (both because of the unpredictable branch and because the current implementation of the generic checksum is beyond awful -- we should fix that as a matter of principle even if we use the SIMD one in practice).

This is starting to look like effort well spent! IP checksum is the main hotspot for Virtio-net with client/server workloads e.g. running iperf in VM. Cycles saved here should translate directly into extra capacity for the NFV application.

lukego avatar Apr 30 '16 17:04 lukego

This branch is taking a little bit of a different turn:

I found a fairly straightforward formulation of checksum in C that GCC is able to automatically vectorize when compiled with -O3. I hacked the Makefile to compile this function twice: once as cksum_avx2() using -mavx2 and once as cksum() using default settings (i.e. SSE). This is simpler and faster than the hand-coded SSE and AVX routines based on C intrinsics so I have removed those.

I have retained the AVX2 assembler variant as this is still the fastest by a significant margin.

Current scoreboard:

VARIANT          BYTES/PACKET    BYTES/CYCLE  CYCLES/PACKET
base                  631.331          2.921        216.117
asm                   631.331          4.172        151.310
avx2                  631.331          3.435        183.775

The next step is to eliminate either the C/AVX2 implementation or the assembler one. The open problem with the assembler one right now is the wart that it temporarily overwrites the memory trailing the input, which is a different and more complex interface that may not suit all usages. The open problem with the C/AVX2 implementation is that it is slower than the assembler.

lukego avatar May 01 '16 07:05 lukego

My usual worry with auto-vectorization for critical hotspots is that at some point it'll fail to trigger for whatever reason, and you'll end up silently running the naive version of the code. It's ok in a closed environment where the compiler versions are guaranteed to be fixed and only get upgraded at very specific points in time, but more problematic when you as the author have little control over how this gets compiled.

Though the simplicity of that new C code is pretty sweet.

jsnell avatar May 01 '16 18:05 jsnell

@jsnell Yes, I know what you mean. In Snabb we pin exact versions of LuaJIT and DynASM so that we can "geek out" on them, but we have tried to be conservative with gcc, glibc, etc. I am not especially comfortable with either the auto-vectorized or the vectorized-with-C-intrinsics version.

I imagine it is interesting in other projects where there are dedicated C hackers who pick one compiler (ICC, Clang, or GCC) and geek out on its vectorization features etc. -- at least until the distros come along and decide to use a different compiler, compile for a generic target architecture, skip the performance tests, etc. :).

Questions I am struggling with now are:

  1. Is the trick of zero-padding the input good (simpler code thanks to known properties of inputs) or bad (complex interface due to lazy coding)?
  2. Is it worth writing an SSE assembler version by hand to reduce dependence on GCC?
  3. How does performance compare with theoretical limits, i.e. based on the throughput and latency of the instructions involved? (How busy are the ALUs, and what is the limiting factor on performance?)

lukego avatar May 02 '16 12:05 lukego

What is the current status?

AkihiroSuda avatar Aug 03 '18 07:08 AkihiroSuda

@AkihiroSuda @dpino is hacking on a fast IP checksum implementation in the same spirit here: https://github.com/snabbco/snabb/pull/1275

eugeneia avatar Aug 15 '18 09:08 eugeneia