
ChaCha20: add optimized versions for amd64 (SSSE3 & AVX2)

Open hg opened this issue 3 years ago • 3 comments

With the 'new' protocol, ChaCha is taking a decent amount of CPU time, at least in a debug build (optimization makes the perf output unreadable):

Children      Self  Command  Shared Object         Symbol
-   99.72%     0.00%  tincd    libc.so.6             [.] __libc_init_first
     __libc_init_first
   - main
      - 99.71% main_loop
         - 90.08% event_loop
            - 69.62% handle_incoming_vpn_data
               - 69.20% handle_incoming_vpn_packet
                  - 68.06% receive_udppacket
                     - 67.88% sptps_receive_data
                        - 67.68% sptps_receive_data_datagram
                           - 49.15% chacha_poly1305_decrypt
                                39.06% chacha_encrypt_bytes
                                9.83% poly1305_auth
                           + 17.80% receive_sptps_record
                  + 0.67% lookup_node_udp
            + 13.34% handle_device_data
            + 4.69% recvmmsg
            + 1.37% handle_meta_io
         + 9.50% epoll_wait

tincd is using the lowest common denominator implementation of this function. Let's add a couple of optimized ones based on compiler intrinsics.

All the hard work has been done by Romain Dolbeau. I just copied it with some adjustments.

Compatibility

x86 / amd64

We'll be shipping three versions of the function (or two, with old compilers that lack AVX2 support):

  • generic C implementation
  • another one based on SSSE3
  • another one based on AVX2

The right one is picked at runtime depending on current CPU capabilities.

Other architectures

Only the old C implementation is used. ARM Neon could be added later.

Benchmarks

  • i5-4460
  • Linux 5.17.5
  • gcc 12.1
  • iperf3 in two network namespaces
  • performance CPU governor, as few processes as possible, all the basic benchmarking stuff

TL;DR: 20-22% increase in throughput.

bench_chacha.c

Percentage is relative to generic C implementation.

C

      32:     2790793.08 op/s
     256:     1261587.31 op/s
     512:      728262.56 op/s
    1024:      390193.12 op/s
   16384:       26361.08 op/s
  131072:        3320.87 op/s
 1048576:         415.73 op/s

SSE

      32:     3112408.34 op/s (+11%)
     256:     2441758.81 op/s (+93%)
     512:     1627719.13 op/s (+123%)
    1024:      972969.81 op/s (+149%)
   16384:       74304.47 op/s (+181%)
  131072:        9427.75 op/s (+183%)
 1048576:        1182.82 op/s (+184%)

AVX2

      32:     3159181.11 op/s (+13%)
     256:     2449003.64 op/s (+94%)
     512:     2450859.66 op/s (+236%)
    1024:     1628639.74 op/s (+317%)
   16384:      145438.38 op/s (+451%)
  131072:       18729.81 op/s (+464%)
 1048576:        2330.21 op/s (+460%)

Resolving the correct function on every call (instead of resolving once and storing the result in a function pointer) is a bit slower:

      32:     3126362.45 op/s
     256:     2395000.08 op/s
     512:     2399900.36 op/s
    1024:     1600087.45 op/s
   16384:      144505.38 op/s
  131072:       18464.47 op/s
 1048576:        2295.46 op/s

iperf3:

buildtype=release

C

[  5]  55.00-56.00  sec   115 MBytes   965 Mbits/sec    2    557 KBytes
[  5]  56.00-57.00  sec   115 MBytes   965 Mbits/sec    1    491 KBytes
[  5]  57.00-58.00  sec   116 MBytes   975 Mbits/sec    0    648 KBytes
[  5]  58.00-59.00  sec   115 MBytes   965 Mbits/sec    1    588 KBytes
[  5]  59.00-60.00  sec   115 MBytes   965 Mbits/sec    2    522 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.73 GBytes   963 Mbits/sec  136             sender
[  5]   0.00-60.00  sec  6.73 GBytes   963 Mbits/sec                  receiver

SSSE3

[  5]  55.00-56.00  sec   130 MBytes  1.09 Gbits/sec   25    600 KBytes
[  5]  56.00-57.00  sec   131 MBytes  1.10 Gbits/sec    2    560 KBytes
[  5]  57.00-58.00  sec   130 MBytes  1.09 Gbits/sec    2    515 KBytes
[  5]  58.00-59.00  sec   132 MBytes  1.11 Gbits/sec    0    683 KBytes
[  5]  59.00-60.00  sec   131 MBytes  1.10 Gbits/sec    2    649 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  7.64 GBytes  1.09 Gbits/sec  2659             sender
[  5]   0.00-60.00  sec  7.64 GBytes  1.09 Gbits/sec                  receiver

AVX2

[  5]  55.00-56.00  sec   142 MBytes  1.20 Gbits/sec    2    602 KBytes
[  5]  56.00-57.00  sec   141 MBytes  1.19 Gbits/sec    2    574 KBytes
[  5]  57.00-58.00  sec   142 MBytes  1.20 Gbits/sec    1    550 KBytes
[  5]  58.00-59.00  sec   142 MBytes  1.20 Gbits/sec    2    520 KBytes
[  5]  59.00-60.00  sec   142 MBytes  1.20 Gbits/sec    1    494 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  8.29 GBytes  1.19 Gbits/sec  1126             sender
[  5]   0.00-60.00  sec  8.28 GBytes  1.19 Gbits/sec                  receiver

I thought that optimizing for the specific CPU, or the auto-vectorization performed at -O3, might remove the need for hand-written intrinsics:

buildtype=release + -march=native -mtune=native

C

[  5]  55.00-56.00  sec   110 MBytes   923 Mbits/sec    1    498 KBytes
[  5]  56.00-57.00  sec   110 MBytes   923 Mbits/sec    0    646 KBytes
[  5]  57.00-58.00  sec   110 MBytes   923 Mbits/sec    3    581 KBytes
[  5]  58.00-59.00  sec   110 MBytes   923 Mbits/sec    2    506 KBytes
[  5]  59.00-60.00  sec   110 MBytes   923 Mbits/sec    0    650 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.40 GBytes   916 Mbits/sec  2579             sender
[  5]   0.00-60.01  sec  6.40 GBytes   916 Mbits/sec                  receiver

AVX2

[  5]  55.00-56.00  sec   141 MBytes  1.18 Gbits/sec    4    649 KBytes
[  5]  56.00-57.00  sec   142 MBytes  1.20 Gbits/sec    1    626 KBytes
[  5]  57.00-58.00  sec   142 MBytes  1.20 Gbits/sec    2    602 KBytes
[  5]  58.00-59.00  sec   142 MBytes  1.20 Gbits/sec    1    571 KBytes
[  5]  59.00-60.00  sec   141 MBytes  1.18 Gbits/sec    3    539 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  8.30 GBytes  1.19 Gbits/sec  981             sender
[  5]   0.00-60.00  sec  8.30 GBytes  1.19 Gbits/sec                  receiver

-O3 + -march=native -mtune=native

C

[  5]  55.00-56.00  sec   111 MBytes   933 Mbits/sec    0    680 KBytes
[  5]  56.00-57.00  sec   111 MBytes   933 Mbits/sec    3    619 KBytes
[  5]  57.00-58.00  sec   112 MBytes   944 Mbits/sec    1    554 KBytes
[  5]  58.00-59.00  sec   111 MBytes   933 Mbits/sec    0    691 KBytes
[  5]  59.00-60.00  sec   111 MBytes   933 Mbits/sec    2    634 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.48 GBytes   927 Mbits/sec  3267             sender
[  5]   0.00-60.00  sec  6.47 GBytes   926 Mbits/sec                  receiver

AVX2

[  5]  55.00-56.00  sec   139 MBytes  1.16 Gbits/sec    1    639 KBytes
[  5]  56.00-57.00  sec   139 MBytes  1.16 Gbits/sec    1    607 KBytes
[  5]  57.00-58.00  sec   138 MBytes  1.15 Gbits/sec    1    578 KBytes
[  5]  58.00-59.00  sec   139 MBytes  1.16 Gbits/sec    2    546 KBytes
[  5]  59.00-60.00  sec   138 MBytes  1.15 Gbits/sec    1    510 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  8.01 GBytes  1.15 Gbits/sec  312             sender
[  5]   0.00-60.00  sec  8.00 GBytes  1.15 Gbits/sec                  receiver

Not really.

hg avatar May 22 '22 14:05 hg

I tried the NEON implementation from the same author. It doesn't have as much of an edge over the generic one, at least on a Neoverse N1, so iperf3 speeds are comparable.

If anyone has an old Raspberry Pi and wants to test this on a slower CPU — be my guest. The code is in pretty bad shape since it's a work in progress, but it does work.

Otherwise, I see no reason to merge NEON support.

generic

      32:     2846064.09 op/s
     256:     1424432.86 op/s
     512:      850478.15 op/s
    1024:      473508.42 op/s
   16384:       33032.01 op/s
  131072:        4146.80 op/s
 1048576:         517.55 op/s
[  5]  55.00-56.00  sec   100 MBytes   839 Mbits/sec    0    658 KBytes
[  5]  56.00-57.00  sec   101 MBytes   849 Mbits/sec    8    571 KBytes
[  5]  57.00-58.00  sec   102 MBytes   860 Mbits/sec    2    484 KBytes
[  5]  58.00-59.00  sec   101 MBytes   849 Mbits/sec    0    622 KBytes
[  5]  59.00-60.00  sec   101 MBytes   849 Mbits/sec    1    532 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  5.67 GBytes   812 Mbits/sec  2875             sender
[  5]   0.00-60.00  sec  5.67 GBytes   812 Mbits/sec                  receiver

NEON

      32:     2933899.98 op/s (+3%)
     256:     1943603.43 op/s (+36%)
     512:     1242317.12 op/s (+46%)
    1024:      724495.40 op/s (+53%)
   16384:       53621.92 op/s (+62%)
  131072:        6754.49 op/s (+62%)
 1048576:         844.45 op/s (+63%)
[  5]  55.00-56.00  sec   101 MBytes   849 Mbits/sec    1    536 KBytes
[  5]  56.00-57.00  sec   100 MBytes   839 Mbits/sec    0    666 KBytes
[  5]  57.00-58.00  sec   101 MBytes   849 Mbits/sec    3    584 KBytes
[  5]  58.00-59.00  sec   100 MBytes   839 Mbits/sec    0    704 KBytes
[  5]  59.00-60.00  sec   102 MBytes   860 Mbits/sec    2    626 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  5.81 GBytes   832 Mbits/sec  1660             sender
[  5]   0.00-60.00  sec  5.81 GBytes   832 Mbits/sec                  receiver
Badly modified `ns_ping.py` for running iperf3
#!/usr/bin/env python3

"""Create two network namespaces and run iperf3 between them."""

import os
import signal
import subprocess as subp
import typing as T

from testlib import external as ext, util, template, cmd
from testlib.log import log
from testlib.proc import Tinc, Script
from testlib.test import Test

util.require_root()
util.require_command("ip", "netns", "list")
util.require_path("/dev/net/tun")

IP_FOO = "192.168.1.1"
IP_BAR = "192.168.1.2"
MASK = 24


def init(ctx: Test) -> T.Tuple[Tinc, Tinc]:
    """Initialize new test nodes."""
    foo, bar = ctx.node(), ctx.node()

    log.info("create network namespaces")
    assert ext.netns_add(foo.name)
    assert ext.netns_add(bar.name)

    log.info("initialize two nodes")

    stdin = f"""
        init {foo}
        set Port 0
        set Subnet {IP_FOO}
        set Interface {foo}
        set Address localhost
        set AutoConnect no
    """
    foo.cmd(stdin=stdin)
    foo.add_script(Script.TINC_UP, template.make_netns_config(foo.name, IP_FOO, MASK))
    foo.start()

    stdin = f"""
        init {bar}
        set Port 0
        set Subnet {IP_BAR}
        set Interface {bar}
        set Address localhost
        set AutoConnect no
    """
    bar.cmd(stdin=stdin)
    bar.add_script(Script.TINC_UP, template.make_netns_config(bar.name, IP_BAR, MASK))

    cmd.exchange(foo, bar)

    return foo, bar


def ping(namespace: str, ip_addr: str) -> int:
    """Run iperf3 from one network namespace against the other."""
    log.info("running iperf3 from netns %s against %s", namespace, ip_addr)

    proc = subp.run(
        ["ip", "netns", "exec", namespace, "iperf3", "-t", "60", "-c", ip_addr],
        check=False,
    )

    log.info("iperf3 finished with code %d", proc.returncode)
    return proc.returncode


with Test("ns-ping") as context:
    foo_node, bar_node = init(context)
    bar_node.cmd("start")

    log.info("waiting for nodes to come up")
    bar_node[Script.TINC_UP].wait()

    log.info("add script foo/host-up")
    bar_node.add_script(foo_node.script_up)

    log.info("add ConnectTo clause")
    bar_node.cmd("add", "ConnectTo", foo_node.name)

    log.info("bar waits for foo")
    bar_node[foo_node.script_up].wait()

    subp.Popen(
        ["ip", "netns", "exec", bar_node.name, "iperf3", "-s"],
        stdout=subp.DEVNULL,
        stderr=subp.DEVNULL,
    )

    log.info("ping must work after connection is up")
    assert not ping(foo_node.name, IP_BAR)

hg avatar May 22 '22 16:05 hg

The benchmark code you wrote is not really measuring the usage pattern tinc has. The `sptps_speed` program (which you might have to explicitly build using `ninja -C build src/sptps_speed`) tests the SPTPS protocol more thoroughly, using typical MTU-sized packets.

Performance measurements using debug builds are mostly useless. Have you tried configuring the build system with `-Dbuildtype=debugoptimized`?

I also have a branch (PR #360) that adds AES-256-GCM support to SPTPS, which depends on OpenSSL, but it will then also use OpenSSL for Chacha20-Poly1305. It might be interesting to compare the speed with this optimized version as well.

gsliepen avatar May 22 '22 19:05 gsliepen

Oh yes. If you're willing to reintroduce a dependency on libssl, that would of course be best. We get highly optimized code for all possible architectures for free.

TL;DR:

| flavor | throughput |
| --- | --- |
| generic | 0.974 Gbit/s |
| avx2 | 1.19 Gbit/s |
| libssl 1.1 | 1.28 Gbit/s |
| libssl 3.1 | 1.23 Gbit/s* |

*: probably a debug build

detailed results

generic

Generating keys for 10 seconds:          27182.71 op/s
Ed25519 sign for 10 seconds:             25932.53 op/s
Ed25519 verify for 10 seconds:            8360.30 op/s
ECDH for 10 seconds:                      6783.36 op/s
SPTPS/TCP authenticate for 10 seconds:    3173.35 op/s
SPTPS/TCP transmit for 10 seconds:           2.47 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3167.00 op/s
SPTPS/UDP transmit for 10 seconds:           2.46 Gbit/s

[  5]  55.00-56.00  sec   116 MBytes   975 Mbits/sec    0    686 KBytes
[  5]  56.00-57.00  sec   119 MBytes   996 Mbits/sec    1    634 KBytes
[  5]  57.00-58.00  sec   118 MBytes   986 Mbits/sec    2    573 KBytes
[  5]  58.00-59.00  sec   118 MBytes   986 Mbits/sec    1    506 KBytes
[  5]  59.00-60.00  sec   118 MBytes   986 Mbits/sec    0    662 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.80 GBytes   974 Mbits/sec  5725             sender
[  5]   0.00-60.01  sec  6.80 GBytes   973 Mbits/sec                  receiver

avx2

Generating keys for 10 seconds:          27313.96 op/s
Ed25519 sign for 10 seconds:             26049.17 op/s
Ed25519 verify for 10 seconds:            8411.54 op/s
ECDH for 10 seconds:                      6792.54 op/s
SPTPS/TCP authenticate for 10 seconds:    3156.87 op/s
SPTPS/TCP transmit for 10 seconds:           4.00 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3160.10 op/s
SPTPS/UDP transmit for 10 seconds:           4.13 Gbit/s
[  5]  25.00-26.00  sec   142 MBytes  1.20 Gbits/sec    1    509 KBytes
[  5]  26.00-27.00  sec   142 MBytes  1.20 Gbits/sec    0    691 KBytes
[  5]  27.00-28.00  sec   142 MBytes  1.20 Gbits/sec    1    672 KBytes
[  5]  28.00-29.00  sec   142 MBytes  1.20 Gbits/sec    1    649 KBytes
[  5]  29.00-30.00  sec   142 MBytes  1.20 Gbits/sec    2    619 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  4.12 GBytes  1.19 Gbits/sec  273             sender
[  5]   0.00-30.00  sec  4.12 GBytes  1.19 Gbits/sec                  receiver

alt (OpenSSL 1.1.1o)

Generating keys for 10 seconds:          27294.88 op/s
Ed25519 sign for 10 seconds:             26004.59 op/s
Ed25519 verify for 10 seconds:            8409.03 op/s
ECDH for 10 seconds:                      6786.60 op/s

Cipher suite 0 (Chacha20-Poly1305):
SPTPS/TCP authenticate for 10 seconds:    3182.28 op/s
SPTPS/TCP transmit for 10 seconds:           6.24 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3178.83 op/s
SPTPS/UDP transmit for 10 seconds:           6.16 Gbit/s

Cipher suite 1 (AES-256-GCM):
SPTPS/TCP authenticate for 10 seconds:    3177.70 op/s
SPTPS/TCP transmit for 10 seconds:           8.89 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3169.44 op/s
SPTPS/UDP transmit for 10 seconds:           8.73 Gbit/s
[  5]  25.00-26.00  sec   154 MBytes  1.29 Gbits/sec    1    628 KBytes
[  5]  26.00-27.00  sec   154 MBytes  1.29 Gbits/sec    4    619 KBytes
[  5]  27.00-28.00  sec   155 MBytes  1.30 Gbits/sec    2    611 KBytes
[  5]  28.00-29.00  sec   155 MBytes  1.30 Gbits/sec    2    595 KBytes
[  5]  29.00-30.00  sec   154 MBytes  1.29 Gbits/sec    1    585 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  4.42 GBytes  1.28 Gbits/sec  3427             sender
[  5]   0.00-30.00  sec  4.41 GBytes  1.28 Gbits/sec                  receiver

alt (OpenSSL 3.1)

I don't remember what flags were used to build the library. It may be a debug build, so don't pay much attention to it.

Generating keys for 10 seconds:          27368.60 op/s
Ed25519 sign for 10 seconds:             26068.22 op/s
Ed25519 verify for 10 seconds:            8434.89 op/s
ECDH for 10 seconds:                      6839.51 op/s

Cipher suite 0 (Chacha20-Poly1305):
SPTPS/TCP authenticate for 10 seconds:    3152.25 op/s
SPTPS/TCP transmit for 10 seconds:           5.83 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3149.35 op/s
SPTPS/UDP transmit for 10 seconds:           5.77 Gbit/s

Cipher suite 1 (AES-256-GCM):
SPTPS/TCP authenticate for 10 seconds:    3151.33 op/s
SPTPS/TCP transmit for 10 seconds:           7.69 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3149.60 op/s
SPTPS/UDP transmit for 10 seconds:           7.88 Gbit/s
[  5]  25.00-26.00  sec   148 MBytes  1.24 Gbits/sec    4    670 KBytes
[  5]  26.00-27.00  sec   148 MBytes  1.24 Gbits/sec    2    649 KBytes
[  5]  27.00-28.00  sec   148 MBytes  1.24 Gbits/sec    1    632 KBytes
[  5]  28.00-29.00  sec   148 MBytes  1.24 Gbits/sec    2    609 KBytes
[  5]  29.00-30.00  sec   148 MBytes  1.24 Gbits/sec    1    591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  4.24 GBytes  1.23 Gbits/sec  973             sender
[  5]   0.00-30.00  sec  4.24 GBytes  1.22 Gbits/sec                  receiver

I'll leave it here for now until after #360 is merged.

hg avatar May 23 '22 07:05 hg