ChaCha20: add optimized versions for amd64 (SSSE3 & AVX2)
With the 'new' protocol, ChaCha is taking a decent amount of CPU time, at least in a debug build (optimization makes the perf output unreadable):
Children Self Command Shared Object Symbol
- 99.72% 0.00% tincd libc.so.6 [.] __libc_init_first
__libc_init_first
- main
- 99.71% main_loop
- 90.08% event_loop
- 69.62% handle_incoming_vpn_data
- 69.20% handle_incoming_vpn_packet
- 68.06% receive_udppacket
- 67.88% sptps_receive_data
- 67.68% sptps_receive_data_datagram
- 49.15% chacha_poly1305_decrypt
39.06% chacha_encrypt_bytes
9.83% poly1305_auth
+ 17.80% receive_sptps_record
+ 0.67% lookup_node_udp
+ 13.34% handle_device_data
+ 4.69% recvmmsg
+ 1.37% handle_meta_io
+ 9.50% epoll_wait
tincd is using the lowest-common-denominator implementation of this function. Let's add a couple of optimized versions based on compiler intrinsics.
All the hard work has been done by Romain Dolbeau. I just copied it with some adjustments.
Compatibility
x86 / amd64
We'll be shipping three versions of the function (or two, when building with an older compiler that lacks AVX2 support):
- generic C implementation
- another one based on SSSE3
- another one based on AVX2
The right one is picked at runtime depending on the capabilities of the current CPU.
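The dispatch itself is straightforward. A minimal sketch of what it could look like with GCC/Clang's `__builtin_cpu_supports()` follows; the symbol names here are illustrative, not necessarily the ones used in the patch:

```c
#include <stdint.h>

/* Illustrative declarations; the real code keeps the existing
   chacha_encrypt_bytes() interface. */
typedef void (*chacha_impl_t)(void *ctx, const uint8_t *in, uint8_t *out, uint32_t len);

void chacha_encrypt_bytes_c(void *ctx, const uint8_t *in, uint8_t *out, uint32_t len);
void chacha_encrypt_bytes_ssse3(void *ctx, const uint8_t *in, uint8_t *out, uint32_t len);
void chacha_encrypt_bytes_avx2(void *ctx, const uint8_t *in, uint8_t *out, uint32_t len);

/* Pick the fastest implementation the current CPU supports.
   __builtin_cpu_supports() is provided by GCC and Clang on x86/amd64. */
static chacha_impl_t resolve_chacha_impl(void) {
#ifdef HAVE_AVX2
	if(__builtin_cpu_supports("avx2")) {
		return chacha_encrypt_bytes_avx2;
	}
#endif

	if(__builtin_cpu_supports("ssse3")) {
		return chacha_encrypt_bytes_ssse3;
	}

	return chacha_encrypt_bytes_c;
}
```

With an older compiler, the AVX2 branch is simply compiled out (the `HAVE_AVX2` guard above), which is where the "two versions" case comes from.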
Other architectures
Only the old C implementation is used. ARM NEON could be added later.
Benchmarks
- i5-4460
- Linux 5.17.5
- gcc 12.1
- iperf3 in two network namespaces
- `performance` CPU governor, as few processes as possible, all the basic benchmarking stuff
TL;DR: 20-22% increase in throughput.
bench_chacha.c
Percentage is relative to generic C implementation.
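For context, the numbers below come from a throughput loop: for each buffer size, count how many encryption calls complete in a fixed wall-clock window. A rough, self-contained sketch of that idea (with a placeholder in place of the real ChaCha call, so this is not the actual bench_chacha.c):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Placeholder stand-in for whichever ChaCha implementation is being measured. */
static void chacha_encrypt_buffer(uint8_t *buf, size_t len) {
	for(size_t i = 0; i < len; i++) {
		buf[i] ^= (uint8_t)i; /* dummy work, not real ChaCha */
	}
}

static double now_seconds(void) {
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
	static uint8_t buf[1 << 20];
	const size_t sizes[] = {32, 256, 512, 1024, 16384, 131072, 1048576};

	for(size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		const double start = now_seconds();
		uint64_t ops = 0;

		/* Run for roughly one second and count completed operations. */
		while(now_seconds() - start < 1.0) {
			chacha_encrypt_buffer(buf, sizes[i]);
			ops++;
		}

		printf("%8zu: %12.2f op/s\n", sizes[i], ops / (now_seconds() - start));
	}

	return 0;
}
```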
C
32: 2790793.08 op/s
256: 1261587.31 op/s
512: 728262.56 op/s
1024: 390193.12 op/s
16384: 26361.08 op/s
131072: 3320.87 op/s
1048576: 415.73 op/s
SSE
32: 3112408.34 op/s (+11%)
256: 2441758.81 op/s (+93%)
512: 1627719.13 op/s (+123%)
1024: 972969.81 op/s (+149%)
16384: 74304.47 op/s (+181%)
131072: 9427.75 op/s (+183%)
1048576: 1182.82 op/s (+184%)
AVX2
32: 3159181.11 op/s (+13%)
256: 2449003.64 op/s (+94%)
512: 2450859.66 op/s (+236%)
1024: 1628639.74 op/s (+317%)
16384: 145438.38 op/s (+451%)
131072: 18729.81 op/s (+464%)
1048576: 2330.21 op/s (+460%)
Always resolving the correct function (instead of doing it once and storing it in a pointer) is a bit slower:
32: 3126362.45 op/s
256: 2395000.08 op/s
512: 2399900.36 op/s
1024: 1600087.45 op/s
16384: 144505.38 op/s
131072: 18464.47 op/s
1048576: 2295.46 op/s
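The "resolve once, store in a pointer" approach that the numbers above are compared against boils down to something like this (again just a sketch, reusing the hypothetical names from the dispatch sketch earlier):

```c
void chacha_encrypt_bytes(void *ctx, const uint8_t *in, uint8_t *out, uint32_t len) {
	/* Resolved on first use; subsequent calls go straight through the pointer. */
	static chacha_impl_t impl;

	if(!impl) {
		impl = resolve_chacha_impl();
	}

	impl(ctx, in, out, len);
}
```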
iperf3:
buildtype=release
C
[ 5] 55.00-56.00 sec 115 MBytes 965 Mbits/sec 2 557 KBytes
[ 5] 56.00-57.00 sec 115 MBytes 965 Mbits/sec 1 491 KBytes
[ 5] 57.00-58.00 sec 116 MBytes 975 Mbits/sec 0 648 KBytes
[ 5] 58.00-59.00 sec 115 MBytes 965 Mbits/sec 1 588 KBytes
[ 5] 59.00-60.00 sec 115 MBytes 965 Mbits/sec 2 522 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.73 GBytes 963 Mbits/sec 136 sender
[ 5] 0.00-60.00 sec 6.73 GBytes 963 Mbits/sec receiver
SSSE3
[ 5] 55.00-56.00 sec 130 MBytes 1.09 Gbits/sec 25 600 KBytes
[ 5] 56.00-57.00 sec 131 MBytes 1.10 Gbits/sec 2 560 KBytes
[ 5] 57.00-58.00 sec 130 MBytes 1.09 Gbits/sec 2 515 KBytes
[ 5] 58.00-59.00 sec 132 MBytes 1.11 Gbits/sec 0 683 KBytes
[ 5] 59.00-60.00 sec 131 MBytes 1.10 Gbits/sec 2 649 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 7.64 GBytes 1.09 Gbits/sec 2659 sender
[ 5] 0.00-60.00 sec 7.64 GBytes 1.09 Gbits/sec receiver
AVX2
[ 5] 55.00-56.00 sec 142 MBytes 1.20 Gbits/sec 2 602 KBytes
[ 5] 56.00-57.00 sec 141 MBytes 1.19 Gbits/sec 2 574 KBytes
[ 5] 57.00-58.00 sec 142 MBytes 1.20 Gbits/sec 1 550 KBytes
[ 5] 58.00-59.00 sec 142 MBytes 1.20 Gbits/sec 2 520 KBytes
[ 5] 59.00-60.00 sec 142 MBytes 1.20 Gbits/sec 1 494 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 8.29 GBytes 1.19 Gbits/sec 1126 sender
[ 5] 0.00-60.00 sec 8.28 GBytes 1.19 Gbits/sec receiver
I thought that optimizing for a specific CPU, or the auto-vectorization performed at -O3, might remove the need to write the SIMD code by hand:
buildtype=release + -march=native -mtune=native
C
[ 5] 55.00-56.00 sec 110 MBytes 923 Mbits/sec 1 498 KBytes
[ 5] 56.00-57.00 sec 110 MBytes 923 Mbits/sec 0 646 KBytes
[ 5] 57.00-58.00 sec 110 MBytes 923 Mbits/sec 3 581 KBytes
[ 5] 58.00-59.00 sec 110 MBytes 923 Mbits/sec 2 506 KBytes
[ 5] 59.00-60.00 sec 110 MBytes 923 Mbits/sec 0 650 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.40 GBytes 916 Mbits/sec 2579 sender
[ 5] 0.00-60.01 sec 6.40 GBytes 916 Mbits/sec receiver
AVX2
[ 5] 55.00-56.00 sec 141 MBytes 1.18 Gbits/sec 4 649 KBytes
[ 5] 56.00-57.00 sec 142 MBytes 1.20 Gbits/sec 1 626 KBytes
[ 5] 57.00-58.00 sec 142 MBytes 1.20 Gbits/sec 2 602 KBytes
[ 5] 58.00-59.00 sec 142 MBytes 1.20 Gbits/sec 1 571 KBytes
[ 5] 59.00-60.00 sec 141 MBytes 1.18 Gbits/sec 3 539 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 8.30 GBytes 1.19 Gbits/sec 981 sender
[ 5] 0.00-60.00 sec 8.30 GBytes 1.19 Gbits/sec receiver
-O3 + -march=native -mtune=native
C
[ 5] 55.00-56.00 sec 111 MBytes 933 Mbits/sec 0 680 KBytes
[ 5] 56.00-57.00 sec 111 MBytes 933 Mbits/sec 3 619 KBytes
[ 5] 57.00-58.00 sec 112 MBytes 944 Mbits/sec 1 554 KBytes
[ 5] 58.00-59.00 sec 111 MBytes 933 Mbits/sec 0 691 KBytes
[ 5] 59.00-60.00 sec 111 MBytes 933 Mbits/sec 2 634 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.48 GBytes 927 Mbits/sec 3267 sender
[ 5] 0.00-60.00 sec 6.47 GBytes 926 Mbits/sec receiver
AVX2
[ 5] 55.00-56.00 sec 139 MBytes 1.16 Gbits/sec 1 639 KBytes
[ 5] 56.00-57.00 sec 139 MBytes 1.16 Gbits/sec 1 607 KBytes
[ 5] 57.00-58.00 sec 138 MBytes 1.15 Gbits/sec 1 578 KBytes
[ 5] 58.00-59.00 sec 139 MBytes 1.16 Gbits/sec 2 546 KBytes
[ 5] 59.00-60.00 sec 138 MBytes 1.15 Gbits/sec 1 510 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 8.01 GBytes 1.15 Gbits/sec 312 sender
[ 5] 0.00-60.00 sec 8.00 GBytes 1.15 Gbits/sec receiver
Not really.
I tried the NEON implementation from the same author. It doesn't have as much of an edge over the generic one, at least on Neoverse-N1, so iperf3 speeds are comparable.
If anyone has an old Raspberry Pi and wants to test this on a slower CPU — be my guest. The code is in pretty bad shape since it's work in progress, but it does work.
Otherwise, I see no reason to merge NEON support.
generic
32: 2846064.09 op/s
256: 1424432.86 op/s
512: 850478.15 op/s
1024: 473508.42 op/s
16384: 33032.01 op/s
131072: 4146.80 op/s
1048576: 517.55 op/s
[ 5] 55.00-56.00 sec 100 MBytes 839 Mbits/sec 0 658 KBytes
[ 5] 56.00-57.00 sec 101 MBytes 849 Mbits/sec 8 571 KBytes
[ 5] 57.00-58.00 sec 102 MBytes 860 Mbits/sec 2 484 KBytes
[ 5] 58.00-59.00 sec 101 MBytes 849 Mbits/sec 0 622 KBytes
[ 5] 59.00-60.00 sec 101 MBytes 849 Mbits/sec 1 532 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 5.67 GBytes 812 Mbits/sec 2875 sender
[ 5] 0.00-60.00 sec 5.67 GBytes 812 Mbits/sec receiver
NEON
32: 2933899.98 op/s (+3%)
256: 1943603.43 op/s (+36%)
512: 1242317.12 op/s (+46%)
1024: 724495.40 op/s (+53%)
16384: 53621.92 op/s (+62%)
131072: 6754.49 op/s (+62%)
1048576: 844.45 op/s (+63%)
[ 5] 55.00-56.00 sec 101 MBytes 849 Mbits/sec 1 536 KBytes
[ 5] 56.00-57.00 sec 100 MBytes 839 Mbits/sec 0 666 KBytes
[ 5] 57.00-58.00 sec 101 MBytes 849 Mbits/sec 3 584 KBytes
[ 5] 58.00-59.00 sec 100 MBytes 839 Mbits/sec 0 704 KBytes
[ 5] 59.00-60.00 sec 102 MBytes 860 Mbits/sec 2 626 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 5.81 GBytes 832 Mbits/sec 1660 sender
[ 5] 0.00-60.00 sec 5.81 GBytes 832 Mbits/sec receiver
Badly modified `ns_ping.py` for running iperf3
```python
#!/usr/bin/env python3
"""Create two network namespaces and run iperf3 between them."""

import subprocess as subp
import typing as T

from testlib import external as ext, util, template, cmd
from testlib.log import log
from testlib.proc import Tinc, Script
from testlib.test import Test

util.require_root()
util.require_command("ip", "netns", "list")
util.require_path("/dev/net/tun")

IP_FOO = "192.168.1.1"
IP_BAR = "192.168.1.2"
MASK = 24


def init(ctx: Test) -> T.Tuple[Tinc, Tinc]:
    """Initialize new test nodes."""
    foo, bar = ctx.node(), ctx.node()

    log.info("create network namespaces")
    assert ext.netns_add(foo.name)
    assert ext.netns_add(bar.name)

    log.info("initialize two nodes")

    stdin = f"""
    init {foo}
    set Port 0
    set Subnet {IP_FOO}
    set Interface {foo}
    set Address localhost
    set AutoConnect no
    """
    foo.cmd(stdin=stdin)
    foo.add_script(Script.TINC_UP, template.make_netns_config(foo.name, IP_FOO, MASK))
    foo.start()

    stdin = f"""
    init {bar}
    set Port 0
    set Subnet {IP_BAR}
    set Interface {bar}
    set Address localhost
    set AutoConnect no
    """
    bar.cmd(stdin=stdin)
    bar.add_script(Script.TINC_UP, template.make_netns_config(bar.name, IP_BAR, MASK))

    cmd.exchange(foo, bar)

    return foo, bar


def ping(namespace: str, ip_addr: str) -> int:
    """Run an iperf3 client against a node in another network namespace."""
    log.info("running iperf3 from netns %s against %s", namespace, ip_addr)
    proc = subp.run(
        ["ip", "netns", "exec", namespace, "iperf3", "-t", "60", "-c", ip_addr],
        check=False,
    )
    log.info("iperf3 finished with code %d", proc.returncode)
    return proc.returncode


with Test("ns-ping") as context:
    foo_node, bar_node = init(context)
    bar_node.cmd("start")

    log.info("waiting for nodes to come up")
    bar_node[Script.TINC_UP].wait()

    log.info("add script foo/host-up")
    bar_node.add_script(foo_node.script_up)

    log.info("add ConnectTo clause")
    bar_node.cmd("add", "ConnectTo", foo_node.name)

    log.info("bar waits for foo")
    bar_node[foo_node.script_up].wait()

    # Start an iperf3 server in bar's namespace before running the client.
    subp.Popen(
        ["ip", "netns", "exec", bar_node.name, "iperf3", "-s"],
        stdout=subp.DEVNULL,
        stderr=subp.DEVNULL,
    )

    log.info("iperf3 must work after the connection is up")
    assert not ping(foo_node.name, IP_BAR)
```
The benchmark code you wrote is not really measuring the usage pattern tinc has. The sptps_speed program (which you might have to explicitly build using ninja -C build src/sptps_speed) tests the SPTPS protocol more thoroughly, using typical MTU-sized packets.
Performance measurements using debug builds are mostly useless. Have you tried configuring the build system with -Dbuildtype=debugoptimized?
I also have a branch (PR #360) that adds AES-256-GCM support to SPTPS, which depends on OpenSSL, but it will then also use OpenSSL for Chacha20-Poly1305. It might be interesting to compare the speed with this optimized version as well.
Oh yes. If you're willing to reintroduce a dependency on libssl, that would of course be best. We'd get highly optimized code for all possible architectures for free.
TL;DR:
| flavor | throughput |
|---|---|
| generic | 0.974 Gbit/s |
| avx2 | 1.19 Gbit/s |
| libssl1.1 | 1.28 Gbit/s |
| libssl3.1 | 1.23 Gbit/s* |
*: probably a debug build
detailed results
generic
Generating keys for 10 seconds: 27182.71 op/s
Ed25519 sign for 10 seconds: 25932.53 op/s
Ed25519 verify for 10 seconds: 8360.30 op/s
ECDH for 10 seconds: 6783.36 op/s
SPTPS/TCP authenticate for 10 seconds: 3173.35 op/s
SPTPS/TCP transmit for 10 seconds: 2.47 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3167.00 op/s
SPTPS/UDP transmit for 10 seconds: 2.46 Gbit/s
[ 5] 55.00-56.00 sec 116 MBytes 975 Mbits/sec 0 686 KBytes
[ 5] 56.00-57.00 sec 119 MBytes 996 Mbits/sec 1 634 KBytes
[ 5] 57.00-58.00 sec 118 MBytes 986 Mbits/sec 2 573 KBytes
[ 5] 58.00-59.00 sec 118 MBytes 986 Mbits/sec 1 506 KBytes
[ 5] 59.00-60.00 sec 118 MBytes 986 Mbits/sec 0 662 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.80 GBytes 974 Mbits/sec 5725 sender
[ 5] 0.00-60.01 sec 6.80 GBytes 973 Mbits/sec receiver
avx2
Generating keys for 10 seconds: 27313.96 op/s
Ed25519 sign for 10 seconds: 26049.17 op/s
Ed25519 verify for 10 seconds: 8411.54 op/s
ECDH for 10 seconds: 6792.54 op/s
SPTPS/TCP authenticate for 10 seconds: 3156.87 op/s
SPTPS/TCP transmit for 10 seconds: 4.00 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3160.10 op/s
SPTPS/UDP transmit for 10 seconds: 4.13 Gbit/s
[ 5] 25.00-26.00 sec 142 MBytes 1.20 Gbits/sec 1 509 KBytes
[ 5] 26.00-27.00 sec 142 MBytes 1.20 Gbits/sec 0 691 KBytes
[ 5] 27.00-28.00 sec 142 MBytes 1.20 Gbits/sec 1 672 KBytes
[ 5] 28.00-29.00 sec 142 MBytes 1.20 Gbits/sec 1 649 KBytes
[ 5] 29.00-30.00 sec 142 MBytes 1.20 Gbits/sec 2 619 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 4.12 GBytes 1.19 Gbits/sec 273 sender
[ 5] 0.00-30.00 sec 4.12 GBytes 1.19 Gbits/sec receiver
alt (OpenSSL 1.1.1o)
Generating keys for 10 seconds: 27294.88 op/s
Ed25519 sign for 10 seconds: 26004.59 op/s
Ed25519 verify for 10 seconds: 8409.03 op/s
ECDH for 10 seconds: 6786.60 op/s
Cipher suite 0 (Chacha20-Poly1305):
SPTPS/TCP authenticate for 10 seconds: 3182.28 op/s
SPTPS/TCP transmit for 10 seconds: 6.24 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3178.83 op/s
SPTPS/UDP transmit for 10 seconds: 6.16 Gbit/s
Cipher suite 1 (AES-256-GCM):
SPTPS/TCP authenticate for 10 seconds: 3177.70 op/s
SPTPS/TCP transmit for 10 seconds: 8.89 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3169.44 op/s
SPTPS/UDP transmit for 10 seconds: 8.73 Gbit/s
[ 5] 25.00-26.00 sec 154 MBytes 1.29 Gbits/sec 1 628 KBytes
[ 5] 26.00-27.00 sec 154 MBytes 1.29 Gbits/sec 4 619 KBytes
[ 5] 27.00-28.00 sec 155 MBytes 1.30 Gbits/sec 2 611 KBytes
[ 5] 28.00-29.00 sec 155 MBytes 1.30 Gbits/sec 2 595 KBytes
[ 5] 29.00-30.00 sec 154 MBytes 1.29 Gbits/sec 1 585 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 4.42 GBytes 1.28 Gbits/sec 3427 sender
[ 5] 0.00-30.00 sec 4.41 GBytes 1.28 Gbits/sec receiver
alt (OpenSSL 3.1)
I don't remember what flags were used to build the library. It may be a debug build, so don't pay much attention to it.
Generating keys for 10 seconds: 27368.60 op/s
Ed25519 sign for 10 seconds: 26068.22 op/s
Ed25519 verify for 10 seconds: 8434.89 op/s
ECDH for 10 seconds: 6839.51 op/s
Cipher suite 0 (Chacha20-Poly1305):
SPTPS/TCP authenticate for 10 seconds: 3152.25 op/s
SPTPS/TCP transmit for 10 seconds: 5.83 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3149.35 op/s
SPTPS/UDP transmit for 10 seconds: 5.77 Gbit/s
Cipher suite 1 (AES-256-GCM):
SPTPS/TCP authenticate for 10 seconds: 3151.33 op/s
SPTPS/TCP transmit for 10 seconds: 7.69 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3149.60 op/s
SPTPS/UDP transmit for 10 seconds: 7.88 Gbit/s
[ 5] 25.00-26.00 sec 148 MBytes 1.24 Gbits/sec 4 670 KBytes
[ 5] 26.00-27.00 sec 148 MBytes 1.24 Gbits/sec 2 649 KBytes
[ 5] 27.00-28.00 sec 148 MBytes 1.24 Gbits/sec 1 632 KBytes
[ 5] 28.00-29.00 sec 148 MBytes 1.24 Gbits/sec 2 609 KBytes
[ 5] 29.00-30.00 sec 148 MBytes 1.24 Gbits/sec 1 591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 4.24 GBytes 1.23 Gbits/sec 973 sender
[ 5] 0.00-30.00 sec 4.24 GBytes 1.22 Gbits/sec receiver
I'll leave it here for now until after #360 is merged.