
High CPU load on cloud VMs

Open nh2 opened this issue 8 years ago • 59 comments

This weekend I benchmarked Tinc (multiple versions) on the smallest DigitalOcean instance.

I found that it generates very high CPU load (much more than on a desktop), and network performance maxes out at around 200 Mbit/s with iperf3 from one Tinc node to another, while the raw interface can deliver 1000 Mbit/s without noticeable CPU load.

This is a write-up of the small investigation I did, intended as a help in case a third-party Tinc contributor wants to try to solve this problem (or in case I find the time for it myself at some point).

Note that I'm extremely new to Tinc (I only started reading the code this weekend), so some of this may not be 100% accurate, but @gsliepen was extremely helpful in answering my questions on the #tinc IRC channel - thanks for this.

There seem to be two things that make Tinc slow on these machines:

  • Encryption:
    • The latest Tinc 1.1 branch switched from AES+SHA to ChaCha20-Poly1305 (also, the Cipher and Digest options are no longer configurable and are ignored). While ChaCha is faster than AES on standard machines, it cannot compete with the AES-NI hardware acceleration for AES that many modern CPUs (including those cloud machines) provide; the speed difference is approximately 4x (this can also be measured with Tinc's sptps_speed utility).
    • ChaCha takes around 65% of CPU time in perf (make sure to use CFLAGS=-fno-omit-frame-pointer ./configure for measuring this).
    • Re-enabling AES would make the symmetric encryption overhead pretty much negligible, at least on Gigabit-Link machines.
    • But according to @gsliepen, this is not so easy / there is a trade-off involved, because the OpenSSL interface that Tinc used in the past for AES(-NI) is not quite stable. Maybe libsodium could be used instead? No software fallback seems to be provided, though.
  • Syscalls:
    • When watching Tinc in htop while iperf3 is running, there's a lot of red, meaning time spent in the kernel. This is even more prominent when Cipher = none and Digest = none, which is still possible in Tinc 1.0.

    • In htop, when enabling Display options -> Detailed CPU, I can see that around 25-30% of the time is spent in softirq processing (violet colour). This also hints that lots of expensive syscalls are being made (syscalls are not as expensive on physical machines, but under some forms of virtualisation they can be very expensive).

    • Changing the MTU might be a quick fix here, but one can't raise the MTU above 1500 on DigitalOcean instances (no UDP packets go through then); AWS EC2 supports Jumbo Frames with MTU = 9000 (and Tinc has a ./configure option for that), but only on specific instance types, so it's not a general-purpose workaround (definitely not once you send data outside of your data center).

    • The best way to improve this is to reduce the number of syscalls by using Linux's high-performance multi-packet syscalls like writev(), recvmmsg() and sendmmsg(), so that we don't have to make one syscall per UDP packet (which is a lot of syscalls, given that the MTU restricts each packet to 1500 bytes); a sendmmsg() sketch follows after this list.

    • Tinc 1.1 has already implemented the use of recvmmsg() (original patch with details here).

    • For the passing-by reader, the flow through kernel and userspace with Tinc as a user-space VPN works like this: a UDP packet arrives on the real socket, tincd reads it, decrypts it, and writes it to the tun interface file descriptor, from which the local client application can read it as if it had arrived over a real socket. When an application sends something over the Tinc VPN, i.e. over the VPN network interface, tincd reads it from the tun device, encrypts it, and sends it out over the real network socket.

    • Tinc 1.0 performs the syscall chain select - read - sendto - recvfrom - write for each UDP packet it receives, that is:

      • select (wait for data)
      • read (from tun device from which user data comes)
      • sendto (to the tinc peer via UDP)
      • recvfrom (from the tinc peer via UDP)
      • write (to tun device for user application)
    • We can observe this nicely by running perf trace -p $(pidof tincd) during an iperf3 session:

      23156.949 ( 0.005 ms): pselect6(n: 8, inp: 0x7fff63d12350, outp: 0x7fff63d123d0, tsp: 0x7fff63d12270, sig: 0x7fff63d12280) = 2
      23156.957 ( 0.004 ms): read(fd: 3</dev/net/tun>, buf: 0x7fff63d11c16, count: 1508            ) = 1457
      23156.972 ( 0.011 ms): sendto(fd: 5, buff: 0x7fff63d11c08, len: 1471, addr: 0x7f6397e1f480, addr_len: 16) = 1471
      23156.979 ( 0.004 ms): recvfrom(fd: 5, ubuf: 0x7fff63d11528, size: 1661, addr: 0x7fff63d11500, addr_len: 0x7fff63d114ec) = 70
      23156.992 ( 0.009 ms): write(fd: 3</dev/net/tun>, buf: 0x7fff63d11536, count: 56             ) = 56
      

      Notably (this is Tinc 1.0.26), one syscall is made per UDP packet; we can see this in the sizes of read() and sendto(), in this case 1457 and 1471 - just below the interface's default MTU of 1500. (The reads from the other side are small, which makes sense, as I'm sending from this node using iperf -c othernode.)

    • As mentioned above, in Tinc 1.1 recvmmsg() is used, which batches many of those little recvfroms, and improves the syscall chain to N * (select - read - sendto - gettimeofday) - recvmmsg - write write write...

    • For one of these chains, the time spent is roughly (on that smallest DigitalOcean instance):

      • 8 ms for the N write calls
      • 0.2 ms for the recvmmsg that receives the equivalent data of all those writes - that optimisation seems to have worked very well: a 40x syscall time difference for the same data
      • 8 ms for the N select - read - sendto - gettimeofday calls, of this roughly:
        • 2 ms for select
        • 2 ms for read
        • 2 ms for sendto
        • 2 ms for gettimeofday
    • Consequently, there's still approximately as much to optimise on the "write to tun device" side as there is on the "read from real socket" side.

    • It seems that the following optimisations remain possible:

      • Batching the writes into one writev().
        • Here's an overview of the relevant code (as of commit e44c337ea): At the place where the recvmmsg() takes place, it does for(int i = 0; i < num /* received packets */; i++) { handle_incoming_vpn_packet(ls, &pkt[i], &addr[i]) } for each packet, and that handle_incoming_vpn_packet() eventually leads via receive_packet(), route(), route_ipv4(), send_packet(), devops.write(packet), write_packet() to the write() syscall. I assume that's the data path that would have to be changed to operate on the entire num packets if we eventually want to write them in one writev() call.
        • My guess is that all functions on this code path would have to be changed to take an array of packets instead of a single packet.
        • Care must be taken at the route() function, where the packets can be split: some may not be for us but need to be forwarded to other nodes, so they would not be written to our tun device. However, writev() should still be able to deal with this in a zero-copy fashion, since the N starting addresses it takes do not have to be contiguous.
      • Batching the sendtos into one sendmmsg().
        • This is very analogous to the writev() point above, but for the real socket, not to the tun device.
        • It is my understanding that no special care with routing would be needed, since, on their way outward, all packets already contain their target addresses, from which the inputs to sendmmsg() can be constructed directly.
      • Batching the reads into one bigger read().
        • I did not look into this in detail, but since the tun device is just a file descriptor, my guess is that we could simply read bigger chunks with the same syscall here. But I may be wrong.
      • Removing the gettimeofdays
        • That gettimeofday is new in Tinc 1.1; version 1.0 didn't have it per packet. I think it is used to check that the MAC'd packet is recent, but I'm not sure yet.
        • It is a common optimisation in web servers to debounce these gettimeofday calls so that they happen only in specific scheduled intervals, typically using a thread.
        • @gsliepen mentioned that only one of these is needed per select() call, of which we would have far fewer once both recvmmsg() and sendmmsg() are implemented, so this optimisation may no longer be necessary at that point.
    • As a result, the optimal syscall chain would probably be: select - read - sendmmsg - gettimeofday - recvmmsg - writev.

    • I expect that we could get a similar 40x overhead reduction as with recvmmsg() - if this turns out to be true, we'd be in good shape.

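To make the batching idea concrete, here is a minimal sketch of batched UDP sending with sendmmsg() - this is not tinc code; the names send_batch, BATCH and bufs are made up for illustration, and the point is simply that one syscall can hand many packets to the kernel at once:

    /* Minimal sketch of batched UDP sending with sendmmsg() (Linux >= 3.0).
     * Not tinc code; all names below are made up for illustration. */
    #define _GNU_SOURCE
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <stdio.h>

    #define BATCH 64
    #define MTU 1500

    /* Sends up to `count` packets to `dst` in a single syscall and returns how
     * many were actually handed to the kernel (or -1 on error). */
    static int send_batch(int sock, struct sockaddr_in *dst,
                          unsigned char bufs[][MTU], size_t *lens, unsigned count)
    {
        struct mmsghdr msgs[BATCH];
        struct iovec iovs[BATCH];

        if (count > BATCH)
            count = BATCH;
        memset(msgs, 0, sizeof(msgs));

        for (unsigned i = 0; i < count; i++) {
            iovs[i].iov_base = bufs[i];            /* one buffer per packet */
            iovs[i].iov_len = lens[i];
            msgs[i].msg_hdr.msg_iov = &iovs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
            msgs[i].msg_hdr.msg_name = dst;        /* could differ per packet */
            msgs[i].msg_hdr.msg_namelen = sizeof(*dst);
        }

        int sent = sendmmsg(sock, msgs, count, 0); /* one syscall for `count` packets */
        if (sent < 0)
            perror("sendmmsg");
        return sent;
    }

recvmmsg() (already used by Tinc 1.1) works the same way in the receive direction; the harder part, as described above, is threading arrays of packets through tinc's routing code.
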
Overall, I'm quite confident that with these two optimisations (hardware-accelerated AES and multi-packet bundling syscalls), Tinc will be able to achieve Gigabit link speed with little to negligible CPU utilisation, even on those small cloud instances. (And maybe saturate 10 Gig Ethernet on real machines?)

Now we just have to implement them :)

nh2 avatar Mar 26 '16 22:03 nh2

Correction from @gsliepen:

writev() is not the equivalent of sendmmsg(). If you do a writev() to a tun device, it will be treated as one packet.

last time I looked there was some functionality in the kernel, but it was not exposed to userspace

Indeed.

There's also an earlier comment on the mailing list about this.

My guess is that this is the kernel patch in question.

nh2 avatar Mar 26 '16 23:03 nh2

Regarding crypto:

  • gcrypt can do AES-NI too (http://lists.gnu.org/archive/html/info-gnu/2011-06/msg00012.html) and Tinc already works with that
  • https://www.gnupg.org/blog/20131215-gcrypt-bench.html contains an (already older) comparison between Salsa20 and AES-NI on gcrypt

nh2 avatar Mar 27 '16 00:03 nh2

I'm using tinc 1.0 and see similar high CPU loads. Would it be worth switching to 1.1?

ewoutp avatar Apr 29 '16 11:04 ewoutp

On Fri, Apr 29, 2016 at 04:51:13AM -0700, Ewout Prangsma wrote:

I'm using tinc 1.0 and see similar high CPU loads. Would it be worth switching to 1.1?

You can try. Tinc 1.1 supports recvmmsg(), which might reduce system call overhead a bit. But other than that, there is not much that would make 1.1 have a lower CPU load than 1.0.

Met vriendelijke groet / with kind regards, Guus Sliepen [email protected]

gsliepen avatar Apr 29 '16 13:04 gsliepen

Regarding encryption performance: I suspect the implementation of Chacha20/Poly1305 in tinc 1.1 is relatively slow compared to alternatives. It's a somewhat naive implementation written in plain C with no CPU-specific optimizations. I believe @gsliepen initially opted to use that because it was the simplest option at the time - no need for external dependencies (the implementation is inside the tinc source tree) or exotic build rules. Also, at the time, third-party crypto libraries either did not support this cipher or were just as slow, but I suspect that's no longer true today.

Benchmarks show that optimized implementations can make a lot of difference: https://bench.cr.yp.to/impl-stream/chacha20.html

dechamps avatar May 17 '17 20:05 dechamps

I'm not in a position to assist at all, but I just wanted to say it's a fantastic write up @nh2 - thanks for taking the time :)

stevesbrain avatar May 19 '17 01:05 stevesbrain

@nh2 we too are facing this issue on our DigitalOcean cloud VMs. We are running the older 1.0 branch (for stability reasons) and currently seeing only rx: 23.56 Mbit/s (3257 p/s) and tx: 50.07 Mbit/s (8574 p/s) on one of our more central nodes.

splitice avatar Jun 01 '17 03:06 splitice

My guess is that this is the kernel patch in question.

I've sent an email to @alexgartrell to ask him if he's still interested in this patch.

It would be great if this could land in the kernel!

nh2 avatar Jun 01 '17 20:06 nh2

@nh2 isn't IFF_MULTI_READ for the read side of tun, not the write? Am I misunderstanding that patch?

splitice avatar Jun 01 '17 22:06 splitice

@splitice Your understanding of the initial submission of the patch is right, but the conversation ended with

Sounds good to me. I'll get a patch turned around soon.

replying to

If we were to use recvmmsg obviously we'd create a new interface based on sockets for tun and expose the existing socket through that.

The current file-based tun interface was never designed to be a high-performance interface. So let's take this opportunity and create a new interface

so my understanding was that this (making a tun replacement on which you can use all of the *mmsg functions) was the plan.

nh2 avatar Jun 01 '17 23:06 nh2

I just did some benchmarks today, on my Intel Core i7-2600 running Linux 4.14.

The baseline, using GCC 7.2.0 and tinc bdeba3f (latest 1.1 repo), is 1.86 Gbit/s raw SPTPS performance according to sptps_test ("SPTPS/UDP transmit"). Fiddling with GCC optimizations (-O3 -march=native) doesn't seem to change anything. Switching to clang 5.0.1 tends to make things worse (1.65 Gbit/s), unless further optimizations (beyond -O2) are enabled, in which case it's on par with GCC.

I set up a more realistic benchmark involving two tinc nodes running on the same machine, and then using iperf3 over the tunnel between the two nodes. In the baseline setup, the iperf3 throughput was 650 Mbit/s. During the test, both nodes used ~6.5 seconds of user CPU time per GB each. In addition, the transmitting node used ~6.2 seconds of kernel CPU time per GB, while the receiving node used ~5.5 seconds of kernel time per GB. (In other words, user/kernel CPU usage is roughly 50/50.)

I hacked together a patch to make tinc use OpenSSL 1.1.0g (EVP_chacha20_poly1305()) for Chacha20-Poly1305, instead of the tinc built-in code. Indeed OpenSSL has more optimized code for this cipher, including hand-written assembly. As a result, raw SPTPS performance jumped to 4.19 Gbit/s, a ~2X improvement over the baseline. (I would expect more bleeding-edge versions of OpenSSL to provide even better performance, as more CPU-specific optimizations have been done recently.)

Unfortunately, because tinc spends a lot of time in the kernel, the improvement in the iperf3 benchmark was not as impressive: 785 Mbit/s, using ~4.0 seconds of user CPU time per GB. (Which means user/kernel CPU usage is roughly 40/60 in this test.)

I also tried libsodium 1.0.16, but the raw SPTPS performance wasn't as impressive: 1.95 Gbit/s, barely an improvement over the tinc built-in code.

It looks like it would be worth it to use OpenSSL for Chacha20-Poly1305 as it is clearly much faster than the current code. But in any case, the syscall side of things definitely needs to be improved as well as it immediately becomes the dominant bottleneck as crypto performance improves.
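
For readers curious what the OpenSSL route looks like, here is a rough sketch of one-shot ChaCha20-Poly1305 encryption through the EVP interface available since OpenSSL 1.1.0 - this is not the actual patch, the function name chachapoly_encrypt is made up, and error handling is kept minimal:

    /* Rough sketch of ChaCha20-Poly1305 encryption via OpenSSL >= 1.1.0's EVP
     * interface. Not the actual tinc patch. */
    #include <openssl/evp.h>

    /* Encrypts `inlen` bytes of `in` into `out`, writing the 16-byte Poly1305 tag
     * into `tag`. `key` is 32 bytes, `nonce` is 12 bytes. Returns the ciphertext
     * length, or -1 on failure. */
    static int chachapoly_encrypt(const unsigned char key[32],
                                  const unsigned char nonce[12],
                                  const unsigned char *in, int inlen,
                                  unsigned char *out, unsigned char tag[16])
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len = 0, outlen = -1;

        if (!ctx)
            return -1;

        if (EVP_EncryptInit_ex(ctx, EVP_chacha20_poly1305(), NULL, key, nonce) == 1 &&
            EVP_EncryptUpdate(ctx, out, &len, in, inlen) == 1) {
            outlen = len;
            if (EVP_EncryptFinal_ex(ctx, out + outlen, &len) == 1 &&
                EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_AEAD_GET_TAG, 16, tag) == 1)
                outlen += len;
            else
                outlen = -1;
        }

        EVP_CIPHER_CTX_free(ctx);
        return outlen;
    }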

dechamps avatar Jan 06 '18 19:01 dechamps

Yeah and syscall overhead is only going to grow thanks to Meltdown. I think a plan of attack is needed:

  1. Investigate ways to read AND write multiple packets in one go from/to /dev/tun.
  2. Modify tinc to batch writes to sockets and to /dev/tun.
  3. Make tinc use the OpenSSL versions of Chacha20-Poly1305 if it's linking with OpenSSL anyway.

I think that can all be done in parallel. Item 2 can just do write() in a loop until we find the optimum way to send batches of packets to /dev/tun.

I'd like to keep the C-only version of Chacha20-Poly1305 in tinc; it's very nice for running tinc on embedded devices where space is at a premium.

gsliepen avatar Jan 06 '18 20:01 gsliepen

Another data point: if I bypass the crypto completely, I get 22.87 Gbit/s on the raw SPTPS throughput test (duh...). iperf3 throughput is about 1 Gbit/s, and user CPU usage is around ~2 seconds per gigabyte.

@gsliepen: according to my perf profiling on the sending side, it kinda looks like it's the UDP socket path that's expensive, not the TUN/TAP I/O paths. Though I suppose the relative cost of these paths could be system dependent.

   - 75.25% do_syscall_64                                                                                                                  ▒
      - 43.92% sys_sendto                                                                                                                  ▒
         - 43.80% SYSC_sendto                                                                                                              ▒
            - 42.69% sock_sendmsg                                                                                                          ▒
               - 42.42% inet_sendmsg                                                                                                       ▒
                  - 41.69% udp_sendmsg                                                                                                     ▒
                     - 32.25% udp_send_skb                                                                                                 ▒
                        - 31.73% ip_send_skb                                                                                               ▒
                           - 31.57% ip_local_out                                                                                           ▒
                              - 31.29% ip_output                                                                                           ▒
                                 - 30.95% ip_finish_output                                                                                 ▒
                                    - 30.62% ip_finish_output2                                                                             ▒
                                       - 26.26% __local_bh_enable_ip                                                                       ▒
                                          - 26.09% do_softirq.part.17                                                                      ▒
                                             - do_softirq_own_stack                                                                        ▒
                                                - 25.59% __softirqentry_text_start                                                         ▒
                                                   - 24.97% net_rx_action                                                                  ▒
                                                      + 24.25% process_backlog                                                             ▒
                                       + 3.81% dev_queue_xmit                                                                              ▒
                     + 5.82% ip_make_skb                                                                                                   ▒
                     + 2.59% ip_route_output_flow                                                                                          ▒
      - 10.58% sys_write                                                                                                                   ▒
         - 10.42% vfs_write                                                                                                                ▒
            - 10.12% __vfs_write                                                                                                           ▒
               - 10.08% new_sync_write                                                                                                     ▒
                  - 10.02% tun_chr_write_iter                                                                                              ▒
                     - 9.86% tun_get_user                                                                                                  ▒
                        - 8.75% netif_receive_skb                                                                                          ▒
                           - 8.72% netif_receive_skb_internal                                                                              ▒
                              - 8.63% __netif_receive_skb                                                                                  ▒
                                 - __netif_receive_skb_core                                                                                ▒
                                    - 8.46% ip_rcv                                                                                         ▒
                                       - 8.32% ip_rcv_finish                                                                               ▒
                                          - 8.02% ip_local_deliver                                                                         ▒
                                             - 7.99% ip_local_deliver_finish                                                               ▒
                                                - 7.87% tcp_v4_rcv                                                                         ▒
                                                   - 7.36% tcp_v4_do_rcv                                                                   ▒
                                                      + 7.26% tcp_rcv_established                                                          ▒
      - 9.98% sys_select                                                                                                                   ▒
         - 8.13% core_sys_select                                                                                                           ▒
            - 6.22% do_select                                                                                                              ▒
                 1.80% sock_poll                                                                                                           ▒
               + 0.98% __fdget                                                                                                             ▒
                 0.82% tun_chr_poll                                                                                                        ▒
      - 6.87% sys_read                                                                                                                     ▒
         - 6.56% vfs_read                                                                                                                  ▒
            - 5.32% __vfs_read                                                                                                             ▒
               - 5.21% new_sync_read                                                                                                       ▒
                  - 5.00% tun_chr_read_iter                                                                                                ▒
                     - 4.76% tun_do_read.part.42                                                                                           ▒
                        + 3.15% skb_copy_datagram_iter                                                                                     ▒
                        + 0.91% consume_skb                                                                                                ▒
            + 0.82% rw_verify_area                                                                                                         

One thing that comes to mind would be to have the socket I/O done in a separate thread. Not only would that scale better (the crypto would be done in parallel with I/O, enabling the use of multiple cores), it would also make it possible for that thread to efficiently use sendmmsg() if more than one packet has accumulated in the sending queue since the last send call started (coalescing).

dechamps avatar Jan 06 '18 22:01 dechamps

Hm, where's sys_recvfrom in your perf output? It would be nice to see how that compares to sys_sendto. (Of course, I should just run perf myself...)

gsliepen avatar Jan 07 '18 11:01 gsliepen

I did not include it because it's negligible (thanks to the use of recvmmsg(), I presume):

      - 76.54% do_syscall_64                                                                                                               ▒
         + 43.47% sys_sendto                                                                                                               ▒
         + 11.39% sys_write                                                                                                                ▒
         + 10.27% sys_select                                                                                                               ▒
         + 6.74% sys_read                                                                                                                  ▒
         + 1.56% syscall_slow_exit_work                                                                                                    ▒
         - 1.51% sys_recvmmsg                                                                                                              ▒
            - 1.49% __sys_recvmmsg                                                                                                         ▒
               - 1.44% ___sys_recvmsg                                                                                                      ▒
                  - 1.01% sock_recvmsg_nosec                                                                                               ▒
                     - 1.00% inet_recvmsg                                                                                                  ▒
                          0.97% udp_recvmsg                                                                                                ▒
           0.70% syscall_trace_enter                                                                                                       

Here's how things look on the receiving side (which is not the bottleneck in my benchmark):

      - 73.10% do_syscall_64                                                                                                               ▒
         - 26.20% sys_sendto                                                                                                               ▒
            - 26.09% SYSC_sendto                                                                                                           ▒
               - 25.41% sock_sendmsg                                                                                                       ▒
                  - 25.26% inet_sendmsg                                                                                                    ▒
                     - 24.96% udp_sendmsg                                                                                                  ▒
                        + 19.31% udp_send_skb                                                                                              ▒
                        + 3.17% ip_make_skb                                                                                                ▒
                        + 1.73% ip_route_output_flow                                                                                       ▒
         - 26.06% sys_write                                                                                                                ▒
            - 25.54% vfs_write                                                                                                             ▒
               - 24.47% __vfs_write                                                                                                        ▒
                  - 24.36% new_sync_write                                                                                                  ▒
                     - 24.02% tun_chr_write_iter                                                                                           ▒
                        - 23.43% tun_get_user                                                                                              ▒
                           - 17.87% netif_receive_skb                                                                                      ▒
                              - 17.74% netif_receive_skb_internal                                                                          ▒
                                 - 17.29% __netif_receive_skb                                                                              ▒
                                    - 17.20% __netif_receive_skb_core                                                                      ▒
                                       - 16.42% ip_rcv                                                                                     ▒
                                          - 15.77% ip_rcv_finish                                                                           ▒
                                             + 14.30% ip_local_deliver                                                                     ▒
                                             + 0.95% tcp_v4_early_demux                                                                    ▒
                           + 1.12% copy_page_from_iter                                                                                     ▒
                           + 0.97% skb_probe_transport_header.constprop.62                                                                 ▒
                             0.90% __skb_get_hash_symmetric                                                                                ▒
                           + 0.74% build_skb                                                                                               ▒
         - 8.42% sys_select                                                                                                                ▒
            - 6.92% core_sys_select                                                                                                        ▒
               + 5.44% do_select                                                                                                           ▒
         - 5.84% sys_recvmmsg                                                                                                              ▒
            - 5.72% __sys_recvmmsg                                                                                                         ▒
               - 5.48% ___sys_recvmsg                                                                                                      ▒
                  - 2.83% sock_recvmsg_nosec                                                                                               ▒
                     - 2.81% inet_recvmsg                                                                                                  ▒
                        + 2.75% udp_recvmsg                                                                                                ▒
                  + 1.15% sock_recvmsg                                                                                                     ▒
                  + 0.85% copy_msghdr_from_user                                                                                            ▒
         - 3.05% sys_read                                                                                                                  ▒
            - 2.84% vfs_read                                                                                                               ▒
               - 2.10% __vfs_read                                                                                                          ▒
                  - 2.02% new_sync_read                                                                                                    ▒
                     - 1.86% tun_chr_read_iter                                                                                             ▒
                        + 1.69% tun_do_read.part.42                                                                                        

Even on the receiving side the UDP RX path is quite efficient and there is still a large amount of time spent in the UDP send path (presumably to send the TCP acknowledgements for the iperf3 stream).

(Note: as a reminder, all my perf reports are with all chacha-poly1305 code bypassed, to make syscall performance issues more obvious.)

dechamps avatar Jan 07 '18 12:01 dechamps

I believe the main reason the UDP TX path is so slow is that Linux runs all kinds of CPU-intensive logic in that call (including selecting routes and calling into netfilter, it seems), and it does that inline in the calling thread, even if the call is non-blocking.

If that's true, then it means that the performance of that path would also depend on the complexity of the network configuration on the machine that tinc is running on (i.e. routing table, iptables rules, etc.), which in the case of my test machine is actually fairly non-trivial, so my results might be a bit biased in that regard.

If we move these syscalls to a separate thread, then it might not do much in terms of CPU efficiency, but it would at least allow tinc to scale better by having all this kernel-side computation happen in parallel on a separate core.

dechamps avatar Jan 07 '18 12:01 dechamps

Ok, so we need four single-producer, single-consumer ringbuffers: tun rx, tun tx, udp rx, udp tx. Each ringbuffer gets its own thread to do I/O. We also need to signal the main event loop; on Linux this could be done using eventfd. The threads doing UDP I/O can use sendmmsg()/recvmmsg() if possible.
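
As a rough illustration of the signalling part (a sketch with made-up names under my own assumptions, not tinc code), an I/O thread could wake the main select() loop through an eventfd like this:

    /* Sketch of waking the main select() loop from an I/O thread via eventfd.
     * Made-up names; not tinc code. */
    #include <sys/eventfd.h>
    #include <unistd.h>
    #include <stdint.h>

    static int wakeup_fd;   /* created once at startup */

    void init_wakeup(void) {
        wakeup_fd = eventfd(0, EFD_NONBLOCK);
    }

    /* Called by an I/O thread after it has pushed packets into its ringbuffer. */
    void signal_main_loop(void) {
        uint64_t one = 1;
        ssize_t r = write(wakeup_fd, &one, sizeof(one));  /* bumps the eventfd counter */
        (void)r;
    }

    /* Called by the main loop when select() reports wakeup_fd as readable. */
    void drain_wakeups(void) {
        uint64_t count;
        ssize_t r = read(wakeup_fd, &count, sizeof(count));  /* resets the counter */
        (void)r;
        /* ...then pop packets from the ringbuffers and process them... */
    }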

gsliepen avatar Jan 07 '18 13:01 gsliepen

@dechamps I can confirm that on machines with more complicated routing rules, outgoing performance of tinc does decrease. I haven't got any benchmarks currently, but it was something our devops team noted between staging and production. It wasn't too significant for us though (our configuration involves 6-8 routing rules and significant iptables rules, which tinc is able to bypass early).

Does sendmmsg have to evaluate the routing tables for each packet, or does it cache (post-3.6, with the removal of the route cache)? Not that I'm saying this wouldn't lead to other savings, but it might not be the savings being imagined.

splitice avatar Jan 07 '18 13:01 splitice

Ok, so we need four single-producer, single-consumer ringbuffers: tun rx, tun tx, udp rx, udp tx. Each ringbuffer gets its own thread to do I/O.

Sounds good. As a first approach, TUN TX and UDP TX are much simpler than RX because these paths don't use the event loop at all - they just call write() and sendto() directly, dropping the packet on the floor if the call would block. This means there's no need to coordinate with the event loop for TX - it's just a matter of writing a drop-in replacement for write() and sendto() that transparently offloads the syscall to a separate thread for asynchronous execution, dropping the packet if the queue for that separate thread is too busy.
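
A minimal sketch of that idea, using plain pthreads and made-up names (my own assumptions, not the proof-of-concept code linked later in this thread): a bounded queue plus one worker thread, where the producer drops the packet instead of blocking when the queue is full:

    /* Sketch of a drop-in async replacement for sendto(): packets go into a
     * bounded queue and a worker thread performs the actual syscalls. */
    #include <pthread.h>
    #include <string.h>
    #include <sys/socket.h>

    #define QUEUE_LEN 128
    #define MAX_PKT   1600

    struct queued_packet {
        int sock;
        struct sockaddr_storage addr;
        socklen_t addrlen;
        size_t len;
        unsigned char data[MAX_PKT];
    };

    static struct queued_packet queue[QUEUE_LEN];
    static size_t q_head, q_tail, q_count;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t q_nonempty = PTHREAD_COND_INITIALIZER;

    /* Producer side, called from the main thread instead of sendto(). */
    int async_sendto(int sock, const void *buf, size_t len,
                     const struct sockaddr *addr, socklen_t addrlen) {
        pthread_mutex_lock(&q_lock);
        if (q_count == QUEUE_LEN || len > MAX_PKT) {
            pthread_mutex_unlock(&q_lock);
            return -1;                              /* queue full: drop the packet */
        }
        struct queued_packet *p = &queue[q_tail];
        p->sock = sock;
        p->len = len;
        memcpy(p->data, buf, len);
        memcpy(&p->addr, addr, addrlen);
        p->addrlen = addrlen;
        q_tail = (q_tail + 1) % QUEUE_LEN;
        q_count++;
        pthread_cond_signal(&q_nonempty);
        pthread_mutex_unlock(&q_lock);
        return (int)len;
    }

    /* Consumer side: the worker thread does the (possibly blocking) sendto(). */
    void *sender_thread(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&q_lock);
            while (q_count == 0)
                pthread_cond_wait(&q_nonempty, &q_lock);
            struct queued_packet pkt = queue[q_head];   /* copy, then release lock */
            q_head = (q_head + 1) % QUEUE_LEN;
            q_count--;
            pthread_mutex_unlock(&q_lock);
            sendto(pkt.sock, pkt.data, pkt.len, 0,
                   (struct sockaddr *)&pkt.addr, pkt.addrlen);
        }
        return NULL;
    }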

dechamps avatar Jan 07 '18 13:01 dechamps

After some more investigation, one potential issue with sending UDP packets asynchronously is that it prevents the main thread from immediately getting EMSGSIZE feedback for PMTU discovery purposes. The sender thread would have to call back to update the MTU, which suddenly makes everything way more complicated. We might have to choose between one or the other for the time being.

dechamps avatar Jan 07 '18 16:01 dechamps

Well, it slows it down a bit, but PMTU works without the EMSGSIZE feedback as well.

gsliepen avatar Jan 07 '18 16:01 gsliepen

I threw together some hacky code to move sendto() calls to a separate thread, along with a 128-packet queue (I have not tried sendmmsg()). iperf3 performance (with crypto enabled) improved from 650 Mbit/s (see https://github.com/gsliepen/tinc/issues/110#issuecomment-355771367) to 720 Mbit/s. New bottleneck appears to be the TUN write path.

If I combine that asynchronous UDP send code with my other hacky patch to use OpenSSL for Chacha20-Poly1305, I manage to reach 935 Mbit/s, just shy of the gigabit mark.

For those interested, here are the proof of concepts:

  • OpenSSL Chacha20-Poly1305: https://github.com/dechamps/tinc/commit/73d7ec16ae8dc7fb4ab8adcea169ceec5261ab34 (cannot interoperate with normal 1.1 nodes because the crypto code is incomplete, e.g. it ignores seqno)
  • Asynchronous multithreaded UDP send: https://github.com/dechamps/tinc/compare/43cf631bc10097448db041639ad07f84f647017e...dechamps:32ccdd8d9d5454da12eec1141ff2e734426c97e5 (usable, but it's missing configuration knobs and other minor stuff)

dechamps avatar Jan 07 '18 17:01 dechamps

Very nice. One issue though is that it's making copies of packets, for obvious reasons. But since we're worried about performance already, we should probably have a pool of vpn_packet_t's that we can pick from and hand them over to other threads.

I also like that drop-in C11 thread library!

gsliepen avatar Jan 07 '18 19:01 gsliepen

…and here's a proof of concept for the final piece, asynchronous TUN writes: https://github.com/dechamps/tinc/commit/001cd46a0d53488119dfe6bec62a3ad367b7f8ea

With that combined with the asynchronous UDP send, I get 770 Mbit/s. If I combine everything described so far (asynchronous UDP send and TUN write, plus OpenSSL crypto), I can reach 1140 Mbit/s, finally breaching that symbolic Gigabit barrier. This is ~1.75x vanilla tinc performance. It's quite possible that this can be improved further by tuning some knobs or writing things in a smarter way (such as the suggestion that @gsliepen just made above).

CPU usage looked as follows during the fully combined iperf3 stress test:

Thread      Sending node   Receiving node
Main        95%            95%
UDP send    75%            35%
TUN write   21%            53%

So basically, each tinc node is now able to scale to two CPUs instead of just one.

I haven't looked at the new bottlenecks too closely, but at first glance the main thread seems to be spending as much time in select() as in crypto code. Perhaps that could be the next area for improvement (epoll() comes to mind).
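
For reference, the epoll() approach registers the file descriptors once instead of rebuilding fd_sets on every iteration; a minimal sketch with assumed descriptors (not tinc code):

    /* Sketch of an epoll-based wait loop as an alternative to select().
     * tun_fd and udp_fd are assumed to exist; not tinc code. */
    #include <sys/epoll.h>
    #include <stdio.h>

    static int setup_epoll(int tun_fd, int udp_fd) {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN };

        ev.data.fd = tun_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, tun_fd, &ev);   /* registered once, not every loop */
        ev.data.fd = udp_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, udp_fd, &ev);
        return epfd;
    }

    static void wait_loop(int epfd) {
        struct epoll_event events[16];
        for (;;) {
            int n = epoll_wait(epfd, events, 16, -1);  /* blocks until an fd is readable */
            for (int i = 0; i < n; i++)
                printf("fd %d is readable\n", events[i].data.fd);  /* placeholder for read/recvmmsg handling */
        }
    }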

dechamps avatar Jan 07 '18 19:01 dechamps

I created a hopefully generic buffer pool with an asynchronous hook, see 9088b93. I kept async_device.[ch] and async_send.[ch], but instead of having the functions there do memcpy() into a buffer they get from the pool, functions that currently allocate a vpn_packet_t on the stack should do vpn_packet_t *pkt = async_pool_get(vpn_packet_pool) and pass that on until the place where we'd normally do the write(), and instead call async_pool_put(vpn_packet_pool, pkt). And something analogous for sending UDP packets, although that might be a bit more complicated.

gsliepen avatar Jan 08 '18 22:01 gsliepen

Now with asynchronous reading from the tun device: cc5e809.

gsliepen avatar Jan 14 '18 15:01 gsliepen

Any update on this? It sounds like it could really speed up tinc.

breisig avatar Mar 01 '18 05:03 breisig

I'm not sure I'll have the time to clean this up any time soon, so if anyone is up for it, feel free to pick this up. Pinging @millert as perhaps he might be interested in some more coding fun.

dechamps avatar Mar 01 '18 20:03 dechamps

I can devote some time to this. Do we want to go with tinycthread or would you rather use pthreads/winthreads directly?

millert avatar Mar 07 '18 18:03 millert

@millert Thanks for volunteering :) @gsliepen indicated in https://github.com/gsliepen/tinc/issues/110#issuecomment-355844766 that he liked the idea of using tinycthread, and I agree, so I would recommend using that. (The alternatives are writing different code for the two platforms - which is, well, not great - or using pthreads-Win32, but that's very old and requires adding a dependency on another library that needs to be linked in; whereas tinycthread is just a single drop-in C file and it's future-proof since it implements a standard C API.)

@gsliepen: did you measure any improvements when you experimented with a generic buffer pool? I suspect that this wouldn't make much of a difference and that it would be simpler to just do the naive thing like I did in my code, but I'll admit I'm just speculating here.

@millert: I'm not sure if you're interested in the OpenSSL crypto stuff too, or just the multi-threaded I/O. If you're interested in interfacing with OpenSSL for Chacha20-Poly1305, keep in mind that I have not checked that OpenSSL uses the same message formats and conventions (with respect to keys, etc.) that tinc uses. There is documented evidence that at least three incompatible variants of ChaCha20-Poly1305 exist in the wild, and it's not clear to me which one OpenSSL uses (or even which one tinc uses, for that matter). I did not attempt to make my experimental "OpenSSL tinc" communicate with "vanilla tinc" nodes, and I suspect there might be some challenges there.

dechamps avatar Mar 07 '18 18:03 dechamps