[RFC] eBPF offload consideration
Hi,
@rg0now and I have been investigating how to boost pion/turn performance with eBPF. As a first step, we implemented an eBPF/XDP offload for UDP channel bindings. This way, pion/turn can offload channel data processing to the kernel. Below we present our implementation details and early results, and we open a discussion on adding eBPF offload to pion/turn.
Implementation details
How does it work?
The XDP offload handles ChannelData messages only. The userspace TURN server remains responsible for all other functionality, from creating channel bindings to handling requests. The offload mechanism is activated after a successful channel binding, in the method `Allocation.AddChannelBind`. The userspace TURN server sends peer and client info (5-tuples and channel id) to the XDP program via an eBPF map. From that point on, the XDP program can detect channel data coming from the peer or from the client. When a channel binding is removed, the corresponding entries are deleted from the eBPF maps, and thus there is no offload for that channel anymore.
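To illustrate this handover, here is a rough sketch (in Go, using cilium/ebpf) of how the userspace side could install and remove a channel binding in the XDP maps. The struct layouts, map handles, and helper names are illustrative only; the real encoding must match the C structs used by the XDP program.

```go
// Illustrative sketch only: the struct layouts must match the C-side
// definitions used by the XDP program.
package offload

import "github.com/cilium/ebpf"

// fiveTuple identifies a UDP connection (hypothetical layout).
type fiveTuple struct {
	SrcIP, DstIP     uint32 // IPv4 addresses
	SrcPort, DstPort uint16
	Protocol         uint32 // IPPROTO_UDP
}

// fiveTupleWithChannel extends the 5-tuple with the TURN channel number.
type fiveTupleWithChannel struct {
	Tuple     fiveTuple
	ChannelID uint32
}

// addChannelBinding installs the peer<->client mapping so the XDP program can
// translate ChannelData traffic in both directions.
func addChannelBinding(downstream, upstream *ebpf.Map, peer, client fiveTuple, channelID uint32) error {
	// peer -> client direction: keyed by the peer 5-tuple.
	if err := downstream.Put(peer, fiveTupleWithChannel{Tuple: client, ChannelID: channelID}); err != nil {
		return err
	}
	// client -> peer direction: keyed by the client 5-tuple plus channel id.
	return upstream.Put(fiveTupleWithChannel{Tuple: client, ChannelID: channelID}, peer)
}

// removeChannelBinding deletes both entries when the channel binding is removed,
// which disables the offload for that channel.
func removeChannelBinding(downstream, upstream *ebpf.Map, peer, client fiveTuple, channelID uint32) error {
	if err := downstream.Delete(peer); err != nil {
		return err
	}
	return upstream.Delete(fiveTupleWithChannel{Tuple: client, ChannelID: channelID})
}
```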
Changes to pion/turn
New: We introduce a new internal `offload` package, which manages offload mechanisms. Currently, there are two implementations: `XDPOffload`, which uses XDP, and a `NullOffload` for testing purposes.
Changed: The kernel offload complicates lifecycle management, since the eBPF/XDP offload outlives TURN server objects. This calls for new public methods in package turn to manage the offload engine's lifetime: `InitOffload` starts the offload engine (e.g., loads the XDP program and creates the eBPF maps) and `ShutdownOffload` removes the offload engine. Note that these methods should be called by the application, as shown in the `server_test.go` benchmark.
After everything is set up, channel binding offload management happens in `Allocation.AddChannelBind` and `Allocation.DeleteChannelBind`, with no change in their usage.
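A rough sketch of the intended application-level usage follows; the `InitOffload`/`ShutdownOffload` names are the proposed methods, while the exact signatures, import path/version, and config values here are illustrative placeholders (see the `server_test.go` benchmark for the actual usage):

```go
// Rough usage sketch; InitOffload/ShutdownOffload are the proposed methods,
// all other signatures and config values are illustrative placeholders.
package main

import (
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"

	"github.com/pion/turn/v3"
)

func main() {
	// Start the offload engine before creating any server: it loads the XDP
	// program and creates the eBPF maps, and it outlives the server objects.
	if err := turn.InitOffload(); err != nil {
		log.Fatalf("offload init: %v", err)
	}
	defer turn.ShutdownOffload()

	udpListener, err := net.ListenPacket("udp4", "0.0.0.0:3478")
	if err != nil {
		log.Fatal(err)
	}

	server, err := turn.NewServer(turn.ServerConfig{
		Realm: "pion.ly",
		AuthHandler: func(username, realm string, srcAddr net.Addr) ([]byte, bool) {
			return turn.GenerateAuthKey(username, realm, "password"), true
		},
		PacketConnConfigs: []turn.PacketConnConfig{{
			PacketConn: udpListener,
			RelayAddressGenerator: &turn.RelayAddressGeneratorStatic{
				RelayAddress: net.ParseIP("127.0.0.1"),
				Address:      "0.0.0.0",
			},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer server.Close()

	// Channel bindings created on this server are offloaded automatically via
	// Allocation.AddChannelBind / DeleteChannelBind; nothing else is needed here.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	<-sigs
}
```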
eBPF/XDP details
The XDP part consists of a program that describes the packet processing logic to be executed when the network interface receives a packet. The XDP program uses eBPF maps to communicate with the userspace TURN server.
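For illustration, loading the compiled eBPF object (which creates the program and its maps) and attaching the program to an interface with cilium/ebpf could look roughly like the sketch below; the object path, program name, and function name are placeholders.

```go
// Illustrative sketch of loading the compiled XDP object and attaching it
// to a network interface with cilium/ebpf; all names are placeholders.
package offload

import (
	"net"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func attachXDP(objPath, ifName, progName string) (*ebpf.Collection, link.Link, error) {
	// Load the compiled eBPF object: this creates the programs and maps.
	coll, err := ebpf.LoadCollection(objPath)
	if err != nil {
		return nil, nil, err
	}

	iface, err := net.InterfaceByName(ifName)
	if err != nil {
		coll.Close()
		return nil, nil, err
	}

	// Attach the XDP program to the interface; from now on it sees every
	// packet arriving there and can redirect matching ChannelData traffic.
	lnk, err := link.AttachXDP(link.XDPOptions{
		Program:   coll.Programs[progName],
		Interface: iface.Index,
	})
	if err != nil {
		coll.Close()
		return nil, nil, err
	}
	return coll, lnk, nil
}
```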
Maps: The XDP offload uses the following maps to keep track of connections, store statistics, and to aid traffic redirects between interfaces:
name | key | value | function |
---|---|---|---|
`turn_server_downstream_map` | peer 5-tuple | client 5-tuple + channel-id | match peer -> client traffic |
`turn_server_upstream_map` | client 5-tuple + channel-id | peer 5-tuple | match client -> peer traffic |
`turn_server_stats_map` | 5-tuple + channel-id | stats (#pkts, #bytes) | traffic statistics per connection (5-tuple and channel-id) |
`turn_server_interface_ip_addresses_map` | interface index | IPv4 address | interface IP addresses for redirects |
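As an illustration of how userspace could consume these maps, the sketch below dumps `turn_server_stats_map` with cilium/ebpf. The map name comes from the table above, but the key/value layouts shown here are placeholders that must match the C-side definitions.

```go
// Illustrative sketch: dumping per-connection statistics from
// turn_server_stats_map. The struct layouts are placeholders.
package offload

import (
	"fmt"

	"github.com/cilium/ebpf"
)

// statsKey is the 5-tuple plus channel id used to index the stats map.
type statsKey struct {
	SrcIP, DstIP     uint32
	SrcPort, DstPort uint16
	Protocol         uint32
	ChannelID        uint32
}

// statsValue holds the per-connection counters maintained by the XDP program.
type statsValue struct {
	Pkts  uint64
	Bytes uint64
}

// dumpStats iterates the statistics map and prints one line per connection.
func dumpStats(statsMap *ebpf.Map) error {
	var (
		key statsKey
		val statsValue
	)
	iter := statsMap.Iterate()
	for iter.Next(&key, &val) {
		fmt.Printf("channel 0x%x: %d pkts, %d bytes\n", key.ChannelID, val.Pkts, val.Bytes)
	}
	return iter.Err()
}
```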
XDP Program: The XDP program receives all packets as they arrive at the network interface. It filters IPv4/UDP packets (caveat: VLAN and other tunneling options are not supported) and checks whether a packet belongs to any channel binding (i.e., it checks the 5-tuple and channel-id). If there is a match, the program does the ChannelData handling: it updates the 5-tuple, adds or removes the ChannelData header, keeps track of statistics, and finally redirects the packet to the corresponding network interface. Non-ChannelData packets are passed to the network stack for further processing (e.g., channel refresh messages and other STUN/TURN traffic go to the userspace TURN server).
Results
CPU profiling
Preliminary results are promising. CPU profiling with the benchmark (#298) shows that `server.ReadLoop()`, which took 47.9 sec before, runs for 0.96 sec with the XDP offload.
Flame graph w/o the offload:
Flame graph w/ XDP offload:
Microbenchmark with simple-server
Measurements with iperf, turncat (our in-house TURN proxy), and the simple-server example show an outstanding (150x!) delay reduction and a significant (6x) bandwidth boost.
Measurement setup
- iperf-client, turncat, simple-server, and iperf-server communicate via localhost
- measure steady-state (channel binding done) over 50 sec
- 3 pion/turn versions:
- single: single-threaded simple-server
- multi: 4-thread simple-multithreaded server (#295)
- xdp: simple server with XDP offload
- results based on 3 consecutive runs
Delay results
[ms] | simple | multi | xdp |
---|---|---|---|
avg | 3.944 | 4.311 | 0.033 |
min | 3.760 | 0.473 | 0.023 |
median | 3.914 | 4.571 | 0.027 |
max | 4.184 | 5.419 | 0.074 |
Bandwidth results
Note: iperf stalls at ~220k pps; we expect 1+ Mpps with a more powerful load generator.
[pps] | simple | multi | xdp |
---|---|---|---|
avg | 36493 | 96152 | 227378 |
min | 35241 | 91856 | 222567 |
median | 36617 | 96843 | 227783 |
max | 37545 | 99455 | 233559 |
Discussion
- XDP offload is straightforward for UDP connections but is cumbersome for TCP and TLS. Fortunately, the eBPF ecosystem provides other options: tc and sockmap are potential alternatives with a reasonable complexity-performance trade-off.
  - Still, we would need to coordinate the different offload mechanisms across the different connection types.
- In addition, offload mechanisms introduce new lifecycle management concerns: these mechanisms outlive TURN server objects.
- The eBPF objects need to be built and distributed, which makes the build process more complex.
  - New dependency: cilium/ebpf.
  - The build process gets more complex: eBPF objects are built via `go generate`; how should this be integrated with the current build process (e.g., by adding a Makefile)?
- Monitoring is not trivial due to the lifetime of XDP objects, and because in XDP connections are identified by 5-tuples, we lose the notion of 'listeners'.
  - Therefore, the current monitoring implementation is rudimentary: the bytes and packets sent over a 5-tuple are stored in a statistics eBPF map. We update the counters in the statistics map, but we never delete from it. There is no interface exposed for querying statistics (one can use `bpftool` to dump the map content).
- XDP limitations: `bpf_redirect()`, which handles packet redirects in eBPF/XDP, supports redirects to NIC egress queues only. This prevents supporting scenarios where clients exchange traffic in a server-local 'loop'.
  - We disabled the XDP offload for host-local redirects. We also had some weird issues with forwarding traffic between NICs using the `xdp` driver and NICs using the `xdpgeneric` driver (except for the `lo` interface).
  - A packet size limit is set in the XDP program to prevent fragmentation; currently the limit is 1480 bytes.
This is pretty magical! Great work :)
I am in support of adding this. I think people will find this useful
I added you to the repo @levaitamas
Unfortunately, these days I don't have much bandwidth to get involved. I would love to support you wherever I can. If you want me to add other developers so you can work together, I am happy to do that.
Thanks @Sean-der! I really appreciate you adding me to the repo, since that will definitely ease supporting the eBPF offload once it gets integrated. I would recommend adding @rg0now as well. He has a great understanding of the pion ecosystem and has already made impactful contributions (e.g., multi-threaded UDP support).
Done! That was a major oversight that @rg0now wasn’t in already :(
CC @stv0g
Great work! I am also in support of getting this in 👍🏻
I was going to make a PR that adds a user-configurable callback at these locations to allow plugging in external network accelerators, but I see you already did it. Thanks!
https://github.com/l7mp/turn/blob/4b776f2d67b2256552f8298f450b4b0640b17183/internal/allocation/allocation.go#L128 and https://github.com/l7mp/turn/blob/4b776f2d67b2256552f8298f450b4b0640b17183/internal/allocation/allocation.go#L171