Add multi-core concurrent packet processing
ZeroTier on multiple threads
This patch enables concurrent processing of packets in the RX and TX directions and appears to improve performance significantly on low-powered hardware such as ARM chips in routers, Raspberry Pis, etc.
This has only been implemented for Linux and FreeBSD.
Example usage (local.conf):
{
    "settings":
    {
        "multicoreEnabled": true,
        "concurrency": 4,
        "cpuPinningEnabled": false
    }
}
Awesome! Can you make it not compile on Mac, Windows, etc.? I know you know it doesn't work there, but it's worth testing the ifdefs. I made myself a branch with all the current PRs, and this one makes that branch not work on my Mac (obviously). I'm not sure if it's feasible to make it a local.conf setting so we can get the code in without enabling it by default, but that would be cool IMO.
I wonder if this could improve performance on smaller CPUs like the ones in commercial NASes.
Update: Packet re-ordering seemed to be an issue when a single TCP stream was being received by a large number of high-performance cores, so I made the following changes, which I believe are a good compromise for the time being:
This latest commit does not enable multicore by default; it can be enabled with ZT_ENABLE_MULTICORE=1.
When enabled, it will only use 2 cores if at least 4 logical cores are available. No matter how many cores beyond that are present, it will only use 2. To override this, set ZT_CONCURRENCY=N.
To experiment with core pinning you can use ZT_CORE_PINNING=1, but this is most likely a bad idea, so try it last.
Suggested default usage:
sudo ZT_ENABLE_MULTICORE=1 ./zerotier-one
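The selection rules described above can be sketched roughly as follows. This is an illustration in Python, not the patch's actual code: the helper name `resolve_concurrency` is hypothetical, and the fallback to a single thread on machines with fewer than 4 logical cores is an assumption, since the comment does not spell out that case.

```python
import os


def resolve_concurrency(env, logical_cores):
    """Return how many packet-processing threads to use (hypothetical helper)."""
    # Multicore stays off unless it is explicitly enabled.
    if env.get("ZT_ENABLE_MULTICORE") != "1":
        return 1
    # An explicit ZT_CONCURRENCY=N overrides the default.
    override = env.get("ZT_CONCURRENCY", "")
    if override.isdigit() and int(override) >= 1:
        return int(override)
    # Default: 2 threads when at least 4 logical cores exist, however many
    # more there are. Assumption: smaller machines stay single-threaded.
    return 2 if logical_cores >= 4 else 1


if __name__ == "__main__":
    print(resolve_concurrency(os.environ, os.cpu_count() or 1))
```

Under these assumptions, a 64-core machine with only ZT_ENABLE_MULTICORE=1 set would still get 2 threads, which matches the stated goal of limiting re-ordering on large machines.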
I am interested in hearing how this performs for people.
Thanks.
> I wonder if this could improve performance on smaller CPUs like the ones in commercial NASes.
Yes, exactly. This is where I'm seeing the best gains in my testing.
~~@joseph-henry sorry for the dumb question: is the Dockerfile.ci enough to quickly test this?~~
EDIT: I forgot that the NAS images do build from source.
> I am interested in hearing how this performs for people.
I'm not entirely sure I'm building it right, but I just ran some tests related to video workflows using Blackmagic Disk Speed Test. The source was a Synology DS1522+ (ZeroTier built from source, no other containers or connections active). The destination was a Win11 + Ryzen 5800X machine (ZeroTier 1.14 stable) over a public network.
The connection should reach 100 Mbit/s from Win to Synology and 900 Mbit/s from Synology to Win. All tests are from the Win machine's perspective.
- With `ZT_ENABLE_MULTICORE=0`: upload is 100 Mbit/s; download is ~410 Mbit/s
- With `ZT_ENABLE_MULTICORE=1` and `ZT_CONCURRENCY=2`: upload is ~80 Mbit/s; download is ~290 Mbit/s
- With `ZT_ENABLE_MULTICORE=1` and `ZT_CONCURRENCY=4`: upload is ~50 Mbit/s; download is ~260 Mbit/s
`ZT_CORE_PINNING=1` didn't make a difference, but I've also noticed that the speed is quite inconsistent during uploads.
P.S.: container running from this Docker Hub image (tag `multicore-64634c9`) built with this dockerfile.
I get the same result with a 50 Mb/s upload connection:
- With ZT_ENABLE_MULTICORE=0: upload is 5 MB/s
- With ZT_ENABLE_MULTICORE=1 and ZT_CONCURRENCY=2: 4.4 MB/s
- With ZT_ENABLE_MULTICORE=1 and ZT_CONCURRENCY=4: around 4.0 MB/s
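To make the reported numbers easier to compare, here is a small Python sketch that expresses each multicore result as a percentage change from the single-threaded baseline. The figures are copied from the reports above; the helper name `percent_change` and the dictionary labels are just for illustration.

```python
def percent_change(baseline, measured):
    """Relative change from baseline, in percent, rounded to one decimal."""
    return round((measured - baseline) / baseline * 100, 1)


# Throughput figures as reported in the thread (baseline listed first).
reports = {
    "Synology upload (Mbit/s)": {"off": 100, "concurrency=2": 80, "concurrency=4": 50},
    "Synology download (Mbit/s)": {"off": 410, "concurrency=2": 290, "concurrency=4": 260},
    "50 Mb/s link upload (MB/s)": {"off": 5.0, "concurrency=2": 4.4, "concurrency=4": 4.0},
}

for name, results in reports.items():
    baseline = results["off"]
    for setting, value in results.items():
        if setting != "off":
            print(f"{name}, {setting}: {percent_change(baseline, value)}%")
```

In every reported case the change is negative, i.e. multicore was slower than the baseline on these setups.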
Thanks for your results everybody. It's still a work in progress.
Some updates:
- Packets are sorted by flow to prevent re-ordering (though this doesn't seem to be a full solution)
- Configuration is now done via `local.conf`, not environment variables
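The idea behind sorting packets by flow can be sketched like this: if every packet of a given flow is handed to the same worker, packets within that flow are processed in arrival order even though different flows run in parallel. This Python sketch is only an illustration of the general technique. How the patch actually identifies a flow is not stated here, so the 5-tuple `Flow` type, the CRC32 hash, and the function names are all assumptions.

```python
import zlib
from collections import namedtuple

# Hypothetical flow key: the classic 5-tuple.
Flow = namedtuple("Flow", "src_ip dst_ip src_port dst_port proto")


def worker_for(flow, concurrency):
    """Map a flow to a worker index; the same flow always maps to the same index."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % concurrency


def dispatch(packets, concurrency):
    """Distribute (flow, payload) pairs into per-worker queues, keeping arrival order."""
    queues = [[] for _ in range(concurrency)]
    for flow, payload in packets:
        queues[worker_for(flow, concurrency)].append((flow, payload))
    return queues


if __name__ == "__main__":
    a = Flow("10.0.0.1", "10.0.0.2", 5000, 443, "tcp")
    b = Flow("10.0.0.3", "10.0.0.2", 5001, 443, "tcp")
    packets = [(a, 1), (b, 1), (a, 2), (b, 2), (a, 3)]
    for index, queue in enumerate(dispatch(packets, 4)):
        print(index, [(f.src_ip, seq) for f, seq in queue])
```

A limitation of this scheme is that a single heavy flow still lands on one worker, so it cannot be sped up by adding threads.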
Example config:
{
    "settings":
    {
        "multicoreEnabled": true,
        "concurrency": 4,
        "cpuPinningEnabled": false
    }
}
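For anyone scripting around this, a minimal Python sketch of reading these three keys from a `local.conf` document might look like the following. It is not how ZeroTier itself parses the file. The helper name `load_multicore_settings` and the fallback values in `DEFAULTS` are assumptions made for illustration; only the key names come from the example config.

```python
import json

# Assumed fallbacks for illustration; the real defaults may differ.
DEFAULTS = {
    "multicoreEnabled": False,
    "concurrency": 1,
    "cpuPinningEnabled": False,
}

EXAMPLE = """
{
    "settings":
    {
        "multicoreEnabled": true,
        "concurrency": 4,
        "cpuPinningEnabled": false
    }
}
"""


def load_multicore_settings(text):
    """Extract the multicore-related keys from local.conf JSON text."""
    settings = json.loads(text).get("settings", {})
    return {key: settings.get(key, default) for key, default in DEFAULTS.items()}


if __name__ == "__main__":
    print(load_multicore_settings(EXAMPLE))
```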
More updates to come.
