ZeroTierOne

Add multi-core concurrent packet processing

Open joseph-henry opened this issue 1 year ago • 8 comments

ZeroTier on multiple threads

This patch enables concurrent processing of packets in the RX and TX directions and appears to improve performance significantly on low-powered hardware such as ARM chips in routers, Raspberry Pis, etc.

This has only been implemented for Linux and FreeBSD.

Example usage (local.conf):

{
   "settings":
   {
       "multicoreEnabled": true,
       "concurrency": 4,
       "cpuPinningEnabled": false
   }
}

joseph-henry, Feb 23 '24 17:02

Awesome! Can you make it not compile on Mac, Windows, etc.? I know you know it doesn't work there, but it's worth testing the ifdefs. I made myself a branch with all the current PRs, and this one makes that branch not work on my Mac (obviously). I'm not sure if it's feasible to make it a local.conf setting, so we can get the code in but not enabled by default, but that would be cool IMO.

laduke, Feb 29 '24 19:02

I wonder if this could improve performance on smaller cpus like the ones in commercial NASs

sandros94, May 11 '24 02:05

Update: Packet re-ordering seemed to be an issue in situations where a single TCP stream was being received by a large number of high-performance cores, so the following changes were made, which I believe are a good compromise for the time being:

This latest commit does not enable multicore by default; it can be enabled with ZT_ENABLE_MULTICORE=1.

When enabled it will only use 2 cores if at least 4 logical cores are available. No matter how many cores beyond that are present it will only use 2. To override this you can set ZT_CONCURRENCY=N.
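
The selection rule described above can be sketched roughly as follows. This is only an illustration of the stated behavior, not the code in the patch: the function name is hypothetical, and two details are assumptions the comment does not spell out (that machines with fewer than 4 logical cores stay single-threaded, and that ZT_CONCURRENCY is honored regardless of core count).

```python
import os


def resolve_concurrency(env, logical_cores):
    """Return how many packet-processing threads to use (illustrative only)."""
    # Multicore is off unless explicitly enabled.
    if env.get("ZT_ENABLE_MULTICORE") != "1":
        return 1

    # An explicit ZT_CONCURRENCY=N overrides the default.
    # Assumption: the override applies regardless of the core count.
    override = env.get("ZT_CONCURRENCY")
    if override is not None:
        return max(1, int(override))

    # Stated default: 2 threads when at least 4 logical cores exist,
    # no matter how many more are present.
    # Assumption: with fewer than 4 cores, stay single-threaded.
    return 2 if logical_cores >= 4 else 1


# Example: what this machine would pick with the current environment.
threads = resolve_concurrency(os.environ, os.cpu_count() or 1)
```
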

To experiment with core pinning you can use ZT_CORE_PINNING=1, but this is most likely a bad idea, so try it last.
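
For anyone unfamiliar with the term, core pinning means restricting each worker thread to a single logical core. The sketch below shows the general idea using Python's `os.sched_setaffinity` (Linux only). It is not how the patch implements pinning, and `pin_workers` is a hypothetical name.

```python
import os
import threading


def pin_workers(n_workers):
    """Start n_workers threads, pin each to one allowed core, return {worker: cpus}."""
    if not hasattr(os, "sched_setaffinity"):
        return {}  # affinity calls are not available on this platform

    allowed = sorted(os.sched_getaffinity(0))  # cores this process may use
    pinned = {}

    def worker(index):
        cpu = allowed[index % len(allowed)]
        try:
            # On Linux, pid 0 targets the calling thread only.
            os.sched_setaffinity(0, {cpu})
        except OSError:
            return  # pinning not permitted here; leave the thread unpinned
        pinned[index] = os.sched_getaffinity(0)
        # A real worker would run its packet-processing loop here.

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pinned
```

Pinning removes the scheduler's freedom to move a busy thread to an idle core, which is one reason it can hurt rather than help.
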

Suggested default usage:

sudo ZT_ENABLE_MULTICORE=1 ./zerotier-one

I am interested in hearing how this performs for people.

Thanks.

joseph-henry, Jul 03 '24 16:07

> I wonder if this could improve performance on smaller cpus like the ones in commercial NASs

Yes, exactly. This is where I'm seeing the best gains in my testing.

joseph-henry, Jul 03 '24 16:07

~~@joseph-henry sorry for the dumb question: is the Dockerfile.ci enough to quickly test this?~~

EDIT: I forgot about the fact that the NAS images do build from source.

sandros94, Jul 04 '24 10:07

> I am interested in hearing how this performs for people.

I'm not entirely sure if I'm building it right, but I just did some tests, in particular related to video workflows, using Blackmagic Disk Speed Test. Source was a Synology DS1522+ (ZeroTier built from source, no other containers or connections active). Destination was a Win11 + Ryzen 5800X machine (ZeroTier 1.14 stable) over a public network.

The connection should reach 100 Mbit/s from Win to Synology and 900 Mbit/s from Synology to Win. All tests are from the Win machine's perspective.

  1. With ZT_ENABLE_MULTICORE=0: upload is 100 Mbit/s; download is ~410 Mbit/s
  2. With ZT_ENABLE_MULTICORE=1 and ZT_CONCURRENCY=2: upload is ~80 Mbit/s; download is ~290 Mbit/s
  3. With ZT_ENABLE_MULTICORE=1 and ZT_CONCURRENCY=4: upload is ~50 Mbit/s; download is ~260 Mbit/s
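
To put the three runs in relative terms, the change against the ZT_ENABLE_MULTICORE=0 baseline can be computed directly from the figures listed. A small sketch (variable and function names are illustrative; the numbers are the approximate ones above):

```python
# Throughput in Mbit/s, as listed in the three runs above.
baseline = {"upload": 100, "download": 410}
runs = {
    "ZT_CONCURRENCY=2": {"upload": 80, "download": 290},
    "ZT_CONCURRENCY=4": {"upload": 50, "download": 260},
}


def percent_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100


def summarize():
    """Return {run: {direction: percent change vs. baseline, rounded to 0.1}}."""
    return {
        name: {
            direction: round(percent_change(value, baseline[direction]), 1)
            for direction, value in result.items()
        }
        for name, result in runs.items()
    }


summary = summarize()
for name, changes in summary.items():
    print(name, changes)
```

With these numbers, download drops by roughly 29% with 2 threads and roughly 37% with 4, and upload drops by 20% and 50%.
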

ZT_CORE_PINNING=1 didn't make a difference, but I also noticed during uploads that the speed is quite inconsistent. [image]

P.S: container running from this Docker Hub image (tag multicore-64634c9) built with this dockerfile.

sandros94, Jul 04 '24 19:07

I get the same result with a 50 Mb/s upload connection:

  1. With ZT_ENABLE_MULTICORE=0: upload is 5 MB/s
  2. With ZT_ENABLE_MULTICORE=1 and ZT_CONCURRENCY=2: 4.4 MB/s
  3. With ZT_ENABLE_MULTICORE=1 and ZT_CONCURRENCY=4: around 4.0 MB/s

TommyKing, Aug 11 '24 04:08

Thanks for your results everybody. It's still a work in progress.

Some updates:

  • Packets are sorted by flow to prevent re-ordering (though this doesn't seem to be a full solution)
  • Configuration is now done via local.conf, not environment variables
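
To illustrate what sorting by flow means here: one common approach is to hash each packet's flow identifier and always hand packets with the same hash to the same worker, so a single TCP stream is never split across threads. The sketch below is a generic illustration of that idea, not the code in this patch; `Packet`, `flow_key`, `worker_for`, and `dispatch` are hypothetical names.

```python
import zlib
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    seq: int  # arrival order within the flow, used only to check ordering


def flow_key(pkt):
    """Identify the flow a packet belongs to (a 5-tuple in this sketch)."""
    return (pkt.src, pkt.dst, pkt.sport, pkt.dport, pkt.proto)


def worker_for(pkt, n_workers):
    """Map a packet to a worker index that is stable for its whole flow."""
    key = "|".join(str(part) for part in flow_key(pkt)).encode()
    return zlib.crc32(key) % n_workers


def dispatch(packets, n_workers):
    """Append each packet to its worker's queue, preserving arrival order."""
    queues = defaultdict(list)
    for pkt in packets:
        queues[worker_for(pkt, n_workers)].append(pkt)
    return queues
```

Because each queue is drained by one worker in FIFO order, packets within a flow keep their relative order. The trade-off is that a single heavy flow can only ever use one core.
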

Example config:

{
   "settings":
   {
       "multicoreEnabled": true,
       "concurrency": 4,
       "cpuPinningEnabled": false
   }
}
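
If local.conf already contains other settings, the three keys need to be merged in rather than overwriting the file. A minimal sketch of doing that with a script, assuming the file is plain JSON as in the example; the helper names and the path passed in are up to the caller and are not part of ZeroTier:

```python
import json
from pathlib import Path


def merge_multicore_settings(conf, concurrency=4, cpu_pinning=False):
    """Return a copy of conf with the multicore keys set under "settings"."""
    merged = dict(conf)
    settings = dict(merged.get("settings", {}))
    settings.update(
        {
            "multicoreEnabled": True,
            "concurrency": concurrency,
            "cpuPinningEnabled": cpu_pinning,
        }
    )
    merged["settings"] = settings
    return merged


def update_local_conf(conf_path, **kwargs):
    """Read conf_path if it exists, merge the keys, and write it back."""
    path = Path(conf_path)
    current = json.loads(path.read_text()) if path.exists() else {}
    updated = merge_multicore_settings(current, **kwargs)
    path.write_text(json.dumps(updated, indent=2) + "\n")
    return updated
```

The service would presumably need a restart to pick up the change.
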

More updates to come.

joseph-henry, Aug 20 '24 20:08