
Performance regression after switching to tokio-0.2

Open cg31 opened this issue 4 years ago • 40 comments

I used go-shadowsocks2 as a tunnel to run a test, and the result doesn't look great...

I was using a release build of the shadowsocks-rust-async branch in this test.

This is what I did:

  1. start shadowsocks-rust server: ssserver -s 127.0.0.1:8488 -m chacha20-ietf -k password

  2. start go-shadowsocks2 as tunnel: go-shadowsocks2 -verbose -c ss://chacha20-ietf:password@localhost:8488 -tcptun :1090=127.0.0.1:5201

  3. start iperf server: iperf -s -p 5201

  4. run iperf as client: iperf -c localhost -p 1090 -n 10G

The result was: [388] 0.0-130.7 sec 10.0 GBytes 657 Mbits/sec

If I replaced step 1 with go-shadowsocks2 as server: go-shadowsocks2 -verbose -s ss://chacha20-ietf:password@:8488

The result was: [384] 0.0-27.4 sec 10.0 GBytes 3.13 Gbits/sec

cg31 avatar Nov 29 '19 16:11 cg31

I have seen many complaints about performance regressions after migrating to tokio v0.2.

Let's first find out whether there is anything that could be optimized in this project.

zonyitoo avatar Nov 29 '19 16:11 zonyitoo

I just tried the master branch; it looks like it also needs some optimization: [384] 0.0-89.2 sec 10.0 GBytes 963 Mbits/sec

This is where a tunnel is useful.

For the record, I am using stable-x86_64-pc-windows-msvc, rustc 1.39.0.

cg31 avatar Nov 29 '19 16:11 cg31

The master branch is using futures v0.1, which requires lots of future combinator transformations; that should be slower.
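
For illustration, here is a hedged sketch of that combinator style (the echo_once helper and exact crate setup are illustrative, not code from this repo):

// futures v0.1 style: every and_then/map wraps the previous future in a
// new adapter type, and returning it from a function usually means boxing.
use futures::Future; // futures = "0.1"
use std::io;
use tokio_io::io::{read_exact, write_all}; // tokio-io = "0.1"
use tokio_tcp::TcpStream; // tokio-tcp = "0.1"

fn echo_once(sock: TcpStream) -> Box<dyn Future<Item = (), Error = io::Error> + Send> {
    Box::new(
        read_exact(sock, vec![0u8; 4096])
            .and_then(|(sock, buf)| write_all(sock, buf))
            .map(|_| ()),
    )
}

With async/await the same logic compiles into a single state machine with no per-step boxing, which is why the migration was expected to help rather than hurt.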

zonyitoo avatar Nov 29 '19 17:11 zonyitoo

I didn't see any obvious performance problem on the feature-migrate-async-await branch. Any ideas?

zonyitoo avatar Nov 30 '19 14:11 zonyitoo

It is hard to spot the bottleneck just by looking at the code. Profiling tools are needed to find the hot spots.

cg31 avatar Nov 30 '19 15:11 cg31

My test with the latest commit (with tokio v0.2.2):

System

Darwin 19.0.0 Darwin Kernel Version 19.0.0: Thu Oct 17 16:17:15 PDT 2019; root:xnu-6153.41.3~29/RELEASE_X86_64 x86_64

TEST-1:

iperf -> sstunnel (rust) -> go-shadowsocks2 -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  2.92 GBytes  2.50 Gbits/sec

TEST-2:

iperf -> sstunnel (rust) -> ssserver (rust) -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  3.60 GBytes  3.09 Gbits/sec

Well, it is actually a lot faster than Go's version. Right?

EDIT:

  1. I was using the aes-256-gcm method.
  2. sstunnel and ssserver were built without the single-threaded feature (so they run with the multi-threaded scheduler).
  3. Built without the trust-dns feature (it is still broken anyway), so I am using tokio's default DNS resolver.

zonyitoo avatar Nov 30 '19 16:11 zonyitoo

Maybe it is related to the OS?

Could you pass -n 10G to iperf, so we can compare with the result on Windows?

cg31 avatar Nov 30 '19 16:11 cg31

I have already used -n 10G with iperf.

iperf -c localhost -p 1300 -n 10G

------------------------------------------------------------
Client connecting to localhost, TCP port 1300
TCP window size:  336 KByte (default)
------------------------------------------------------------
[  5] local 127.0.0.1 port 64505 connected with 127.0.0.1 port 1300
write failed: Broken pipe
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  3.59 GBytes  3.08 Gbits/sec

Maybe the problem is write failed: Broken pipe.

zonyitoo avatar Nov 30 '19 16:11 zonyitoo

TEST-1:

iperf -> sstunnel (rust) -> go-shadowsocks2 -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-11.6 sec  10.0 GBytes  7.39 Gbits/sec

TEST-2:

iperf -> sstunnel (rust) -> ssserver (rust) -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-12.6 sec  10.0 GBytes  6.80 Gbits/sec

zonyitoo avatar Nov 30 '19 17:11 zonyitoo

Another test pairs:

TEST-1:

iperf -> sstunnel (rust) -> go-shadowsocks2 -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-37.5 sec  10.0 GBytes  2.29 Gbits/sec

TEST-2:

iperf -> sstunnel (rust) -> ssserver (rust) -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-29.2 sec  10.0 GBytes  2.95 Gbits/sec

The results are not stable, but overall ssserver is faster than go-shadowsocks2.

zonyitoo avatar Nov 30 '19 17:11 zonyitoo

If we get good performance on Linux as well, it could be a Windows-specific issue, maybe in the Rust compiler or in the libraries.

cg31 avatar Dec 01 '19 02:12 cg31

I just tried the Rust and Go versions on my Linux box, a 56-core ThinkStation running Ubuntu 18.04.

The Rust version uses the plain method and the Go version uses the dummy method (both skip encryption).

Rust version gets: [ 3] 0.0-14.3 sec 10.0 GBytes 6.01 Gbits/sec

Go version is incredible: [ 3] 0.0- 3.9 sec 10.0 GBytes 22.1 Gbits/sec

cg31 avatar Dec 02 '19 02:12 cg31

Hmm, nearly 4 times.

In this test, I don't think the number of CPU cores matters, because there is only one connection. So I think the key is the task scheduler's performance.

Or maybe Rust's scheduler performs worse than Go's in a multicore environment? Could you test it with Rust's single-threaded feature and Go's GOMAXPROCS=1? (I think the results should be the same.)
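
For reference, a minimal sketch of forcing tokio 0.2 onto a single thread (roughly what the single-threaded feature selects; the relay body is elided):

use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    // tokio 0.2: basic_scheduler() runs all tasks on the current thread,
    // the closest equivalent of running the Go version with GOMAXPROCS=1.
    let mut rt = Builder::new()
        .basic_scheduler()
        .enable_all() // I/O and timer drivers
        .build()?;
    rt.block_on(async {
        // ... start the relay here ...
    });
    Ok(())
}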

zonyitoo avatar Dec 02 '19 05:12 zonyitoo

When I tested the proxy example: https://github.com/tokio-rs/tokio/blob/master/examples/proxy.rs

with these steps: ./proxy 127.0.0.1:1090 127.0.0.1:5201 & iperf -s -p 5201 & iperf -c 127.0.0.1 -p 1090 -n 10G

I got: [ 3] 0.0-13.4 sec 10.0 GBytes 6.40 Gbits/sec

It is similar to ss-rust, which means the bottleneck is not in ss-rust itself but in tokio.
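
For context, the hot path of that example is roughly the following (simplified from the linked proxy.rs, tokio 0.2 API): two unbuffered tokio::io::copy calls running concurrently.

use tokio::io::copy;
use tokio::net::TcpStream;

// Split both connections and copy in each direction at the same time,
// with no extra buffering layer on top of the sockets.
async fn relay(mut inbound: TcpStream, mut outbound: TcpStream) -> std::io::Result<()> {
    let (mut ri, mut wi) = inbound.split();
    let (mut ro, mut wo) = outbound.split();
    futures::try_join!(copy(&mut ri, &mut wo), copy(&mut ro, &mut wi))?;
    Ok(())
}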

cg31 avatar Dec 02 '19 05:12 cg31

I created a repo with go and rust tests to benchmark the relay speeds https://github.com/cg31/PerfectRelay

cg31 avatar Dec 02 '19 07:12 cg31

I tried using std::net to relay the data, and it looks a lot faster: https://github.com/cg31/PerfectRelay/tree/master/rust_std
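
For comparison, a hedged sketch of that blocking approach (addresses and buffer size are illustrative, not the repo's exact code): one OS thread per direction, plain std::net sockets, no async machinery.

use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;

// Copy bytes from one socket to the other until EOF or an error.
fn pipe(mut from: TcpStream, mut to: TcpStream) {
    let mut buf = [0u8; 8192];
    while let Ok(n) = from.read(&mut buf) {
        if n == 0 || to.write_all(&buf[..n]).is_err() {
            break;
        }
    }
    let _ = to.shutdown(Shutdown::Write);
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:1090")?; // tunnel entry
    for inbound in listener.incoming().flatten() {
        let outbound = TcpStream::connect("127.0.0.1:5201")?; // iperf server
        let (i2, o2) = (inbound.try_clone()?, outbound.try_clone()?);
        thread::spawn(move || pipe(inbound, outbound));
        thread::spawn(move || pipe(o2, i2));
    }
    Ok(())
}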

cg31 avatar Dec 02 '19 12:12 cg31

I tried using std::net to relay the data, and it looks a lot faster: https://github.com/cg31/PerfectRelay/tree/master/rust_std

That means the problem is in tokio's scheduler.

I am curious about the result of async-std.

zonyitoo avatar Dec 02 '19 13:12 zonyitoo

async-std is faster than tokio, but still a lot slower than std: https://github.com/cg31/PerfectRelay/tree/master/rust_async_std

On Windows 10, i7 machine:

tokio: [264] 0.0-78.5 sec 10.0 GBytes 1.09 Gbits/sec

async-std: [268] 0.0-29.9 sec 10.0 GBytes 2.87 Gbits/sec

std: [272] 0.0- 9.3 sec 10.0 GBytes 9.24 Gbits/sec

cg31 avatar Dec 02 '19 14:12 cg31

Adding BufReader and BufWriter will help. BufReader can be added directly:

let mut ri = tokio::io::BufReader::new(ri);
let mut ro = tokio::io::BufReader::new(ro);

BufWriter needs to work together with poll_flush (the buffered bytes must be flushed explicitly).
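
Putting the two together, a hedged sketch of a fully buffered relay (tokio 0.2 API; the explicit flush calls are where poll_flush comes in, so bytes parked in the BufWriter are not lost):

use tokio::io::{copy, AsyncWriteExt, BufReader, BufWriter};
use tokio::net::TcpStream;

async fn relay_buffered(mut inbound: TcpStream, mut outbound: TcpStream) -> std::io::Result<()> {
    let (ri, wi) = inbound.split();
    let (ro, wo) = outbound.split();
    // Buffer both read halves and both write halves.
    let (mut ri, mut ro) = (BufReader::new(ri), BufReader::new(ro));
    let (mut wi, mut wo) = (BufWriter::new(wi), BufWriter::new(wo));
    let a = async {
        copy(&mut ri, &mut wo).await?;
        wo.flush().await // push out whatever is still sitting in the BufWriter
    };
    let b = async {
        copy(&mut ro, &mut wi).await?;
        wi.flush().await
    };
    futures::try_join!(a, b)?;
    Ok(())
}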

zyctree avatar Dec 05 '19 14:12 zyctree

Here is my result with https://github.com/cg31/PerfectRelay/tree/master/rust_async_std

# without BufReader
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-11.8 sec  10.0 GBytes  7.26 Gbits/sec

# with BufReader
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 9.2 sec  10.0 GBytes  9.34 Gbits/sec

zyctree avatar Dec 05 '19 14:12 zyctree

Readers are already built with buffers (for instance, aead.rs).

zonyitoo avatar Dec 05 '19 14:12 zonyitoo

A BufReader and also a BufWriter did the trick.

Without BufReader and BufWriter (https://github.com/cg31/PerfectRelay/tree/master/rust_tokio): [384] 0.0-101.1 sec 10.0 GBytes 850 Mbits/sec

Only using BufReader: [384] 0.0-93.3 sec 10.0 GBytes 921 Mbits/sec

Using both BufReader and BufWriter (https://github.com/cg31/PerfectRelay/tree/master/rust_tokio_buf): [384] 0.0-30.5 sec 10.0 GBytes 2.82 Gbits/sec

cg31 avatar Dec 05 '19 16:12 cg31

I ran this test with the latest release vs go-shadowsocks2 under Debian stable with chacha20-ietf-poly1305; here are the results.

shadowsocks-rust

[  3]  0.0-21.3 sec  10.0 GBytes  4.04 Gbits/sec
[  3]  0.0-20.9 sec  10.0 GBytes  4.12 Gbits/sec
[  3]  0.0-21.0 sec  10.0 GBytes  4.09 Gbits/sec

go-shadowsocks2

[  3]  0.0-12.4 sec  10.0 GBytes  6.92 Gbits/sec
[  3]  0.0-14.0 sec  10.0 GBytes  6.14 Gbits/sec
[  3]  0.0-13.6 sec  10.0 GBytes  6.29 Gbits/sec

kklem0 avatar Jun 08 '20 06:06 kklem0

Here's something interesting for RPi users.

I tested shadowsocks-rust and go-shadowsocks2 on an RPi 4.

shadowsocks-rust server + go-shadowsocks2 client

[  3]  0.0-283.5 sec  5.60 GBytes   170 Mbits/sec
[  3]  0.0-50.5 sec  1.00 GBytes   170 Mbits/sec

go-shadowsocks2 server + go-shadowsocks2 client

[  3]  0.0-52.0 sec  1.00 GBytes   165 Mbits/sec
[  3]  0.0-51.8 sec  1.00 GBytes   166 Mbits/sec

shadowsocks-rust server + shadowsocks-rust client

[  3]  0.0-13.4 sec  1.00 GBytes   641 Mbits/sec
[  3]  0.0-13.8 sec  1.00 GBytes   625 Mbits/sec
[  3]  0.0-13.5 sec  1.00 GBytes   635 Mbits/sec

kklem0 avatar Jun 08 '20 06:06 kklem0

@zonyitoo shall we keep this issue open and pin it for discussing performance and for people to post benchmark results?

kklem0 avatar Jun 08 '20 10:06 kklem0

Sure.

zonyitoo avatar Jun 08 '20 15:06 zonyitoo

So your data shows: on the x86_64 platform, ss-rust (~4 Gbits/sec) is slower than go-shadowsocks2 (~6 Gbits/sec), but on the RPi (ARM), ss-rust is faster.

zonyitoo avatar Jun 08 '20 15:06 zonyitoo

We can first enable some general optimization options, such as using jemallocator and tokio's parking_lot feature.
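
For instance, a hedged sketch of opting into jemalloc via the jemallocator crate (assuming it is added as a dependency; tokio's parking_lot feature is just a Cargo feature flag, with no code change):

// Route all heap allocations through jemalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn main() {
    // Every allocation below now goes through jemalloc.
    let buf: Vec<u8> = Vec::with_capacity(1 << 20);
    drop(buf);
}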

quininer avatar Jun 08 '20 15:06 quininer

jemalloc may increase binary size, which is not good for embedded environments.

zonyitoo avatar Jun 09 '20 08:06 zonyitoo

We don't need to enable it by default, and at least on desktops and servers we don't need to worry about binary size.

quininer avatar Jun 09 '20 08:06 quininer