
Performance regression after switching to tokio-0.2

Open cg31 opened this issue 4 years ago • 40 comments

I used go-shadowsocks2 as a tunnel to run a test, and the result doesn't look great...

I was using a release build of the shadowsocks-rust-async branch in this test.

This is what I did:

  1. start shadowsocks-rust server: ssserver -s 127.0.0.1:8488 -m chacha20-ietf -k password

  2. start go-shadowsocks2 as tunnel: go-shadowsocks2 -verbose -c ss://chacha20-ietf:password@localhost:8488 -tcptun :1090=127.0.0.1:5201

  3. start iperf server: iperf -s -p 5201

  4. run iperf as client: iperf -c localhost -p 1090 -n 10G

The result was: [388] 0.0-130.7 sec 10.0 GBytes 657 Mbits/sec

If I replaced step 1 with go-shadowsocks2 as server: go-shadowsocks2 -verbose -s ss://chacha20-ietf:password@:8488

The result was: [384] 0.0-27.4 sec 10.0 GBytes 3.13 Gbits/sec

cg31 avatar Nov 29 '19 16:11 cg31

I have seen many complaints about performance regressions after migrating to tokio v0.2.

Let's first find out whether there is anything that could be optimized in this project.

zonyitoo avatar Nov 29 '19 16:11 zonyitoo

I just tried the master branch; it looks like it also needs some optimization: [384] 0.0-89.2 sec 10.0 GBytes 963 Mbits/sec

This is where a tunnel is useful.

For the record, I am using stable-x86_64-pc-windows-msvc, rustc 1.39.0.

cg31 avatar Nov 29 '19 16:11 cg31

The master branch is using futures v0.1, which requires lots of future combinator transformations; that should be slower.
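
For illustration, here is a hedged sketch of that combinator style (the echo_once helper and exact crate setup are illustrative, not code from this repo):

// futures v0.1 style: every and_then/map wraps the previous future in a
// new adapter type, and returning it from a function usually means boxing.
use futures::Future; // futures = "0.1"
use std::io;
use tokio_io::io::{read_exact, write_all}; // tokio-io = "0.1"
use tokio_tcp::TcpStream; // tokio-tcp = "0.1"

fn echo_once(sock: TcpStream) -> Box<dyn Future<Item = (), Error = io::Error> + Send> {
    Box::new(
        read_exact(sock, vec![0u8; 4096])
            .and_then(|(sock, buf)| write_all(sock, buf))
            .map(|_| ()),
    )
}

With async/await the same logic compiles into a single state machine with no per-step boxing, which is why the migration was expected to help rather than hurt.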

zonyitoo avatar Nov 29 '19 17:11 zonyitoo

I didn't see any obvious performance problem on the feature-migrate-async-await branch. Any ideas?

zonyitoo avatar Nov 30 '19 14:11 zonyitoo

It is hard to spot the bottleneck just by looking at the code. Profiling tools are needed to find the hot spots.

cg31 avatar Nov 30 '19 15:11 cg31

My test with the latest commit (with tokio v0.2.2):

System

Darwin 19.0.0 Darwin Kernel Version 19.0.0: Thu Oct 17 16:17:15 PDT 2019; root:xnu-6153.41.3~29/RELEASE_X86_64 x86_64

TEST-1:

iperf -> sstunnel (rust) -> go-shadowsocks2 -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  2.92 GBytes  2.50 Gbits/sec

TEST-2:

iperf -> sstunnel (rust) -> ssserver (rust) -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  3.60 GBytes  3.09 Gbits/sec

Well, it is actually a lot faster than Go's version. Right?

EDIT:

  1. I was using the aes-256-gcm method.
  2. sstunnel and ssserver were built without the single-threaded feature (so they run with the multi-threaded scheduler).
  3. Built without the trust-dns feature (it is still broken anyway), so I am using tokio's default DNS resolver.

zonyitoo avatar Nov 30 '19 16:11 zonyitoo

Maybe it is related to the OS?

Could you pass -n 10G to iperf, so we can compare with the result on Windows?

cg31 avatar Nov 30 '19 16:11 cg31

I have already used -n 10G with iperf.

iperf -c localhost -p 1300 -n 10G

------------------------------------------------------------
Client connecting to localhost, TCP port 1300
TCP window size:  336 KByte (default)
------------------------------------------------------------
[  5] local 127.0.0.1 port 64505 connected with 127.0.0.1 port 1300
write failed: Broken pipe
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  3.59 GBytes  3.08 Gbits/sec

Maybe the problem is write failed: Broken pipe.

zonyitoo avatar Nov 30 '19 16:11 zonyitoo

TEST-1:

iperf -> sstunnel (rust) -> go-shadowsocks2 -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-11.6 sec  10.0 GBytes  7.39 Gbits/sec

TEST-2:

iperf -> sstunnel (rust) -> ssserver (rust) -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-12.6 sec  10.0 GBytes  6.80 Gbits/sec

zonyitoo avatar Nov 30 '19 17:11 zonyitoo

Another test pairs:

TEST-1:

iperf -> sstunnel (rust) -> go-shadowsocks2 -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-37.5 sec  10.0 GBytes  2.29 Gbits/sec

TEST-2:

iperf -> sstunnel (rust) -> ssserver (rust) -> iperf

[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-29.2 sec  10.0 GBytes  2.95 Gbits/sec

The results are not stable, but overall ssserver is faster than go-shadowsocks2.

zonyitoo avatar Nov 30 '19 17:11 zonyitoo

If we get good performance on Linux as well, it could be a Windows-specific issue, maybe in the Rust compiler or in the libraries.

cg31 avatar Dec 01 '19 02:12 cg31

I just tried the Rust and Go versions on my Linux box, a 56-core ThinkStation running Ubuntu 18.04.

The Rust version uses the plain method and the Go version uses the dummy method (both skip encryption).

Rust version gets: [ 3] 0.0-14.3 sec 10.0 GBytes 6.01 Gbits/sec

Go version is incredible: [ 3] 0.0- 3.9 sec 10.0 GBytes 22.1 Gbits/sec

cg31 avatar Dec 02 '19 02:12 cg31

Hmm, nearly 4 times.

In this test, I don't think the number of CPU cores matters, because there is only one connection. So I think the key is the task scheduler's performance.

Or maybe Rust's scheduler performs worse than Go's in a multicore environment? Could you test it with Rust's single-threaded feature and Go's GOMAXPROCS=1? (I think the results should be the same.)
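
For reference, a minimal sketch of forcing tokio 0.2 onto a single thread (roughly what the single-threaded feature selects; the relay body is elided):

use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    // tokio 0.2: basic_scheduler() runs all tasks on the current thread,
    // the closest equivalent of running the Go version with GOMAXPROCS=1.
    let mut rt = Builder::new()
        .basic_scheduler()
        .enable_all() // I/O and timer drivers
        .build()?;
    rt.block_on(async {
        // ... start the relay here ...
    });
    Ok(())
}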

zonyitoo avatar Dec 02 '19 05:12 zonyitoo

When I tested the proxy example: https://github.com/tokio-rs/tokio/blob/master/examples/proxy.rs

with these steps: ./proxy 127.0.0.1:1090 127.0.0.1:5201 & iperf -s -p 5201 & iperf -c 127.0.0.1 -p 1090 -n 10G

I got: [ 3] 0.0-13.4 sec 10.0 GBytes 6.40 Gbits/sec

It is similar to ss-rust, which means the bottleneck is not in ss-rust itself but in tokio.
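
For context, the hot path of that example is roughly the following (simplified from the linked proxy.rs, tokio 0.2 API): two unbuffered tokio::io::copy calls running concurrently.

use tokio::io::copy;
use tokio::net::TcpStream;

// Split both connections and copy in each direction at the same time,
// with no extra buffering layer on top of the sockets.
async fn relay(mut inbound: TcpStream, mut outbound: TcpStream) -> std::io::Result<()> {
    let (mut ri, mut wi) = inbound.split();
    let (mut ro, mut wo) = outbound.split();
    futures::try_join!(copy(&mut ri, &mut wo), copy(&mut ro, &mut wi))?;
    Ok(())
}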

cg31 avatar Dec 02 '19 05:12 cg31

I created a repo with go and rust tests to benchmark the relay speeds https://github.com/cg31/PerfectRelay

cg31 avatar Dec 02 '19 07:12 cg31

I tried using std::net to relay the data, and it looks a lot faster: https://github.com/cg31/PerfectRelay/tree/master/rust_std
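
For comparison, a hedged sketch of that blocking approach (addresses and buffer size are illustrative, not the repo's exact code): one OS thread per direction, plain std::net sockets, no async machinery.

use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;

// Copy bytes from one socket to the other until EOF or an error.
fn pipe(mut from: TcpStream, mut to: TcpStream) {
    let mut buf = [0u8; 8192];
    while let Ok(n) = from.read(&mut buf) {
        if n == 0 || to.write_all(&buf[..n]).is_err() {
            break;
        }
    }
    let _ = to.shutdown(Shutdown::Write);
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:1090")?; // tunnel entry
    for inbound in listener.incoming().flatten() {
        let outbound = TcpStream::connect("127.0.0.1:5201")?; // iperf server
        let (i2, o2) = (inbound.try_clone()?, outbound.try_clone()?);
        thread::spawn(move || pipe(inbound, outbound));
        thread::spawn(move || pipe(o2, i2));
    }
    Ok(())
}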

cg31 avatar Dec 02 '19 12:12 cg31

I tried using std::net to relay the data, and it looks a lot faster: https://github.com/cg31/PerfectRelay/tree/master/rust_std

That means the problem is in tokio's scheduler.

I am curious about the result of async-std.

zonyitoo avatar Dec 02 '19 13:12 zonyitoo

async-std is faster than tokio, but still a lot slower than std: https://github.com/cg31/PerfectRelay/tree/master/rust_async_std

On Windows 10, i7 machine:

tokio: [264] 0.0-78.5 sec 10.0 GBytes 1.09 Gbits/sec

async-std: [268] 0.0-29.9 sec 10.0 GBytes 2.87 Gbits/sec

std: [272] 0.0- 9.3 sec 10.0 GBytes 9.24 Gbits/sec

cg31 avatar Dec 02 '19 14:12 cg31

Adding BufReader and BufWriter will help. BufReader can be added directly:

let mut ri = tokio::io::BufReader::new(ri);
let mut ro = tokio::io::BufReader::new(ro);

BufWriter needs to work together with poll_flush (the buffered bytes must be flushed explicitly).
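
Putting the two together, a hedged sketch of a fully buffered relay (tokio 0.2 API; the explicit flush calls are where poll_flush comes in, so bytes parked in the BufWriter are not lost):

use tokio::io::{copy, AsyncWriteExt, BufReader, BufWriter};
use tokio::net::TcpStream;

async fn relay_buffered(mut inbound: TcpStream, mut outbound: TcpStream) -> std::io::Result<()> {
    let (ri, wi) = inbound.split();
    let (ro, wo) = outbound.split();
    // Buffer both read halves and both write halves.
    let (mut ri, mut ro) = (BufReader::new(ri), BufReader::new(ro));
    let (mut wi, mut wo) = (BufWriter::new(wi), BufWriter::new(wo));
    let a = async {
        copy(&mut ri, &mut wo).await?;
        wo.flush().await // push out whatever is still sitting in the BufWriter
    };
    let b = async {
        copy(&mut ro, &mut wi).await?;
        wi.flush().await
    };
    futures::try_join!(a, b)?;
    Ok(())
}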

zyctree avatar Dec 05 '19 14:12 zyctree

Here is my result with https://github.com/cg31/PerfectRelay/tree/master/rust_async_std

# without BufReader
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-11.8 sec  10.0 GBytes  7.26 Gbits/sec

# with BufReader
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 9.2 sec  10.0 GBytes  9.34 Gbits/sec

zyctree avatar Dec 05 '19 14:12 zyctree

Readers are already built with buffers (for instance, aead.rs).

zonyitoo avatar Dec 05 '19 14:12 zonyitoo

A BufReader and also a BufWriter did the trick.

Without BufReader and BufWriter (https://github.com/cg31/PerfectRelay/tree/master/rust_tokio): [384] 0.0-101.1 sec 10.0 GBytes 850 Mbits/sec

Only using BufReader: [384] 0.0-93.3 sec 10.0 GBytes 921 Mbits/sec

Using both BufReader and BufWriter (https://github.com/cg31/PerfectRelay/tree/master/rust_tokio_buf): [384] 0.0-30.5 sec 10.0 GBytes 2.82 Gbits/sec

cg31 avatar Dec 05 '19 16:12 cg31

I ran this test with the latest release vs go-shadowsocks2 under Debian stable with chacha20-ietf-poly1305; here are the results.

shadowsocks-rust

[  3]  0.0-21.3 sec  10.0 GBytes  4.04 Gbits/sec
[  3]  0.0-20.9 sec  10.0 GBytes  4.12 Gbits/sec
[  3]  0.0-21.0 sec  10.0 GBytes  4.09 Gbits/sec

go-shadowsocks2

[  3]  0.0-12.4 sec  10.0 GBytes  6.92 Gbits/sec
[  3]  0.0-14.0 sec  10.0 GBytes  6.14 Gbits/sec
[  3]  0.0-13.6 sec  10.0 GBytes  6.29 Gbits/sec

kklem0 avatar Jun 08 '20 06:06 kklem0

Here's something interesting for RPi users.

I tested shadowsocks-rust and go-shadowsocks2 on an RPi 4.

shadowsocks-rust server + go-shadowsocks2 client

[  3]  0.0-283.5 sec  5.60 GBytes   170 Mbits/sec
[  3]  0.0-50.5 sec  1.00 GBytes   170 Mbits/sec

go-shadowsocks2 server + go-shadowsocks2 client

[  3]  0.0-52.0 sec  1.00 GBytes   165 Mbits/sec
[  3]  0.0-51.8 sec  1.00 GBytes   166 Mbits/sec

shadowsocks-rust server + shadowsocks-rust client

[  3]  0.0-13.4 sec  1.00 GBytes   641 Mbits/sec
[  3]  0.0-13.8 sec  1.00 GBytes   625 Mbits/sec
[  3]  0.0-13.5 sec  1.00 GBytes   635 Mbits/sec

kklem0 avatar Jun 08 '20 06:06 kklem0

@zonyitoo shall we keep this issue open and pin it for discussing performance and for people to post benchmark results?

kklem0 avatar Jun 08 '20 10:06 kklem0

Sure.

zonyitoo avatar Jun 08 '20 15:06 zonyitoo

So your data shows: on the x86_64 platform, ss-rust (~4 Gbits/sec) is slower than go-shadowsocks2 (~6 Gbits/sec), but on the RPi (ARM), ss-rust is faster.

zonyitoo avatar Jun 08 '20 15:06 zonyitoo

We can first enable some general optimization options, such as using jemallocator and tokio's parking_lot feature.
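
For instance, a hedged sketch of opting into jemalloc via the jemallocator crate (assuming it is added as a dependency; tokio's parking_lot feature is just a Cargo feature flag, with no code change):

// Route all heap allocations through jemalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn main() {
    // Every allocation below now goes through jemalloc.
    let buf: Vec<u8> = Vec::with_capacity(1 << 20);
    drop(buf);
}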

quininer avatar Jun 08 '20 15:06 quininer

jemalloc may increase binary size, which is not good for embedded environments.

zonyitoo avatar Jun 09 '20 08:06 zonyitoo

We don't need to enable it by default, and at least on desktops and servers we don't need to worry about binary size.

quininer avatar Jun 09 '20 08:06 quininer