shadowsocks-rust
Performance regression after switching to tokio-0.2
I tried using go-shadowsocks2 as a tunnel for a quick test, and the result doesn't look great...
I was using a release build of the shadowsocks-rust async branch in this test.
This is what I did:
1. Start the shadowsocks-rust server: ssserver -s 127.0.0.1:8488 -m chacha20-ietf -k password
2. Start go-shadowsocks2 as a tunnel: go-shadowsocks2 -verbose -c ss://chacha20-ietf:password@localhost:8488 -tcptun :1090=127.0.0.1:5201
3. Start the iperf server: iperf -s -p 5201
4. Run iperf as the client: iperf -c localhost -p 1090 -n 10G
The result was: [388] 0.0-130.7 sec 10.0 GBytes 657 Mbits/sec
If I replaced step 1 with go-shadowsocks2 as server: go-shadowsocks2 -verbose -s ss://chacha20-ietf:password@:8488
The result was: [384] 0.0-27.4 sec 10.0 GBytes 3.13 Gbits/sec
I have seen many complaints about performance regressions after migrating to tokio v0.2.
Let's find out whether there is anything that can be optimized in this project first.
I just tried the master branch; it looks like it also needs some optimization: [384] 0.0-89.2 sec 10.0 GBytes 963 Mbits/sec
This is where a tunnel is useful.
For the record, I am using stable-x86_64-pc-windows-msvc, rustc 1.39.0.
The master branch is using futures v0.1, which requires lots of Future transformations; that should be slower.
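To illustrate, here is a minimal sketch of the two styles (written against the futures 0.3 crate so both compile side by side; the combinator shape mirrors what futures v0.1 code looks like, and the values are just placeholders):

```rust
use futures::executor::block_on;
use futures::future::{self, TryFutureExt};

// Combinator style: every step wraps the previous future in another adapter
// type, and each layer adds its own poll() indirection.
fn combinator_style(x: u32) -> impl std::future::Future<Output = Result<u32, ()>> {
    future::ready(Ok::<u32, ()>(x))
        .and_then(|v| future::ready(Ok(v + 1)))
        .and_then(|v| future::ready(Ok(v * 2)))
}

// async/await style: the compiler turns the whole function body into a single
// state machine, so there is far less per-step wrapping.
async fn async_style(x: u32) -> Result<u32, ()> {
    Ok((x + 1) * 2)
}

fn main() {
    assert_eq!(block_on(combinator_style(1)), Ok(4));
    assert_eq!(block_on(async_style(1)), Ok(4));
}
```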
I didn't see any obvious performance problem on the feature-migrate-async-await branch. Any ideas?
It is hard to spot the bottleneck just by looking at the code. Profiling tools are needed to find the hot spots.
My test with the latest commit (with tokio v0.2.2):
System
Darwin 19.0.0 Darwin Kernel Version 19.0.0: Thu Oct 17 16:17:15 PDT 2019; root:xnu-6153.41.3~29/RELEASE_X86_64 x86_64
TEST-1:
iperf -> sstunnel (rust) -> go-shadowsocks2 -> iperf
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.0 sec 2.92 GBytes 2.50 Gbits/sec
TEST-2:
iperf -> sstunnel (rust) -> ssserver (rust) -> iperf
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.0 sec 3.60 GBytes 3.09 Gbits/sec
Well, it is actually a lot faster than Go's version. Right?
EDIT:
- I was using the aes-256-gcm method.
- sstunnel and ssserver were built without the single-threaded feature (so they run with the multi-threaded scheduler).
- Built without the trust-dns feature (it is still broken anyway), so I am using tokio's default DNS resolver.
Maybe it is related to the OS?
Could you pass -n 10G to iperf, so we can compare with the result on Windows?
I have already used -n 10G with iperf.
iperf -c localhost -p 1300 -n 10G
------------------------------------------------------------
Client connecting to localhost, TCP port 1300
TCP window size: 336 KByte (default)
------------------------------------------------------------
[ 5] local 127.0.0.1 port 64505 connected with 127.0.0.1 port 1300
write failed: Broken pipe
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.0 sec 3.59 GBytes 3.08 Gbits/sec
Maybe the problem is write failed: Broken pipe.
TEST-1:
iperf -> sstunnel (rust) -> go-shadowsocks2 -> iperf
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-11.6 sec 10.0 GBytes 7.39 Gbits/sec
TEST-2:
iperf -> sstunnel (rust) -> ssserver (rust) -> iperf
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-12.6 sec 10.0 GBytes 6.80 Gbits/sec
Another test pairs:
TEST-1:
iperf -> sstunnel (rust) -> go-shadowsocks2 -> iperf
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-37.5 sec 10.0 GBytes 2.29 Gbits/sec
TEST-2:
iperf -> sstunnel (rust) -> ssserver (rust) -> iperf
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-29.2 sec 10.0 GBytes 2.95 Gbits/sec
Results are not stable, but overall ssserver is faster than go-shadowsocks2.
If we can get good performance on Linux as well, it could be a Windows-specific issue, maybe in the Rust compiler or in a library.
I just tried the Rust and Go versions on my Linux box, a ThinkStation with 56 cores running Ubuntu 18.04.
The Rust version uses the plain method, and the Go version uses the dummy method.
Rust version gets: [ 3] 0.0-14.3 sec 10.0 GBytes 6.01 Gbits/sec
Go version is incredible: [ 3] 0.0- 3.9 sec 10.0 GBytes 22.1 Gbits/sec
Hmm.. nearly 4 times.
In this test, I don't think the number of CPU cores matters, because there is only one connection. So I think the key is the task scheduler's performance.
Or... maybe Rust's scheduler performs worse than Go's in a multicore environment? Could you test it with Rust's single-threaded feature and Go's GOMAXPROCS=1? (I think the results should be the same.)
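For the Rust side, a rough sketch of what that comparison would exercise (tokio 0.2's runtime builder; I assume the single-threaded feature selects something like the basic scheduler):

```rust
// Compare tokio 0.2's basic (current-thread) scheduler against the default
// threaded scheduler for the same workload.
fn main() -> std::io::Result<()> {
    // Single-threaded scheduler, roughly the counterpart of GOMAXPROCS=1.
    let mut single = tokio::runtime::Builder::new()
        .basic_scheduler()
        .enable_all()
        .build()?;
    single.block_on(async {
        // run the relay / benchmark here
    });

    // Multi-threaded work-stealing scheduler (the default).
    let mut multi = tokio::runtime::Builder::new()
        .threaded_scheduler()
        .enable_all()
        .build()?;
    multi.block_on(async {
        // run the relay / benchmark here
    });
    Ok(())
}
```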
When I tested the proxy sample https://github.com/tokio-rs/tokio/blob/master/examples/proxy.rs
with these steps: ./proxy 127.0.0.1:1090 127.0.0.1:5201 & iperf -s -p 5201 & iperf -c 127.0.0.1 -p 1090 -n 10G
I got: [ 3] 0.0-13.4 sec 10.0 GBytes 6.40 Gbits/sec
which is similar to ss-rust. That means the bottleneck is not in ss-rust itself, but in tokio.
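For reference, the core of that proxy example is just a bidirectional copy; a minimal sketch on tokio 0.2 (assuming the full feature set, with the addresses from the command above) looks like this:

```rust
use tokio::io::copy;
use tokio::net::{TcpListener, TcpStream};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut listener = TcpListener::bind("127.0.0.1:1090").await?;
    loop {
        let (inbound, _) = listener.accept().await?;
        tokio::spawn(async move {
            let outbound = TcpStream::connect("127.0.0.1:5201").await?;
            let (mut ri, mut wi) = tokio::io::split(inbound);
            let (mut ro, mut wo) = tokio::io::split(outbound);
            // Relay both directions until either side closes.
            tokio::try_join!(copy(&mut ri, &mut wo), copy(&mut ro, &mut wi))?;
            Ok::<_, std::io::Error>(())
        });
    }
}
```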
I created a repo with go and rust tests to benchmark the relay speeds https://github.com/cg31/PerfectRelay
I tried to use std::net to relay the data, it looks just faster: https://github.com/cg31/PerfectRelay/tree/master/rust_std
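The core of that std::net version is just a blocking copy per direction, two OS threads per connection. A minimal sketch (the actual code is in the repo; addresses here are placeholders):

```rust
use std::io::copy;
use std::net::{TcpListener, TcpStream};
use std::thread;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:1090")?;
    for inbound in listener.incoming() {
        let inbound = inbound?;
        thread::spawn(move || -> std::io::Result<()> {
            let outbound = TcpStream::connect("127.0.0.1:5201")?;
            // try_clone() gives a second handle so each direction gets its own thread.
            let (mut in_r, mut in_w) = (inbound.try_clone()?, inbound);
            let (mut out_r, mut out_w) = (outbound.try_clone()?, outbound);
            let upstream = thread::spawn(move || copy(&mut in_r, &mut out_w));
            copy(&mut out_r, &mut in_w)?;
            upstream.join().expect("relay thread panicked")?;
            Ok(())
        });
    }
    Ok(())
}
```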
That means the problem is in tokio's scheduler.
I am curious about the result of async-std.
async-std is faster than tokio, but still a lot slower than std: https://github.com/cg31/PerfectRelay/tree/master/rust_async_std
On Windows 10, i7 machine:
tokio: [264] 0.0-78.5 sec 10.0 GBytes 1.09 Gbits/sec
async-std: [268] 0.0-29.9 sec 10.0 GBytes 2.87 Gbits/sec
std: [272] 0.0- 9.3 sec 10.0 GBytes 9.24 Gbits/sec
Adding BufReader and BufWriter will help.
BufReader can be added directly:
let mut ri = tokio::io::BufReader::new(ri);
let mut ro = tokio::io::BufReader::new(ro);
BufWriter needs to work with poll_flush.
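For example, buffering one relay direction could look like the sketch below (tokio 0.2; the generic reader/writer stand in for the split halves above, and the buffer size is just an example). Calling flush() is what drives poll_flush once the copy finishes.

```rust
use tokio::io::{copy, AsyncRead, AsyncWrite, AsyncWriteExt, BufReader, BufWriter};

// Buffer both sides of one relay direction.
async fn relay_buffered<R, W>(reader: R, writer: W) -> std::io::Result<u64>
where
    R: AsyncRead + Unpin,
    W: AsyncWrite + Unpin,
{
    let mut reader = BufReader::with_capacity(16 * 1024, reader);
    let mut writer = BufWriter::with_capacity(16 * 1024, writer);
    let n = copy(&mut reader, &mut writer).await?;
    // Make sure any bytes still sitting in the BufWriter reach the socket
    // before returning, then shut down the write side.
    writer.flush().await?;
    writer.shutdown().await?;
    Ok(n)
}
```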
Here is my result for https://github.com/cg31/PerfectRelay/tree/master/rust_async_std:
# without BufReader
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-11.8 sec 10.0 GBytes 7.26 Gbits/sec
# with BufReader
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 9.2 sec 10.0 GBytes 9.34 Gbits/sec
Readers are already built with buffers (for instance, aead.rs).
A BufReader and also a BufWriter did the trick.
without BufReader and BufWriter https://github.com/cg31/PerfectRelay/tree/master/rust_tokio [384] 0.0-101.1 sec 10.0 GBytes 850 Mbits/sec
only using BufReader [384] 0.0-93.3 sec 10.0 GBytes 921 Mbits/sec
using both BufReader and BufWriter https://github.com/cg31/PerfectRelay/tree/master/rust_tokio_buf [384] 0.0-30.5 sec 10.0 GBytes 2.82 Gbits/sec
I ran this test with the latest release vs go-shadowsocks2 under Debian stable with chacha20-ietf-poly1305; here are the results.
shadowsocks-rust
[ 3] 0.0-21.3 sec 10.0 GBytes 4.04 Gbits/sec
[ 3] 0.0-20.9 sec 10.0 GBytes 4.12 Gbits/sec
[ 3] 0.0-21.0 sec 10.0 GBytes 4.09 Gbits/sec
go-shadowsocks2
[ 3] 0.0-12.4 sec 10.0 GBytes 6.92 Gbits/sec
[ 3] 0.0-14.0 sec 10.0 GBytes 6.14 Gbits/sec
[ 3] 0.0-13.6 sec 10.0 GBytes 6.29 Gbits/sec
Here's something interesting for RPi users.
I tested shadowsocks-rust and go-shadowsocks2 under RPi4.
shadowsocks-rust server + go-shadowsocks2 client
[ 3] 0.0-283.5 sec 5.60 GBytes 170 Mbits/sec
[ 3] 0.0-50.5 sec 1.00 GBytes 170 Mbits/sec
go-shadowsocks2 server + go-shadowsocks2 client
[ 3] 0.0-52.0 sec 1.00 GBytes 165 Mbits/sec
[ 3] 0.0-51.8 sec 1.00 GBytes 166 Mbits/sec
shadowsocks-rust server + shadowsocks-rust client
[ 3] 0.0-13.4 sec 1.00 GBytes 641 Mbits/sec
[ 3] 0.0-13.8 sec 1.00 GBytes 625 Mbits/sec
[ 3] 0.0-13.5 sec 1.00 GBytes 635 Mbits/sec
@zonyitoo shall we keep this issue open and pin it for discussing performance and for people to post benchmark results?
Sure.
So to summarize your results: on x86_64, ss-rust (~4 Gbits/sec) is slower than ss-go2 (~6-7 Gbits/sec), but on the RPi (ARM), ss-rust is faster.
We can first enable some general optimization options, such as using jemallocator and tokio's parking_lot feature.
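For jemallocator, opting in is a one-item change (it could also sit behind a cargo feature so it stays optional); a minimal sketch:

```rust
// With the jemallocator crate as a dependency, switch the global allocator:
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn main() {
    // Every heap allocation in the process now goes through jemalloc.
    let buf: Vec<u8> = vec![0u8; 16 * 1024];
    println!("allocated {} bytes via jemalloc", buf.len());
}
```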
jemalloc may increase binary size, which is not good for embedded environments.
We don't need to enable it by default, and at least on desktops and servers we don't need to worry about binary size.