[WebRTC] investigate data-channels-flow-control example throughput performance issue
Pion achieves more than 500 Mbps:
Peer Connection State has changed: connected (offerer)
Peer Connection State has changed: connected (answerer)
2021/09/25 13:27:29 OnOpen: data-824638619994. Start sending a series of 1024-byte packets as fast as it can
2021/09/25 13:27:29 OnOpen: data-824636958938. Start receiving data
2021/09/25 13:27:30 Throughput: 570.646 Mbps
2021/09/25 13:27:31 Throughput: 569.753 Mbps
2021/09/25 13:27:32 Throughput: 573.001 Mbps
2021/09/25 13:27:33 Throughput: 572.452 Mbps
2021/09/25 13:27:34 Throughput: 571.297 Mbps
2021/09/25 13:27:35 Throughput: 569.525 Mbps
2021/09/25 13:27:36 Throughput: 567.463 Mbps
...
but webrtc-rs only reaches around 13 Mbps:
Peer Connection State has changed: connected (offerer)
Peer Connection State has changed: connected (answerer)
OnOpen: data-1. Start sending a series of 1024-byte packets as fast as it can
OnOpen: data-1. Start receiving data
Throughput: 12.990 Mbps
Throughput: 13.698 Mbps
Throughput: 13.559 Mbps
Throughput: 13.345 Mbps
Throughput: 13.565 Mbps
Throughput: 13.582 Mbps
Building with cargo build --release --example data-channels-flow-control improves performance, but it is still not comparable to Pion:
./target/release/examples/data-channels-flow-control
Press ctrl-c to stop
Peer Connection State has changed: connected (offerer)
Peer Connection State has changed: connected (answerer)
OnOpen: data-1. Start sending a series of 1024-byte packets as fast as it can
OnOpen: data-1. Start receiving data
Throughput: 175.556 Mbps
Throughput: 106.104 Mbps
Throughput: 76.986 Mbps
Throughput: 61.450 Mbps
Throughput: 51.632 Mbps
Throughput: 44.797 Mbps
Throughput: 39.733 Mbps
Throughput: 35.619 Mbps
Throughput: 32.330 Mbps
Throughput: 29.491 Mbps
Throughput: 43.142 Mbps
Throughput: 48.350 Mbps
Throughput: 46.386 Mbps
Throughput: 44.221 Mbps
Throughput: 48.071 Mbps
Throughput: 55.550 Mbps
Throughput: 53.980 Mbps
Throughput: 52.263 Mbps
This is probably hitting a Tokio performance limit.
Here is another benchmark:
goroutines:                3.22234675s total, 3.222346ms avg per iteration
rust_threads:              16.980509645s total, 16.980509ms avg per iteration
rust_tokio:                9.56997204s total, 9.569972ms avg per iteration
rust_tokio_block_in_place: 3.578928749s total, 3.578928ms avg per iteration
https://www.reddit.com/r/rust/comments/lg0a7b/benchmarking_tokio_tasks_and_goroutines/
@rainliu I made some long-running benchmarks comparing Go (Pion) and Rust; maybe they will help.
Go (Pion): from the start of the benchmark, throughput grew up to 844 Mbps:
17:49:06 Throughput: 721.371 Mbps
17:49:07 Throughput: 727.991 Mbps
17:49:08 Throughput: 743.665 Mbps
...
17:49:39 Throughput: 842.728 Mbps
17:49:40 Throughput: 843.339 Mbps
17:49:41 Throughput: 843.672 Mbps
17:49:42 Throughput: 844.272 Mbps
17:49:43 Throughput: 844.782 Mbps
17:49:44 Throughput: 844.855 Mbps
After that, an exception was thrown:
mux ERROR: 17:49:45 mux: ending readLoop dispatch error packetio.Buffer is full, discarding write
and throughput kept slowing down without ever settling at a stable point. I stopped the benchmark at
18:44:45 Throughput: 9.966 Mbps
Rust: from the start of the benchmark, throughput was
Throughput: 229.521 Mbps
Throughput: 231.489 Mbps
Throughput: 231.780 Mbps
Throughput: 231.662 Mbps
Throughput: 231.965 Mbps
After that, it started to drop and reached its lowest point at
Throughput: 23.023 Mbps
Throughput: 22.849 Mbps
Throughput: 22.677 Mbps
After that, it slowly climbed back and oscillated between 41.511 Mbps and 66.436 Mbps.
CPU/RAM
Go
RAM: 22.304 MB
CPU: ~1%
Rust
RAM: 162 MB (when I stopped it; the memory was never released)
CPU: ~120%
@whans Those benchmarks were done on file I/O, not on sockets. Linux behaves differently for files and for sockets.
If you try std's file I/O inside an async block or task::block_in_place, you will get very fast numbers.
That is because Linux uses read-ahead for files.
Also, the benchmarks were run against /dev/urandom and /dev/null, which are in-memory files.
@rainliu I assume the example here uses a socket connection.
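For illustration, a rough sketch of the kind of file-I/O loop the linked benchmark measured (the iteration count, buffer size, and use of block_in_place are my assumptions, not the benchmark's actual code):

use std::io::{Read, Write};
use std::time::Instant;

// Reads a chunk from /dev/urandom and writes it to /dev/null, i.e. purely
// in-memory "I/O" that benefits from kernel read-ahead and never blocks the
// way a socket read does.
fn copy_once(buf: &mut [u8]) {
    let mut src = std::fs::File::open("/dev/urandom").unwrap();
    let mut dst = std::fs::File::create("/dev/null").unwrap();
    src.read_exact(buf).unwrap();
    dst.write_all(buf).unwrap();
}

#[tokio::main(flavor = "multi_thread")]
async fn main() {
    let mut buf = vec![0u8; 64 * 1024];
    let iterations = 1000;

    // Blocking file I/O run directly on an async worker thread.
    let start = Instant::now();
    for _ in 0..iterations {
        copy_once(&mut buf);
    }
    println!("inline:         {:?}", start.elapsed());

    // The same work wrapped in block_in_place, which tells the runtime the
    // current task is about to block.
    let start = Instant::now();
    for _ in 0..iterations {
        tokio::task::block_in_place(|| copy_once(&mut buf));
    }
    println!("block_in_place: {:?}", start.elapsed());
}

Either way, this workload never exercises the socket path that the data channel example depends on.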
@vitdevelop thanks. I compared std::net::UdpSocket vs tokio::net::UdpSocket: std's UdpSocket is almost twice as fast as Tokio's UdpSocket.
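A minimal sketch of the kind of loopback comparison meant here (packet size, iteration count, and the one-packet-in-flight ping-pong pattern are assumptions for illustration, not the exact benchmark that was run):

use std::time::Instant;

const N: usize = 100_000;
const PKT: [u8; 1024] = [0u8; 1024];

// Blocking std sockets: one packet in flight at a time over loopback.
fn bench_std() -> f64 {
    let rx = std::net::UdpSocket::bind("127.0.0.1:0").unwrap();
    let tx = std::net::UdpSocket::bind("127.0.0.1:0").unwrap();
    tx.connect(rx.local_addr().unwrap()).unwrap();
    let mut buf = [0u8; 2048];
    let start = Instant::now();
    for _ in 0..N {
        tx.send(&PKT).unwrap();
        rx.recv_from(&mut buf).unwrap();
    }
    start.elapsed().as_secs_f64()
}

// The same traffic pattern through tokio's async sockets.
async fn bench_tokio() -> f64 {
    let rx = tokio::net::UdpSocket::bind("127.0.0.1:0").await.unwrap();
    let tx = tokio::net::UdpSocket::bind("127.0.0.1:0").await.unwrap();
    tx.connect(rx.local_addr().unwrap()).await.unwrap();
    let mut buf = [0u8; 2048];
    let start = Instant::now();
    for _ in 0..N {
        tx.send(&PKT).await.unwrap();
        rx.recv_from(&mut buf).await.unwrap();
    }
    start.elapsed().as_secs_f64()
}

#[tokio::main]
async fn main() {
    // Running the blocking benchmark inline is fine for a one-shot comparison.
    println!("std:   {:.3}s for {} packets", bench_std(), N);
    println!("tokio: {:.3}s for {} packets", bench_tokio().await, N);
}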
Thanks @vitdevelop and @whans for the benchmarking.
Looks like we need some effort to profile the hotspots/bottlenecks.
Add tokio-console to check for scheduling issues:
https://github.com/tokio-rs/console
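A minimal sketch of how console_subscriber might be wired into the example so tokio-console can attach (this assumes the binary is built with RUSTFLAGS="--cfg tokio_unstable" and tokio's "tracing" feature enabled):

#[tokio::main]
async fn main() {
    // Start the instrumentation layer that the tokio-console TUI connects to
    // (it listens on 127.0.0.1:6669 by default).
    console_subscriber::init();

    // ... the rest of the data-channels-flow-control example runs unchanged ...
}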
@whans thanks for tokio-console, it is an awesome tool.
I tried to check the busy/idle times of the tasks with tokio-console, but it would not connect to console_subscriber.
Later I figured out that I can connect before the offer/answer, so I put in some tokio::time::sleep points to get a picture;
once the offer/answer starts executing, the console hangs.
- 10 sec before create_offerer
- 3 sec before create_answerer
- 3 sec before create_offer and set_remote_description
- 3 sec before create_answer and set_remote_description
Here is the last tokio-console data:

@vitdevelop you need to slow down the packet sending rate; add a sleep in the sending task.
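A rough sketch of what that throttling could look like in the sender loop (the write callback, packet size, and sleep interval are placeholders, not the example's actual data-channel API):

use std::future::Future;
use std::time::Duration;

async fn send_throttled<W, F>(mut write: W, burst: u64)
where
    W: FnMut(Vec<u8>) -> F,
    F: Future<Output = bool>, // false = stop sending
{
    let packet = vec![0u8; 1024];
    let mut sent: u64 = 0;
    loop {
        // `write` stands in for the data channel's async send call.
        if !write(packet.clone()).await {
            break;
        }
        sent += 1;
        // Throttle: yield briefly every `burst` packets so the receiver and
        // the SCTP association can keep up.
        if sent % burst == 0 {
            tokio::time::sleep(Duration::from_millis(1)).await;
        }
    }
}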
@whans The sleep in the sending task helped for some time.
@rainliu After ~5 minutes I noticed that a lot of tasks from webrtc-sctp-0.3.8/src/timer/[ack,rtx]_timer.rs:[43,153] were being spawned, around 500-600 of them.
Most of the CPU busy time is in those tasks.
I attached the screenshot.

@vitdevelop, thanks for the finding. Could you submit a PR to add tokio-console/console_subscriber to the data-channels-flow-control example, so I can take a look?
@rainliu Added PR webrtc-rs/examples#1
Output of perf top -p:

I was doing some testing of my own on sctp, and just writing to streams is almost 22x slower than the same example on the Pion side. I saw an issue about moving to an I/O-free state-machine style; while that would speed up the protocol, it still doesn't explain the root cause. The Rust example shouldn't be 22x slower; it should be at least comparable to the Pion one, if not faster. Has anyone explored the root cause of such a slowdown in the protocol?
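For context, the "I/O-free state machine" style mentioned above usually means an API shaped roughly like the sketch below, where the protocol core never touches sockets or timers and the caller drives all I/O; the names are illustrative only, not an existing webrtc-rs/sctp API:

use std::time::Instant;

// A sans-I/O SCTP endpoint: all protocol state lives here, but the caller
// owns the socket and the clock.
pub struct SctpEndpoint {
    // association state, send/receive queues, timer deadlines, ...
}

impl SctpEndpoint {
    // Feed a datagram that arrived from the network.
    pub fn handle_input(&mut self, _now: Instant, _datagram: &[u8]) {
        // parse chunks, update state, queue any responses
    }

    // Drain the next datagram the endpoint wants to put on the wire.
    pub fn poll_output(&mut self) -> Option<Vec<u8>> {
        None
    }

    // When does the endpoint next need a timer tick (retransmit, delayed ack)?
    pub fn poll_timeout(&self) -> Option<Instant> {
        None
    }

    // Advance timers up to `now`.
    pub fn handle_timeout(&mut self, _now: Instant) {}
}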