Tide without TCP_NODELAY performs unfavorably in benchmarks
I wrote two minimal HTTP servers for performance testing, one with Tide (1.6) and one with Node.js. I expected the Tide one to be faster than the Node.js one, but the result is the opposite... Has anyone given this a try? :)
Testing environment:
macOS 10.14.6
Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz [ 6 cores ]
rustc 1.50.0 (cb75ad5db 2021-02-10)
Here is the source code for both versions:
Both HTTP servers have the same routes and responses:
- HTML response (`Benchmark testing`) for the default route `/`
- JSON response for the `/json-benchmark` route: `{ "name": "Wison Ye", "role": "Administrator", "settings": { "prefer_language": "English", "reload_when_changed": true } }`
Here is my ulimit -a output:
Maximum size of core files created (kB, -c) 0
Maximum size of a process's data segment (kB, -d) unlimited
Maximum size of files created by the shell (kB, -f) unlimited
Maximum size that may be locked into memory (kB, -l) unlimited
Maximum resident set size (kB, -m) unlimited
Maximum number of open file descriptors (-n) 1000000
Maximum stack size (kB, -s) 8192
Maximum amount of cpu time in seconds (seconds, -t) unlimited
Maximum number of processes available to a single user (-u) 3546
Maximum amount of virtual memory available to the shell (kB, -v) unlimited
Here are the test results:
Node.js version:
Recreate the Node project:

npm init -y
# copy the `benchmark_server.js` to the current folder
npm install restify restify-errors
node --version
# v14.16.0

Node spawns 6 cluster workers to serve:
node benchmark_server.js
# setupMaster Cluster worker amount: 6
# setupMaster Cluster worker "1" (PID: 10719) is online.
# setupMaster Cluster worker "3" (PID: 10721) is online.
# setupMaster Cluster worker "2" (PID: 10720) is online.
# setupMaster Cluster worker "4" (PID: 10722) is online.
# setupMaster Cluster worker "5" (PID: 10723) is online.
# setupMaster Cluster worker "6" (PID: 10724) is online.
# run Worker Process 3 (PID: 10721) | "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "3" (PID: 10721) is listening on 127.0.0.1:8080.
# run Worker Process 2 (PID: 10720) | "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "2" (PID: 10720) is listening on 127.0.0.1:8080.
# run Worker Process 6 (PID: 10724) | "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "6" (PID: 10724) is listening on 127.0.0.1:8080.
# run Worker Process 1 (PID: 10719) | "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "1" (PID: 10719) is listening on 127.0.0.1:8080.
# run Worker Process 4 (PID: 10722) | "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "4" (PID: 10722) is listening on 127.0.0.1:8080.
# run Worker Process 5 (PID: 10723) | "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "5" (PID: 10723) is listening on 127.0.0.1:8080.
# `/` default route
wrk --thread 8 --connections 5000 --duration 10s --latency http://127.0.0.1:8080/
Running 10s test @ http://127.0.0.1:8080/
  8 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.21ms    7.06ms  165.76ms   90.33%
    Req/Sec    10.55k     5.94k    40.11k    80.46%
  Latency Distribution
     50%    5.57ms
     75%    9.21ms
     90%   14.07ms
     99%   31.48ms
  769416 requests in 10.06s, 151.16MB read
  Socket errors: connect 0, read 1251, write 0, timeout 0
Requests/sec:  76466.35
Transfer/sec:     15.02MB

# `/json-benchmark` route
wrk --thread 8 --connections 5000 --duration 10s --latency http://127.0.0.1:8080/json-benchmark
Running 10s test @ http://127.0.0.1:8080/json-benchmark
  8 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.45ms    8.56ms  327.08ms   92.21%
    Req/Sec     9.94k     4.52k    28.71k    73.51%
  Latency Distribution
     50%    6.66ms
     75%    9.82ms
     90%   15.27ms
     99%   34.71ms
  729305 requests in 10.06s, 206.57MB read
  Socket errors: connect 0, read 1488, write 3, timeout 0
Requests/sec:  72481.25
Transfer/sec:     20.53MB
Rust version:
Recreate the Rust project:

cargo new benchmark

# Add the dependencies to `Cargo.toml`:
tide = "~0.15"
async-std = { version = "1.8.0", features = ["attributes"] }
async-trait = "^0.1.41"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

# Build the release version
cargo build --bin benchmark_server --release

./target/release/benchmark_server
[ Benchmark Server Demo ]
Benchmark Server is listening on: 0.0.0.0:8080
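For reference, since the server sources aren't inlined in this issue, a minimal Tide server matching the routes and JSON payload described above might look roughly like this (an illustrative sketch only, not the exact `benchmark_server.rs` that was benchmarked):

```rust
use serde::Serialize;
use tide::{Body, Request, Response, StatusCode};

#[derive(Serialize)]
struct Settings {
    prefer_language: String,
    reload_when_changed: bool,
}

#[derive(Serialize)]
struct UserInfo {
    name: String,
    role: String,
    settings: Settings,
}

// `/` returns the small HTML body, `/json-benchmark` returns the JSON payload
// described at the top of this issue.
async fn html_benchmark(_req: Request<()>) -> tide::Result {
    let mut res = Response::new(StatusCode::Ok);
    res.set_content_type(tide::http::mime::HTML);
    res.set_body("Benchmark testing");
    Ok(res)
}

async fn json_benchmark(_req: Request<()>) -> tide::Result {
    let payload = UserInfo {
        name: "Wison Ye".into(),
        role: "Administrator".into(),
        settings: Settings {
            prefer_language: "English".into(),
            reload_when_changed: true,
        },
    };
    let mut res = Response::new(StatusCode::Ok);
    res.set_body(Body::from_json(&payload)?);
    Ok(res)
}

#[async_std::main]
async fn main() -> tide::Result<()> {
    let mut app = tide::new();
    app.at("/").get(html_benchmark);
    app.at("/json-benchmark").get(json_benchmark);

    println!("Benchmark Server is listening on: 0.0.0.0:8080");
    app.listen("0.0.0.0:8080").await?;
    Ok(())
}
```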
# `/` default route
wrk --thread 8 --connections 5000 --duration 10s --latency http://127.0.0.1:8080/
Running 10s test @ http://127.0.0.1:8080/
  8 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.16ms    6.56ms  228.77ms   69.81%
    Req/Sec     8.15k     5.77k    42.38k    88.54%
  Latency Distribution
     50%    9.14ms
     75%   14.20ms
     90%   17.50ms
     99%   23.26ms
  577136 requests in 10.08s, 73.82MB read
  Socket errors: connect 0, read 1493, write 3, timeout 0
Requests/sec:  57266.86
Transfer/sec:      7.32MB

# `/json-benchmark` route
wrk --thread 8 --connections 5000 --duration 10s --latency http://127.0.0.1:8080/json-benchmark
Running 10s test @ http://127.0.0.1:8080/json-benchmark
  8 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.39ms    5.93ms  216.63ms   66.60%
    Req/Sec     7.73k     3.47k    24.06k    67.97%
  Latency Distribution
     50%   10.51ms
     75%   15.89ms
     90%   18.91ms
     99%   22.02ms
  573039 requests in 10.08s, 119.74MB read
  Socket errors: connect 0, read 1120, write 0, timeout 0
Requests/sec:  56862.43
Transfer/sec:     11.88MB
Maybe the same as this issue: #781
So does that mean I need to wait for the next release, since your PR hasn't been merged yet?
Or is there any workaround I can use at the moment? Please :)
Try this: apply this patch in your `Cargo.toml`:
[patch.crates-io]
tide = { git = 'https://github.com/fiag/tide.git', branch='tcp-nodelay' }
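For context, as I understand it the `tcp-nodelay` branch simply enables TCP_NODELAY on each accepted connection. As a rough standalone illustration with plain async-std (not Tide's actual listener code), the idea is:

```rust
use async_std::net::{TcpListener, TcpStream};
use async_std::prelude::*;

// Minimal accept loop showing where TCP_NODELAY would be switched on; the
// patched Tide listener presumably does the equivalent for each connection.
#[async_std::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:8080").await?;
    let mut incoming = listener.incoming();

    while let Some(stream) = incoming.next().await {
        let stream: TcpStream = stream?;
        // Disable Nagle's algorithm so small responses are flushed immediately
        // instead of being buffered while waiting for more data.
        stream.set_nodelay(true)?;
        // ... hand the stream off to the HTTP handling task here ...
    }
    Ok(())
}
```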
Hi @fiag, thanks for the patch. Actually... the result is quite funny :)
After adding your patch to `Cargo.toml`, I ran:
# Actually, I also deleted the `Cargo.lock` file
cargo update
cargo clean && cargo build --bin benchmark_server --release
Then I ran the release version and tested it again:
After that, I tested the Node version again:
And here is the result. I took screenshots and aligned them side by side to make them easier to compare:
As you can see above, the Rust version SHOULD be faster than the Node version, as its latency is lower and so on (highlighted in green). But somehow the Node version can handle a lot more connections than the Rust one, and that's why the final result shows the Node version getting more throughput... (Btw, I used my iMac to run the test above, which is why the result differs from the one at the very beginning of this issue, which was run on my MacBook Pro.)
I've already considered that the Node version spawns a few child processes (even though the `ps` command shows it has the same thread count as the Rust binary), but Tide uses async-std, which means it still spawns the same number of threads (matching my CPU core count), as `ASYNC_STD_THREAD_COUNT` does that by default. Also, async-std uses Rust's async model, which should be more efficient than the normal IPC that Node's cluster module uses... I just don't get why the final result looks like that. Could anyone give this a try, please? :)
Also, I ran into the same situation and comparison result in my production service. I built a binary protocol parser for encoding/decoding hardware network data that is transferred over TCP.
I made a performance test for both the TypeScript version (run in Node) and the Rust version (release binary). The test is very simple: just run the decode function in a for loop to parse the same lines of binary protocol data (basically, just a bunch of `byte[]` / `[u8]`).
But the result is pretty funny: the TypeScript one got more throughput than the Rust one. My guesses:
Maybe in every for-loop iteration, the Rust version always allocates and deallocates all the local variables (which should add up to a few million operations during the test), as I saw the Rust binary's memory footprint stays around 428KB and is very stable.
But the Node version uses around 32MB to run through the test (to get that high-throughput result). So I guess V8 never runs the GC (to free the memory)? :)
Could that be the same potential reason for the Tide test result above?
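To make that allocation guess concrete, here is the kind of pattern meant — a hypothetical decode loop (the function and frame contents are made up, not the real parser), comparing a fresh allocation per iteration with reusing one buffer:

```rust
// Stand-in for the real decoding work: clears and refills a reusable buffer.
fn decode_into(frame: &[u8], out: &mut Vec<u8>) {
    out.clear();                  // reuse existing capacity instead of reallocating
    out.extend_from_slice(frame); // pretend this is the actual protocol decoding
}

fn main() {
    let frame: Vec<u8> = vec![0xAB; 256]; // one sample "line" of protocol data
    let mut checksum: u64 = 0;            // keeps the work from being optimized away

    // Variant A: a fresh Vec every iteration -> millions of allocate/free pairs.
    for _ in 0..1_000_000u32 {
        let mut decoded = Vec::with_capacity(frame.len());
        decoded.extend_from_slice(&frame);
        checksum += decoded[0] as u64;
    }

    // Variant B: one buffer reused across all iterations -> a single allocation.
    let mut decoded = Vec::with_capacity(frame.len());
    for _ in 0..1_000_000u32 {
        decode_into(&frame, &mut decoded);
        checksum += decoded[0] as u64;
    }

    println!("checksum: {}", checksum);
}
```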
@wisonye This might be out of scope, but I'd be curious to see how other web frameworks written in Rust perform here, and whether they produce similar results.
Yup, it's not out of scope :) I also want to see how other frameworks perform. If you have time, add a minimal demo here and let's see what happens :)
I found Tide slower than Spring.
Sorry, I've been too busy using Tide in prod to dig into this.
I can tell you from production experience: it is orders of magnitude faster than Node.js for common workloads.
autocannon against your node.js example:
autocannon 192.168.0.10:8080 -c 16 -W -w 8 -d 20
Running 20s warmup @ http://192.168.0.10:8080
16 connections
8 workers
Running 20s test @ http://192.168.0.10:8080
16 connections
8 workers
| Stat    | 2.5% | 50%  | 97.5% | 99%  | Avg     | Stdev   | Max    |
|---------|------|------|-------|------|---------|---------|--------|
| Latency | 0 ms | 1 ms | 1 ms  | 1 ms | 0.61 ms | 1.46 ms | 155 ms |

| Stat      | 1%      | 2.5%    | 50%    | 97.5%   | Avg     | Stdev  | Min     |
|-----------|---------|---------|--------|---------|---------|--------|---------|
| Req/Sec   | 15199   | 15199   | 15535  | 17679   | 15782.4 | 644.56 | 15198   |
| Bytes/Sec | 3.13 MB | 3.13 MB | 3.2 MB | 3.64 MB | 3.25 MB | 133 kB | 3.13 MB |
Req/Bytes counts sampled once per second.
316k requests in 20.05s, 65 MB read
Autocannon against Tide (--release):
autocannon 192.168.0.10:8080 -c 16 -W -w 8 -d 20
Running 20s warmup @ http://192.168.0.10:8080
16 connections
8 workers
Running 20s test @ http://192.168.0.10:8080
16 connections
8 workers
| Stat    | 2.5% | 50%  | 97.5% | 99%  | Avg     | Stdev  | Max    |
|---------|------|------|-------|------|---------|--------|--------|
| Latency | 0 ms | 1 ms | 2 ms  | 2 ms | 1.16 ms | 0.9 ms | 110 ms |

| Stat      | 1%     | 2.5%   | 50%     | 97.5%   | Avg     | Stdev  | Min    |
|-----------|--------|--------|---------|---------|---------|--------|--------|
| Req/Sec   | 8967   | 8967   | 9167    | 11063   | 9449.6  | 530.04 | 8966   |
| Bytes/Sec | 1.2 MB | 1.2 MB | 1.23 MB | 1.48 MB | 1.27 MB | 71 kB  | 1.2 MB |
Req/Bytes counts sampled once per second.
189k requests in 20.04s, 25.3 MB read
That's kinda odd. It's definitely not what we observe but we also don't stress our Rust processes much (because they are plenty fast to carry our load).
Notes: This was done by running the benchmarker on my laptop (a slower machine) against the server examples on my desktop (a faster machine). Everything is wired together on gigabit ethernet.
Linux perf counter stats seem to indicate this is artificial (possibly TCP no_delay related):
Performance counter stats for 'node benchmark_server.js':
117,093.95 msec task-clock # 2.461 CPUs utilized
469,264 context-switches # 0.004 M/sec
86,414 cpu-migrations # 0.738 K/sec
102,170 page-faults # 0.873 K/sec
272,687,722,707 cycles # 2.329 GHz
122,404,642,364 instructions # 0.45 insn per cycle
25,706,131,492 branches # 219.534 M/sec
1,663,729,936 branch-misses # 6.47% of all branches
Performance counter stats for 'cargo run --release':
49,355.48 msec task-clock # 0.430 CPUs utilized
863,669 context-switches # 0.017 M/sec
23,368 cpu-migrations # 0.473 K/sec
7,929 page-faults # 0.161 K/sec
85,474,611,529 cycles # 1.732 GHz
43,426,351,109 instructions # 0.51 insn per cycle
8,534,130,491 branches # 172.912 M/sec
529,585,015 branch-misses # 6.21% of all branches
Of note there, Tide does a bunch more context switching, but it's not too bad, I think.
Tide however uses less than a third of the cpu cycles.
With tcp no_delay enabled via @jbr's draft PR (https://github.com/http-rs/tide/pull/823) I get:
Running 20s warmup @ http://192.168.0.10:8080
16 connections
8 workers
Running 20s test @ http://192.168.0.10:8080
16 connections
8 workers
| Stat    | 2.5% | 50%  | 97.5% | 99%  | Avg     | Stdev   | Max    |
|---------|------|------|-------|------|---------|---------|--------|
| Latency | 0 ms | 0 ms | 1 ms  | 2 ms | 0.53 ms | 0.98 ms | 130 ms |

| Stat      | 1%      | 2.5%    | 50%     | 97.5%   | Avg     | Stdev   | Min     |
|-----------|---------|---------|---------|---------|---------|---------|---------|
| Req/Sec   | 12887   | 12887   | 14711   | 19167   | 15810.6 | 2260.19 | 12885   |
| Bytes/Sec | 1.73 MB | 1.73 MB | 1.97 MB | 2.57 MB | 2.12 MB | 303 kB  | 1.73 MB |
Req/Bytes counts sampled once per second.
316k requests in 20.03s, 42.4 MB read
Which is about on-par. I think my laptop is now the limiting factor. I'll try to run the benchmark in reverse.
Also, for that last example, we're still using only about half the cpu cycles for the same number of requests as Node.
Performance counter stats for 'cargo run --release':
55,935.65 msec task-clock # 0.565 CPUs utilized
1,361,880 context-switches # 0.024 M/sec
22,762 cpu-migrations # 0.407 K/sec
7,881 page-faults # 0.141 K/sec
126,799,597,556 cycles # 2.267 GHz
71,088,504,957 instructions # 0.56 insn per cycle
13,899,982,207 branches # 248.500 M/sec
583,560,089 branch-misses # 4.20% of all branches
I am going to caution that no_delay may be ideal for this benchmarking workload but may not be ideal in the real world.
I ran tfb with TCP_NODELAY; it gives a big improvement in Req/Sec, but the latency increased a lot.
**TCP_NODELAY**
8 threads and 8 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 644.08ms 1.07s 3.67s 81.67%
Req/Sec 12.16k 4.52k 14.46k 87.83%
Compared with:
**tide = "0.16.0"**
8 threads and 8 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 15.47ms 16.24ms 45.97ms 80.29%
Req/Sec 93.20 32.76 272.00 68.75%
The result looks really strange. I'm not familiar with this topic, but I think the following link will help. https://stackoverflow.com/questions/3761276/when-should-i-use-tcp-nodelay-and-when-tcp-cork
Also, warp currently gives a more satisfying result on my computer.
**warp**
8 threads and 8 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 497.54us 550.14us 18.32ms 97.70%
Req/Sec 2.02k 309.63 2.96k 78.77%
It's understandable that a microbenchmark doesn't reflect the whole real world, but the result can't persuade me that Tide is as good as other web app frameworks.
@Fishrock123 @wisonye @slhmy Are there any tools to inspect where time is spent while the server is running? That might offer some clues as to what you were seeing, @wisonye.
Don't know if flamegraph (https://github.com/flamegraph-rs/flamegraph) will help... I'm kind of busy nowadays.
I want to take bottom-up perf stacks but don't know how to do that offhand with Rust (and I am super busy).
autocannon against benchmark_server.js:
❯ autocannon 192.168.100.108:8080 -c 16 -W -w 8 -d 20 (base)
Running 20s warmup @ http://192.168.100.108:8080
16 connections
8 workers
Running 20s test @ http://192.168.100.108:8080
16 connections
8 workers
| Stat    | 2.5% | 50%  | 97.5% | 99%  | Avg     | Stdev   | Max    |
|---------|------|------|-------|------|---------|---------|--------|
| Latency | 0 ms | 0 ms | 0 ms  | 1 ms | 0.34 ms | 7.63 ms | 376 ms |

| Stat      | 1%      | 2.5%    | 50%     | 97.5%   | Avg     | Stdev   | Min     |
|-----------|---------|---------|---------|---------|---------|---------|---------|
| Req/Sec   | 16879   | 16879   | 32895   | 39871   | 31267.6 | 5363.64 | 16871   |
| Bytes/Sec | 3.09 MB | 3.09 MB | 6.02 MB | 7.29 MB | 5.72 MB | 981 kB  | 3.09 MB |
Req/Bytes counts sampled once per second.
625k requests in 20.21s, 114 MB read
autocannon against tide --release, with TCP_NODELAY:
❯ autocannon 192.168.100.108:8080 -c 16 -W -w 8 -d 20 (base)
Running 20s warmup @ http://192.168.100.108:8080
16 connections
8 workers
Running 20s test @ http://192.168.100.108:8080
16 connections
8 workers
| Stat    | 2.5% | 50%  | 97.5% | 99%  | Avg     | Stdev   | Max    |
|---------|------|------|-------|------|---------|---------|--------|
| Latency | 0 ms | 0 ms | 0 ms  | 1 ms | 0.07 ms | 1.68 ms | 255 ms |

| Stat      | 1%      | 2.5%    | 50%     | 97.5%   | Avg     | Stdev   | Min     |
|-----------|---------|---------|---------|---------|---------|---------|---------|
| Req/Sec   | 20015   | 20015   | 45023   | 47871   | 43102.4 | 6650.74 | 20001   |
| Bytes/Sec | 2.68 MB | 2.68 MB | 6.03 MB | 6.41 MB | 5.78 MB | 891 kB  | 2.68 MB |
Req/Bytes counts sampled once per second.
862k requests in 20.01s, 116 MB read
And I made a flamegraph: flamegraph.svg.zip
@slhmy Hey, sorry for the late reply, I've been very busy lately. And YES, I think you're right. Actually, I think the real choice is between async-std and tokio, as that's the bigger difference under the hood :) It very much depends :)
@fiag ... That's funny :) But I remember that I did give it a try based on your patch branch for the TCP_NODELAY setting, and the result I got didn't look much different. How come it shows such a big difference when you use it? :)
Maybe more comparisons need to be made.
I currently have actix-web working with sqlx (sqlx runs a tokio runtime, which is compatible with actix-web 4.0-beta) and there is also a performance issue... (Check this issue. It is temporarily solved by doing the querying on one connection.)
I also found there is a huge performance loss if I put an async server into a Docker machine.
Putting the above together, I guess maybe async-std spends a lot of time switching between threads, but I can't make a flamegraph for machine-related reasons... so it's only my guess.
@slhmy Thanks for that:) Also, here is my personal opinion:
- > (Check this issue. It is temporarily solved by doing the querying on one connection.)

  I did check that issue and I think jplatte's answer is good. As usual, the `Pool` is just responsible for **getting an existing, free connection back to you when you ask for one**; if none exists, it creates a new connection instance and caches it (usually in a hashmap). What you were trying to do is ask the pool to give you a new one and use it immediately (before it is released, which looks more like there being no free connection instance in the pool). I think that's why that "bug" shows up, just a guess :) (see the sketch after this list)

- > I also found there is a huge performance loss if I put an async server into a Docker machine.

  I did use async-std in production, and that high-performance TCP server is running inside Docker Swarm as well, and I didn't see any slowness there. So what's your case actually? :)

- > I guess maybe async-std consumes a lot of time to switch between threads

  I think you can go and ask in the async-std Discord channel. By default it just spawns a few threads based on how many CPU cores you have, which should not be a very big problem for switching between threads (I think), and it should work like asking for a free OS/native thread from the internal thread pool :)
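To illustrate the pool point, here is a rough sketch of what "querying on one connection" could look like with sqlx (the table name, query, connection string, and runtime feature choice are all made up for illustration, and the exact API may differ between sqlx versions):

```rust
use sqlx::postgres::PgPoolOptions;
use sqlx::Row;

#[async_std::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .max_connections(5)
        .connect("postgres://user:pass@localhost/benchmark")
        .await?;

    // Variant A: every query goes through the pool, so each one has to
    // check out (and possibly wait for) a connection.
    for id in 0..500i64 {
        let _row = sqlx::query("SELECT name FROM users WHERE id = $1")
            .bind(id)
            .fetch_one(&pool)
            .await?;
    }

    // Variant B: check out one connection up front and run the whole batch
    // on it, which is roughly what "querying in one connection" means above.
    let mut conn = pool.acquire().await?;
    for id in 0..500i64 {
        let row = sqlx::query("SELECT name FROM users WHERE id = $1")
            .bind(id)
            .fetch_one(&mut *conn)
            .await?;
        let _name: String = row.try_get("name")?;
    }

    Ok(())
}
```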
@wisonye

> I also found there is a huge performance loss if I put an async server into a Docker machine.
>
> I did use async-std in production, and that high-performance TCP server is running inside Docker Swarm as well, and I didn't see any slowness there. So what's your case actually? :)

Thanks a lot for your help; actually it's related to the issue I posted.
I run the service in a Docker container, and actix + sqlx takes more than 20s to request 500 rows from the database, while others like tide + sqlx don't (they only take around 300ms).
The 20s problem only appears in the Docker machine built by tfb debug mode (tfb debug mode automatically runs two containers, one for the database and one for the server); when the server is not in Docker it doesn't take that long.
Anyway, I also think your opinion is correct, so I will do more trials and rule out my own setup issues if I can, and then the guess may come to a conclusion.