nim-chronos
[WIP] plaintext benchmark
- [x] finishing bench bot implementation
- [x] add participant: rust actix
- [x] add participant: go-lang fasthttp
- [x] add participant: c libreactor
- [x] polishing benchmark report
- [x] completing thread/nothread stuff
- [ ] rewrite response section code
- [x] add nodocker script
I'm sorry @jangko, but there is no reason to benchmark multi-threaded apps against single-threaded ones. Could you please try to limit the number of processes/threads used by each chosen framework?
Sure, I will try to make the benchmark as fair as possible for each participant.
From my tests on a VM with only 2 processors available, mofuw is not so performant, and it also produces errors instead of successful responses:
```
Running 10s test @ http://127.0.0.1:34500
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.02ms    4.57ms  61.20ms   97.04%
    Req/Sec    17.78k     6.53k   32.87k    59.20%
  355475 requests in 10.10s, 44.75MB read
  Non-2xx or 3xx responses: 355475
Requests/sec:  35197.01
Transfer/sec:      4.43MB
./wrk http://127.0.0.1:34500  1.02s user 4.18s system 51% cpu 10.105 total

cheatfate@phantom ~/wrk (git)-[master] % ./wrk http://127.0.0.1:34500
Running 10s test @ http://127.0.0.1:34500
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.12ms   14.53ms 151.12ms   95.69%
    Req/Sec    17.17k     2.89k   20.92k    82.67%
  345100 requests in 10.10s, 43.44MB read
  Non-2xx or 3xx responses: 345100
Requests/sec:  34168.76
Transfer/sec:      4.30MB
./wrk http://127.0.0.1:34500  1.13s user 4.13s system 51% cpu 10.110 total

cheatfate@phantom ~/wrk (git)-[master] % ./wrk http://127.0.0.1:34500
Running 10s test @ http://127.0.0.1:34500
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.55ms   22.19ms 220.08ms   95.90%
    Req/Sec    16.85k     2.76k   31.98k    82.59%
  337011 requests in 10.10s, 42.42MB read
  Non-2xx or 3xx responses: 337011
Requests/sec:  33366.97
Transfer/sec:      4.20MB
./wrk http://127.0.0.1:34500  1.23s user 3.78s system 49% cpu 10.145 total
```
While on the same VM, the asyncdispatch2 benchmark produces this output:
```
cheatfate@phantom ~/wrk (git)-[master] % ./wrk http://127.0.0.1:8885
Running 10s test @ http://127.0.0.1:8885
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   208.75us  184.01us  10.27ms   98.49%
    Req/Sec    24.09k     3.90k   28.10k    77.23%
  484088 requests in 10.10s, 24.93MB read
Requests/sec:  47929.24
Transfer/sec:      2.47MB
./wrk http://127.0.0.1:8885  1.06s user 5.18s system 61% cpu 10.104 total

cheatfate@phantom ~/wrk (git)-[master] % ./wrk http://127.0.0.1:8885
Running 10s test @ http://127.0.0.1:8885
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   182.87us  107.18us   6.93ms   97.22%
    Req/Sec    26.52k     1.07k   29.06k    59.00%
  527788 requests in 10.01s, 27.18MB read
Requests/sec:  52746.72
Transfer/sec:      2.72MB
./wrk http://127.0.0.1:8885  1.48s user 5.84s system 73% cpu 10.009 total

cheatfate@phantom ~/wrk (git)-[master] % ./wrk http://127.0.0.1:8885
Running 10s test @ http://127.0.0.1:8885
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   218.96us  240.94us  10.37ms   98.65%
    Req/Sec    23.38k     4.37k   28.07k    70.79%
  469746 requests in 10.10s, 24.19MB read
Requests/sec:  46510.70
Transfer/sec:      2.40MB
./wrk http://127.0.0.1:8885  1.05s user 4.96s system 59% cpu 10.103 total
```
As you can see, there are no `Non-2xx or 3xx responses` lines here, so wrk received normal HTTP responses.

mofuw needs the `/plaintext` URI to avoid `Non-2xx or 3xx responses`.
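A hypothetical sketch of why the URI matters here: if a server only registers a handler for `/plaintext`, every other path (including the bare `/` that wrk requests by default) falls through to an error status, which wrk then reports as `Non-2xx or 3xx responses`. The `route` function and its table below are illustrative, not mofuw's actual API.

```python
# Hypothetical routing table: only /plaintext has a handler, so any
# other path gets a 404, which wrk counts as a non-2xx response.
def route(path: str) -> int:
    handlers = {"/plaintext": 200}
    return handlers.get(path, 404)

print(route("/plaintext"))  # 200
print(route("/"))           # 404
```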
The performance difference between your benchmark and mine is because I ran with the pipeline switch turned on. When the pipeline switch is added to wrk, mofuw's performance will be higher than ad2's.
@jangko, from what I see, the asyncdispatch and asyncdispatch2 benchmarks do not support pipelined messages. So why are you testing it?
Most of the high-performing TechEmpower benchmark participants are designed to handle pipelined messages, while this benchmark does not take that into account. While testing those frameworks, I realized their performance can vary significantly with and without pipeline mode, so I think it is important to keep this information. The final result of this benchmark will include both pipeline and no-pipeline modes for comparison, or it will become a switchable bench-bot feature. What we can do now is make them all run in single-thread mode; then we can decide what to do about pipelining.
But you can adjust the benchmark source to support pipelining for both asyncdispatch and asyncdispatch2.
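Since pipelining is central to the disagreement above, here is a minimal sketch of what wrk's pipeline mode does on the wire: the client writes several requests on one connection before reading any response, and the server answers them in order. The tiny local `http.server` stand-in and the helper name below are illustrative only, not the benchmark programs themselves.

```python
# Minimal HTTP/1.1 pipelining sketch: write n requests up front on
# one connection, then read all the responses back. The local server
# here is only a stand-in for the benchmark programs under test.
import socket
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Hello(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive is required for pipelining

    def do_GET(self):
        body = b"Hello, World!"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

def pipelined_ok_count(port: int, n: int) -> int:
    """Send n pipelined GETs, then count the 200 OK status lines."""
    req = b"GET / HTTP/1.1\r\nHost: bench\r\n\r\n"
    data = b""
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(req * n)          # all n requests before any read
        s.shutdown(socket.SHUT_WR)  # signal that no more requests follow
        while chunk := s.recv(4096):
            data += chunk
    return data.count(b"HTTP/1.1 200 OK")

server = ThreadingHTTPServer(("127.0.0.1", 0), Hello)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(pipelined_ok_count(server.server_address[1], 3))  # prints 3
```

wrk itself drives this pattern through a Lua script (for example the `pipeline.lua` sample shipped in its repository); the sketch above only illustrates the traffic shape that makes pipelining-aware servers look so much faster.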
Agreed.
ad2 is fast but then suffers a massive slowdown, hmm, interesting.
And benchmarks tend to take a while, so this will slow down every PR/build roundtrip...
That's right, it took a significant amount of time. I have already removed it from CI.
Summary
- mofuw: uses asyncdispatch, so its expected performance should not exceed asyncdispatch itself.
- asyncdispatch: although it is slower than asyncdispatch2, it handles high concurrency quite well.
- asyncdispatch2: at high concurrency it tends to slow down significantly, but surprisingly it is the only framework in this test that handles non-pipelined requests faster than the others, despite using almost identical request/response handling code to asyncdispatch.
- actix-raw: very fast when multi-threaded, not so when single-threaded.
- fasthttp: very fast when multi-threaded, not so when single-threaded.
- libreactor: still very fast even in single-thread mode.
Conclusion
- asyncdispatch2 could be a good candidate to replace asyncdispatch
- it still has room for improvement, especially when handling a high number of connections
Sorry I could not work faster because of some circumstances, but I think this one is ready for review.
Looks like the asyncdispatch2 benchmark has a broken implementation, at least on macOS: it generates ~10x responses for the same request (the provided results show a similar correlation). wrk goes crazy like this:
```
wrk -c 30 -d 15s -t 4 http://localhost:8080/
Running 15s test @ http://localhost:8080/
  4 threads and 30 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   204.63us  122.29us   3.45ms   66.65%
    Req/Sec   329.58k    22.97k  376.98k    63.91%
  19802436 requests in 15.10s, 2.56GB read
Requests/sec: 1311431.24
Transfer/sec:    173.84MB
```
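One way to confirm the "~10x responses" symptom independently of wrk is to capture the raw bytes coming back for a single request and count how many complete responses the stream contains. A simplified sketch (assumes `Content-Length` framing only, no chunked encoding; the helper name is illustrative):

```python
# Walk a captured byte stream response by response, using the
# Content-Length header to find each body's end, and count how many
# complete HTTP responses it holds. A correct server should return
# exactly one response per request sent.
def count_responses(stream: bytes) -> int:
    count = 0
    while stream:
        head, sep, rest = stream.partition(b"\r\n\r\n")
        if not sep:
            break  # incomplete headers at the tail
        length = 0
        for line in head.split(b"\r\n"):
            if line.lower().startswith(b"content-length:"):
                length = int(line.split(b":", 1)[1].decode())
        if len(rest) < length:
            break  # incomplete body at the tail
        count += 1
        stream = rest[length:]
    return count

one = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
print(count_responses(one))       # 1
print(count_responses(one * 10))  # 10 -- the buggy behavior seen above
```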
@jangko, I've pushed a commit to your branch adding a command-line option for deciding whether threads should be used. To support it, the test programs need a minor modification: they must check whether the environment variable `USE_THREADS` is set. You can see an example here:
https://github.com/status-im/nim-asyncdispatch2/pull/9/commits/4fa3b6e3c7096cc72c33264c4174abe8334c064b#diff-a700604e55a2b00d28959045bdda5b09R26
I've added this to the Rust and Go programs, but can we also add it to the rest of the examples?
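For the programs that haven't been updated yet, the check itself is tiny. A sketch in Python for illustration (the actual benchmark programs are in Nim, Rust, Go, and C; the convention assumed here, "variable set and non-empty means use threads", is an assumption, so check the linked commit for the real one):

```python
# Sketch of the USE_THREADS check: when the variable is unset or
# empty, run single-threaded; otherwise use the program's default
# worker count. The exact convention is an assumption here.
import os

def worker_count(default_threads: int) -> int:
    """How many workers the benchmark program should start."""
    return default_threads if os.environ.get("USE_THREADS") else 1
```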
The asyncdispatch2 program that you've prepared violates the rules of the competition, which are given here: https://www.techempower.com/benchmarks/#section=code
In particular, this rule:
> This test is not intended to exercise the allocation of memory or instantiation of objects. Therefore it is acceptable but not required to re-use a single buffer for the response text (Hello, World). However, the response must be fully composed from the response text and response headers within the scope of each request and it is not acceptable to store the entire payload of the response, or an unnaturally large subset of the response, headers inclusive, as a pre-rendered buffer. "Buffer" here refers to a byte array, byte buffer, character array, character buffer, string, or string-like data structure. The spirit of the test is to require the construction of the HTTP response as is typically done by a framework or platform via concatenation of strings or similar. For example, pre-rendering a buffer with `HTTP/1.1 200 OK Content-length: 15 Server: Example` would not be acceptable.
So, you must break up the strings being written as the response a bit. I think you can avoid some of the allocations and concatenations as well; @cheatfate may provide some hints on the most efficient way to build the response piece by piece.
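For example, a rule-compliant response could be composed per request roughly like this (a Python sketch; the header values and the `build_response` helper are illustrative, not taken from any of the benchmark programs):

```python
# Compose the response from separate pieces within the scope of each
# request, as the TechEmpower rule quoted above requires. Only the
# body buffer is reused across requests; the full response bytes are
# never stored pre-rendered.
from email.utils import formatdate

BODY = b"Hello, World!"  # re-using this single buffer alone is allowed

def build_response(body: bytes) -> bytes:
    parts = [
        b"HTTP/1.1 200 OK",
        b"Server: example",
        b"Content-Type: text/plain",
        b"Content-Length: " + str(len(body)).encode(),
        b"Date: " + formatdate(usegmt=True).encode(),
    ]
    return b"\r\n".join(parts) + b"\r\n\r\n" + body
```

The per-request cost is the joins and the `Date` formatting; avoiding extra allocations (e.g. writing the pieces into one reused output buffer instead of concatenating) is where @cheatfate's hints would come in.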