
[WIP] plaintext benchmark

Open jangko opened this issue 5 years ago • 13 comments

  • [x] finishing bench bot implementation
  • [x] add participant: rust actix
  • [x] add participant: go-lang fasthttp
  • [x] add participant: c libreactor
  • [x] polishing benchmark report
  • [x] completing thread/nothread stuff
  • [ ] rewrite response section code
  • [x] add nodocker script

jangko avatar Aug 22 '18 15:08 jangko

I'm sorry @jangko, but there is no reason to benchmark multi-threaded apps against single-threaded ones. Could you please limit the number of processes/threads used by each chosen framework?

cheatfate avatar Aug 27 '18 18:08 cheatfate

Could you please limit the number of processes/threads used by each chosen framework?

Sure, I will try to make the benchmark as fair as possible for each participant.

jangko avatar Aug 28 '18 01:08 jangko

From my tests on a VM with only 2 processors available, mofuw is not so performant, and it also produces errors rather than successful responses:

Running 10s test @ http://127.0.0.1:34500
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.02ms    4.57ms  61.20ms   97.04%
    Req/Sec    17.78k     6.53k   32.87k    59.20%
  355475 requests in 10.10s, 44.75MB read
  Non-2xx or 3xx responses: 355475
Requests/sec:  35197.01
Transfer/sec:      4.43MB
./wrk http://127.0.0.1:34500  1.02s user 4.18s system 51% cpu 10.105 total
cheatfate@phantom ~/wrk (git)-[master] % ./wrk http://127.0.0.1:34500
Running 10s test @ http://127.0.0.1:34500
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.12ms   14.53ms 151.12ms   95.69%
    Req/Sec    17.17k     2.89k   20.92k    82.67%
  345100 requests in 10.10s, 43.44MB read
  Non-2xx or 3xx responses: 345100
Requests/sec:  34168.76
Transfer/sec:      4.30MB
./wrk http://127.0.0.1:34500  1.13s user 4.13s system 51% cpu 10.110 total
cheatfate@phantom ~/wrk (git)-[master] % ./wrk http://127.0.0.1:34500
Running 10s test @ http://127.0.0.1:34500
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.55ms   22.19ms 220.08ms   95.90%
    Req/Sec    16.85k     2.76k   31.98k    82.59%
  337011 requests in 10.10s, 42.42MB read
  Non-2xx or 3xx responses: 337011
Requests/sec:  33366.97
Transfer/sec:      4.20MB
./wrk http://127.0.0.1:34500  1.23s user 3.78s system 49% cpu 10.145 total

cheatfate avatar Aug 28 '18 09:08 cheatfate

While on the same VM asyncdispatch2 benchmark produces such output:

cheatfate@phantom ~/wrk (git)-[master] % ./wrk http://127.0.0.1:8885
Running 10s test @ http://127.0.0.1:8885
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   208.75us  184.01us  10.27ms   98.49%
    Req/Sec    24.09k     3.90k   28.10k    77.23%
  484088 requests in 10.10s, 24.93MB read
Requests/sec:  47929.24
Transfer/sec:      2.47MB
./wrk http://127.0.0.1:8885  1.06s user 5.18s system 61% cpu 10.104 total
cheatfate@phantom ~/wrk (git)-[master] % ./wrk http://127.0.0.1:8885
Running 10s test @ http://127.0.0.1:8885
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   182.87us  107.18us   6.93ms   97.22%
    Req/Sec    26.52k     1.07k   29.06k    59.00%
  527788 requests in 10.01s, 27.18MB read
Requests/sec:  52746.72
Transfer/sec:      2.72MB
./wrk http://127.0.0.1:8885  1.48s user 5.84s system 73% cpu 10.009 total
cheatfate@phantom ~/wrk (git)-[master] % ./wrk http://127.0.0.1:8885
Running 10s test @ http://127.0.0.1:8885
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   218.96us  240.94us  10.37ms   98.65%
    Req/Sec    23.38k     4.37k   28.07k    70.79%
  469746 requests in 10.10s, 24.19MB read
Requests/sec:  46510.70
Transfer/sec:      2.40MB
./wrk http://127.0.0.1:8885  1.05s user 4.96s system 59% cpu 10.103 total

As you can see, there is no "Non-2xx or 3xx responses" line here, so wrk received normal HTTP answers.

cheatfate avatar Aug 28 '18 09:08 cheatfate

mofuw needs the /plaintext URI to avoid Non-2xx or 3xx responses.

The performance difference between your benchmark and mine is because I run with the pipeline switch turned on. When pipelining is enabled in wrk, mofuw's performance will be higher than ad2's.

jangko avatar Aug 29 '18 05:08 jangko

@jangko, from what I see, the asyncdispatch and asyncdispatch2 benchmarks do not support pipelined messages. So why are you testing it?

cheatfate avatar Aug 29 '18 07:08 cheatfate

Most of the high-performing TechEmpower benchmark participants are designed to handle pipelined messages; this benchmark, on the other hand, does not take that into account. While testing those frameworks, I realized their performance can vary significantly with and without pipeline mode, so I think it is important to keep this information. The final result of this benchmark will include both pipeline and no-pipeline mode for comparison, or it will become a switchable bench-bot feature. What we can do now is make them all run in single-threaded mode; then we can decide what to do about pipelining.

jangko avatar Aug 29 '18 09:08 jangko

But you can adjust the benchmark source to support pipelining for both asyncdispatch and asyncdispatch2.

cheatfate avatar Aug 29 '18 09:08 cheatfate

But you can adjust the benchmark source to support pipelining for both asyncdispatch and asyncdispatch2.

Agreed.

jangko avatar Aug 29 '18 09:08 jangko

ad2 is fast but then suffers a massive slowdown, hmm. Interesting.

jangko avatar Sep 03 '18 05:09 jangko

and benchmarks tend to take a while, so it will slow down every PR / build round trip..

That's right, it takes a significant amount of time. I have already removed it from CI.


Summary

  • mofuw: mofuw uses asyncdispatch, so its expected performance should be no higher than asyncdispatch itself.
  • asyncdispatch: although it is slower than asyncdispatch2, it handles high concurrency quite well.
  • asyncdispatch2: at high concurrency it tends to slow down significantly, but surprisingly it is the only framework in this test that handles non-pipelined requests faster than the others, even though its request/response-handling code is almost identical to asyncdispatch's.
  • actix-raw: very fast when multi-threaded, not so fast when single-threaded.
  • fasthttp: very fast when multi-threaded, not so fast when single-threaded.
  • libreactor: still very fast even in single-threaded mode.

Conclusion

  • asyncdispatch2 could be a good candidate to replace asyncdispatch.
  • It still has room for improvement, especially when handling high connection counts.

Sorry I cannot work faster because of some circumstances, but I think this one is ready for review.

jangko avatar Sep 11 '18 12:09 jangko

Looks like the asyncdispatch2 benchmark has a broken implementation, at least on macOS: it generates ~10x responses for the same request (the provided results show a similar correlation).

wrk goes crazy like this:

wrk -c 30 -d 15s -t 4 http://localhost:8080/
Running 15s test @ http://localhost:8080/
  4 threads and 30 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   204.63us  122.29us   3.45ms   66.65%
    Req/Sec   329.58k    22.97k  376.98k    63.91%
  19802436 requests in 15.10s, 2.56GB read
Requests/sec: 1311431.24
Transfer/sec:    173.84MB

dm1try avatar Sep 15 '18 17:09 dm1try

@jangko, I've pushed to your branch a commit adding a command-line option for deciding whether threads should be used. To support it, the test programs need a minor modification - they must check whether the environment variable USE_THREADS is set. You can see an example here:

https://github.com/status-im/nim-asyncdispatch2/pull/9/commits/4fa3b6e3c7096cc72c33264c4174abe8334c064b#diff-a700604e55a2b00d28959045bdda5b09R26

I've added this to the Rust and Go programs, but can we also add it to the rest of the examples?

The asyncdispatch2 program that you've prepared is violating the rules of the competition, which are given here: https://www.techempower.com/benchmarks/#section=code

In particular, this rule:

This test is not intended to exercise the allocation of memory or instantiation of objects. Therefore it is acceptable but not required to re-use a single buffer for the response text (Hello, World). However, the response must be fully composed from the response text and response headers within the scope of each request and it is not acceptable to store the entire payload of the response, or an unnaturally large subset of the response, headers inclusive, as a pre-rendered buffer. "Buffer" here refers to a byte array, byte buffer, character array, character buffer, string, or string-like data structure. The spirit of the test is to require the construction of the HTTP response as is typically done by a framework or platform via concatenation of strings or similar. For example, pre-rendering a buffer with HTTP/1.1 200 OKContent-length: 15Server: Example would not be acceptable.

So, you must break up the strings being written as the response. I think you can avoid some of the allocations and concatenations as well; @cheatfate may provide some hints on the most efficient way to build the response piece by piece.

zah avatar Sep 21 '18 17:09 zah