
Bad performance

Open catroot opened this issue 8 years ago • 20 comments

Tried stress load with wrk... and got:

wrk -t 1 -c 1000 -d 25s http://suite:5001/app2url --latency
Running 25s test @ http://suite:5001/app2url
  1 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.41s   431.98ms   1.89s    62.33%
    Req/Sec   449.79     40.89   480.00     95.14%
  Latency Distribution
     50%    1.47s
     75%    1.86s
     90%    1.88s
     99%    1.89s
  11104 requests in 25.01s, 2.19MB read
  Socket errors: connect 0, read 0, write 0, timeout 7815
Requests/sec:    444.03
Transfer/sec:     89.76KB

Why are the results so bad?

catroot avatar Jul 01 '16 10:07 catroot

Can you post your code?

veegee avatar Jul 01 '16 11:07 veegee

Of course, there is nothing special about it.

from tg import expose, TGController, AppConfig
import guv; guv.monkey_patch()
import guv.wsgi


class RootController(TGController):
    @expose()
    def app2url(self):
        return "<h1>Hello World</h1>"


if __name__ == '__main__':
    config = AppConfig(minimal=True, root_controller=RootController())
    app = config.make_wsgi_app()
    server_sock = guv.listen(('0.0.0.0', 5001))
    guv.wsgi.serve(server_sock, app)

catroot avatar Jul 01 '16 11:07 catroot

Try putting import guv; guv.monkey_patch() at the very top, on the first line. It should be the first thing that runs so it can patch everything before anything else is imported.
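
For example, the same app from above with guv.monkey_patch() moved to the very top:

import guv; guv.monkey_patch()  # must run before anything else is imported
import guv.wsgi

from tg import expose, TGController, AppConfig


class RootController(TGController):
    @expose()
    def app2url(self):
        return "<h1>Hello World</h1>"


if __name__ == '__main__':
    config = AppConfig(minimal=True, root_controller=RootController())
    app = config.make_wsgi_app()
    server_sock = guv.listen(('0.0.0.0', 5001))
    guv.wsgi.serve(server_sock, app)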

veegee avatar Jul 01 '16 12:07 veegee

OK. No change in behaviour.

 wrk -t 1 -c 1000 -d 25s http://suite:5001/app2url --latency
Running 25s test @ http://suite:5001/app2url
  1 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.53s   443.65ms   2.00s    86.24%
    Req/Sec   351.67     45.21   383.00     96.77%
  Latency Distribution
     50%    1.62s
     75%    1.96s
     90%    1.97s
     99%    1.98s
  8723 requests in 25.10s, 1.49MB read
  Socket errors: connect 0, read 0, write 0, timeout 6426
Requests/sec:    347.52
Transfer/sec:     60.75KB

catroot avatar Jul 01 '16 13:07 catroot

I just tested:

Python 3.5 example:

import guv; guv.monkey_patch()
import guv.wsgi
import bottle

@bottle.route('/')
def index():
    return 'Hello, world!'

app = bottle.default_app()

if __name__ == '__main__':
    server_sock = guv.listen(('0.0.0.0', 8001))
    guv.wsgi.serve(server_sock, app)
[veegee@vserver] ~/tmp
➜ wrk -c 200 http://localhost:8001/
Running 10s test @ http://localhost:8001/
  2 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    39.93ms    3.33ms  94.86ms   95.89%
    Req/Sec     2.51k   207.54     2.74k    92.00%
  50006 requests in 10.04s, 6.15MB read
Requests/sec:   4982.90
Transfer/sec:    627.73KB

In comparison, gevent:

from gevent import monkey; monkey.patch_all()
import bottle
from bottle import run

@bottle.route('/')
def index():
    return 'Hello, world!'

app = bottle.default_app()

if __name__ == '__main__':
    run(host='0.0.0.0', port=8001, server='gevent')
[veegee@vserver] ~/tmp
➜ wrk -c 200 http://localhost:8001/
Running 10s test @ http://localhost:8001/
  2 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   154.70ms  282.62ms   1.93s    84.64%
    Req/Sec     2.90k     1.57k    4.38k    77.21%
  41870 requests in 10.06s, 5.15MB read
  Socket errors: connect 0, read 0, write 0, timeout 199
Requests/sec:   4163.30
Transfer/sec:    524.48KB

Gevent seems to have some trouble here as 199 requests timed out.

TurboGears2, Python 3.5

from tg import expose, TGController, AppConfig
import guv; guv.monkey_patch()
import guv.wsgi


class RootController(TGController):
    @expose()
    def index(self):
        return 'Hello, world!'


if __name__ == '__main__':
    config = AppConfig(minimal=True, root_controller=RootController())
    app = config.make_wsgi_app()
    server_sock = guv.listen(('0.0.0.0', 8001))
    guv.wsgi.serve(server_sock, app)
[veegee@vserver] ~/tmp
➜ wrk -c 200 http://localhost:8001/
Running 10s test @ http://localhost:8001/
  2 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    61.50ms    6.92ms 117.49ms   92.95%
    Req/Sec     1.63k   251.33     2.02k    67.50%
  32434 requests in 10.04s, 5.32MB read
Requests/sec:   3231.16
Transfer/sec:    542.73KB

Looks like TurboGears2 reduces throughput significantly, so if you have 1,000 concurrent requests, it's going to impact the latency. I optimized the WSGI server for low latency, but CPython is slow and there's nothing you can do about that. When pypy3 is stable, the numbers should be much better. Still, this shows that guv beats gevent in both latency and throughput.

Note that 1,000 concurrent requests is a lot; it's unreasonable for any CPython application on a single server. Even for Scala and Go, 1,000 concurrent requests is a lot. You'll very likely never see that level of sustained load in production, except maybe in very highly optimized systems.

Also TurboGears isn't the greatest web framework. If you can use Flask or Bottle, I highly recommend you go with those.

veegee avatar Jul 01 '16 13:07 veegee

OK. Here is the same test with 2 threads and 200 connections:

wrk -t 2 -c 200 -d 25s http://suite:5001/app2url --latency
Running 25s test @ http://suite:5001/app2url
  2 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   556.89ms   60.67ms   1.08s    92.06%
    Req/Sec   203.64    119.65   373.00     55.03%
  Latency Distribution
     50%  553.00ms
     75%  560.62ms
     90%  565.81ms
     99%  679.04ms
  8893 requests in 25.07s, 1.52MB read
Requests/sec:    354.80
Transfer/sec:     62.02KB

And this is the performance of gunicorn with a single eventlet worker: gunicorn test-tg2:app -b 0.0.0.0:5001 -k eventlet -w 1

wrk -t 2 -c 200 -d 25s http://suite:5001/app2url --latency
Running 25s test @ http://suite:5001/app2url
  2 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.91ms  103.81ms   1.96s    99.16%
    Req/Sec   218.35    112.49   333.00     72.70%
  Latency Distribution
     50%    3.06ms
     75%    3.12ms
     90%    3.41ms
     99%   11.81ms
  7868 requests in 25.05s, 1.71MB read
  Socket errors: connect 0, read 133, write 0, timeout 422
Requests/sec:    314.08
Transfer/sec:     69.93KB

catroot avatar Jul 01 '16 13:07 catroot

Perhaps there are some platform-specific features used in libuv. My dual-core AMD E-350 CPU works faster with eventlet for some reason.

catroot avatar Jul 01 '16 13:07 catroot

Something is very wrong if you're only getting 300 req/sec. I'm easily able to get 5,000 req/sec. What are the specs on your computer? Also, make sure you're not installing from pip; clone the repository and install that directly.

Also, can you try my bottle example above? Just do pip3 install -U --user bottle and then copy/paste the example and test it. Bottle has much lower overhead than the other frameworks so we can see if TurboGears2 is not playing nicely with guv.

I'm going to add the epoll backend back into guv, so we can try that instead of libuv. I need to update the epoll backend to support some of the newer features, so give me a day to do that.

veegee avatar Jul 01 '16 13:07 veegee

Also note that in your eventlet example, 422 requests timed out. In any production situation, if you have some requests going through and others timing out, that's not an acceptable tradeoff. If eventlet/gunicorn are doing some tricks to reduce latency and prioritize requests, that's worse than high latency overall.

veegee avatar Jul 01 '16 13:07 veegee

Here are the Bottle results for your code:

wrk -t 2 -c 200 -d 25s http://suite:5001/ --latency
Running 25s test @ http://suite:5001/
  2 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   341.16ms   33.51ms 573.45ms   91.33%
    Req/Sec   294.48    176.72   600.00     51.28%
  Latency Distribution
     50%  345.02ms
     75%  347.33ms
     90%  349.02ms
     99%  390.57ms
  14374 requests in 25.06s, 1.77MB read
Requests/sec:    573.67
Transfer/sec:     72.27KB

Of course, Bottle is much lighter than TG2, but if I increase the concurrent connection count to 1,000, the results get much worse:

  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.70s   261.15ms   1.94s    92.37%
    Req/Sec   281.36    117.98   590.00     69.85%
  Latency Distribution
     50%    1.77s
     75%    1.78s
     90%    1.81s
     99%    1.86s
  13841 requests in 25.05s, 1.70MB read
  Socket errors: connect 0, read 0, write 0, timeout 140
Requests/sec:    552.64

Note the timed-out requests.

catroot avatar Jul 01 '16 14:07 catroot

Yeah there's no way your little E-350 can handle 1,000 concurrent requests with CPython no matter what web framework/green thread library you use. Not even my big Core i7 5820k does well with 1,000 concurrent requests with CPython. I'm compiling pypy3 from source now because I'm curious how it does in this situation.

Also note that you're running in a single-process context. Normally, you'd use something like gunicorn (guv supports gunicorn very well) and set the number of worker processes to the number of processor cores you have. That will even out the load and reduce latency significantly. The test you're running is not really meaningful because you'll never see this kind of single-process load in production. Even if you do, the way to solve it is extremely simple: add another frontend server or more processing cores. At 1,000 concurrent requests, you've almost certainly hit the limits of what CPython can do on your CPU. To give you some context, I get 100,000 req/s with Scala and 350,000 req/s with Java/Undertow on the exact same computer I ran these tests on (using all processor cores, however). CPython is just slow.
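
For example, a minimal sketch of a gunicorn config file that sizes the worker count to the machine (myapp:app is a hypothetical module/variable name, and this assumes guv's gunicorn worker class is guv.GuvWorker):

# gunicorn_conf.py -- a minimal sketch; myapp:app is a hypothetical module/variable name
# Run with: gunicorn myapp:app -c gunicorn_conf.py
import multiprocessing

bind = '0.0.0.0:5001'
worker_class = 'guv.GuvWorker'          # assumes guv exposes this gunicorn worker class
workers = multiprocessing.cpu_count()   # one worker per core to spread the load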

The eventlet test is not meaningful because of the large number of socket read errors and timeouts. It's essentially measuring latency on only the subset of requests that didn't error or time out, which isn't fair. The guv WSGI server tries to handle all connections and requests fairly and tries very hard not to drop any requests.

veegee avatar Jul 01 '16 14:07 veegee

Not a problem to test with gunicorn. App code saved as test-bottle-guv.py:

import bottle

@bottle.route('/')
def index():
    return 'Hello, world!'

app = bottle.default_app()

and starting the server with gunicorn test-bottle-guv:app -b 0.0.0.0:5001 -k guv.GuvWorker -w 2 gives me:

Running 25s test @ http://suite:5001/
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.28s   343.82ms   1.93s    65.33%
    Req/Sec   330.44    180.06   717.00     62.78%
  Latency Distribution
     50%    1.36s
     75%    1.42s
     90%    1.69s
     99%    1.77s
  16217 requests in 25.08s, 2.75MB read
  Socket errors: connect 0, read 0, write 0, timeout 121
Requests/sec:    646.57
Transfer/sec:    112.40KB

And again, note the timeouts (maybe this is normal? Could requests still in flight when the test ends be counted as timeouts?)

catroot avatar Jul 01 '16 14:07 catroot

Ok I'll try putting the epoll backend back into guv (I removed it a while ago because I thought I'd never need it again). I remember it performed very well, especially on pypy3. Perhaps it'll improve the latency.

Again, I'd suggest that you drop the concurrency down to 200 in your tests. 1,000 is very unreasonable for CPython and HTTP (it may be fine for websocket stuff though).

veegee avatar Jul 01 '16 14:07 veegee

You are definitely right that decreasing the load will improve performance, but that's not what I'm looking for. See how eventlet handles 1,000 connections with gunicorn on the same Bottle app:

Running 25s test @ http://suite:5001/
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.93ms   19.94ms   1.56s    99.95%
    Req/Sec   395.07     77.05   828.00     92.40%
  Latency Distribution
     50%    2.37ms
     75%    2.57ms
     90%    3.10ms
     99%    7.09ms
  19174 requests in 25.10s, 3.26MB read
  Socket errors: connect 0, read 0, write 0, timeout 3
Requests/sec:    763.98
Transfer/sec:    132.81KB

catroot avatar Jul 01 '16 14:07 catroot

Ok yeah something is definitely wrong with the latency, I'll take a look.

veegee avatar Jul 01 '16 14:07 veegee

Thx!

catroot avatar Jul 01 '16 14:07 catroot

Is there any progress on this issue? It looks like it needs some deep digging to find out where this latency comes from.

catroot avatar Jul 06 '16 07:07 catroot

I haven't had a chance to look at it yet, but it doesn't require deep digging at all. The latency is primarily coming from the WSGI server, which is a modification of the gevent server. I'll try porting the eventlet WSGI server to guv and see what happens. If it's not that, then it can only be coming from the event loop. Guv isn't very deep, and it's pretty easy to understand if you look at the core event loop and WSGI server code.
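
For comparison while digging into this, a minimal sketch that serves the same Bottle hello-world app directly with eventlet's wsgi server (no gunicorn), so the two WSGI servers can be benchmarked head to head:

# Comparison baseline: the Bottle app from above served by eventlet's built-in WSGI server.
import eventlet
from eventlet import wsgi
import bottle

@bottle.route('/')
def index():
    return 'Hello, world!'

app = bottle.default_app()

if __name__ == '__main__':
    # Same port as the guv examples above, so the same wrk command can be reused.
    wsgi.server(eventlet.listen(('0.0.0.0', 8001)), app)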

veegee avatar Jul 06 '16 14:07 veegee

"I'll try porting the eventlet wsgi server to guv and see what happens"

Any success with it?

catroot avatar Jul 23 '16 17:07 catroot

I hope you haven't forgotten about trying another event loop backend.

catroot avatar Aug 16 '16 20:08 catroot