guv
Bad performance
Tried a stress load with wrk and got:
wrk -t 1 -c 1000 -d 25s http://suite:5001/app2url --latency
Running 25s test @ http://suite:5001/app2url
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.41s 431.98ms 1.89s 62.33%
Req/Sec 449.79 40.89 480.00 95.14%
Latency Distribution
50% 1.47s
75% 1.86s
90% 1.88s
99% 1.89s
11104 requests in 25.01s, 2.19MB read
Socket errors: connect 0, read 0, write 0, timeout 7815
Requests/sec: 444.03
Transfer/sec: 89.76KB
Why are the results so ugly?
Can you post your code?
Of course, there is nothing special:
from tg import expose, TGController, AppConfig

import guv; guv.monkey_patch()
import guv.wsgi


class RootController(TGController):
    @expose()
    def app2url(self):
        return "<h1>Hello World</h1>"


if __name__ == '__main__':
    config = AppConfig(minimal=True, root_controller=RootController())
    app = config.make_wsgi_app()
    server_sock = guv.listen(('0.0.0.0', 5001))
    guv.wsgi.serve(server_sock, app)
Try putting import guv; guv.monkey_patch()
at the very top, on the first line. It should be the first thing that runs so it can patch everything before anything else is imported.
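Concretely, the reordered script would look like this (same code as yours, only the import order changes):

import guv; guv.monkey_patch()   # must be the very first thing that runs
import guv.wsgi

from tg import expose, TGController, AppConfig


class RootController(TGController):
    @expose()
    def app2url(self):
        return "<h1>Hello World</h1>"


if __name__ == '__main__':
    config = AppConfig(minimal=True, root_controller=RootController())
    app = config.make_wsgi_app()
    server_sock = guv.listen(('0.0.0.0', 5001))
    guv.wsgi.serve(server_sock, app)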
OK, done. It made no difference to the results:
wrk -t 1 -c 1000 -d 25s http://suite:5001/app2url --latency
Running 25s test @ http://suite:5001/app2url
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.53s 443.65ms 2.00s 86.24%
Req/Sec 351.67 45.21 383.00 96.77%
Latency Distribution
50% 1.62s
75% 1.96s
90% 1.97s
99% 1.98s
8723 requests in 25.10s, 1.49MB read
Socket errors: connect 0, read 0, write 0, timeout 6426
Requests/sec: 347.52
Transfer/sec: 60.75KB
I just tested this myself.
Python 3.5, Bottle example:
import guv; guv.monkey_patch()
import guv.wsgi

import bottle


@bottle.route('/')
def index():
    return 'Hello, world!'

app = bottle.default_app()

if __name__ == '__main__':
    server_sock = guv.listen(('0.0.0.0', 8001))
    guv.wsgi.serve(server_sock, app)
[veegee@vserver] ~/tmp
➜ wrk -c 200 http://localhost:8001/
Running 10s test @ http://localhost:8001/
2 threads and 200 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 39.93ms 3.33ms 94.86ms 95.89%
Req/Sec 2.51k 207.54 2.74k 92.00%
50006 requests in 10.04s, 6.15MB read
Requests/sec: 4982.90
Transfer/sec: 627.73KB
In comparison, gevent:
from gevent import monkey; monkey.patch_all()

import bottle
from bottle import run


@bottle.route('/')
def index():
    return 'Hello, world!'

app = bottle.default_app()

if __name__ == '__main__':
    run(host='0.0.0.0', port=8001, server='gevent')
[veegee@vserver] ~/tmp
➜ wrk -c 200 http://localhost:8001/
Running 10s test @ http://localhost:8001/
2 threads and 200 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 154.70ms 282.62ms 1.93s 84.64%
Req/Sec 2.90k 1.57k 4.38k 77.21%
41870 requests in 10.06s, 5.15MB read
Socket errors: connect 0, read 0, write 0, timeout 199
Requests/sec: 4163.30
Transfer/sec: 524.48KB
Gevent seems to have some trouble here as 199 requests timed out.
TurboGears2, Python 3.5
from tg import expose, TGController, AppConfig

import guv; guv.monkey_patch()
import guv.wsgi


class RootController(TGController):
    @expose()
    def index(self):
        return 'Hello, world!'


if __name__ == '__main__':
    config = AppConfig(minimal=True, root_controller=RootController())
    app = config.make_wsgi_app()
    server_sock = guv.listen(('0.0.0.0', 8001))
    guv.wsgi.serve(server_sock, app)
[veegee@vserver] ~/tmp
➜ wrk -c 200 http://localhost:8001/
Running 10s test @ http://localhost:8001/
2 threads and 200 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 61.50ms 6.92ms 117.49ms 92.95%
Req/Sec 1.63k 251.33 2.02k 67.50%
32434 requests in 10.04s, 5.32MB read
Requests/sec: 3231.16
Transfer/sec: 542.73KB
Looks like TurboGears2 reduces throughput significantly, so at 1,000 concurrent requests it's going to impact the latency. I optimized the WSGI server for low latency, but CPython is slow and there's nothing you can do about that. When pypy3 is stable, the numbers should be much better. Still, this shows that guv beats gevent in both latency and throughput.
Note that 1,000 concurrent requests is a lot - unreasonable for any CPython application on a single server. Even for Scala and Go, 1,000 concurrent requests is a lot. You'll very likely never see that level of sustained concurrency in production, except maybe in very highly optimized systems.
Also TurboGears isn't the greatest web framework. If you can use Flask or Bottle, I highly recommend you go with those.
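For instance, the same hello-world under Flask, served the same way, would be just this (a sketch on my side, assuming Flask is installed; the port is arbitrary):

import guv; guv.monkey_patch()
import guv.wsgi

from flask import Flask

app = Flask(__name__)


@app.route('/')
def index():
    return 'Hello, world!'


if __name__ == '__main__':
    # Flask's app object is a plain WSGI callable, so guv can serve it directly
    server_sock = guv.listen(('0.0.0.0', 8001))
    guv.wsgi.serve(server_sock, app)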
OK. I ran the same test with 2 threads and 200 connections:
wrk -t 2 -c 200 -d 25s http://suite:5001/app2url --latency
Running 25s test @ http://suite:5001/app2url
2 threads and 200 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 556.89ms 60.67ms 1.08s 92.06%
Req/Sec 203.64 119.65 373.00 55.03%
Latency Distribution
50% 553.00ms
75% 560.62ms
90% 565.81ms
99% 679.04ms
8893 requests in 25.07s, 1.52MB read
Requests/sec: 354.80
Transfer/sec: 62.02KB
And this is the performance of gunicorn with a single eventlet worker:
gunicorn test-tg2:app -b 0.0.0.0:5001 -k eventlet -w 1
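(For gunicorn to import test-tg2:app, the app object has to be created at module level rather than under __main__, so the module is essentially the earlier TurboGears code, roughly this sketch:)

from tg import expose, TGController, AppConfig


class RootController(TGController):
    @expose()
    def app2url(self):
        return "<h1>Hello World</h1>"


config = AppConfig(minimal=True, root_controller=RootController())
app = config.make_wsgi_app()   # module-level, so gunicorn can import test-tg2:app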
wrk -t 2 -c 200 -d 25s http://suite:5001/app2url --latency
Running 25s test @ http://suite:5001/app2url
2 threads and 200 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 11.91ms 103.81ms 1.96s 99.16%
Req/Sec 218.35 112.49 333.00 72.70%
Latency Distribution
50% 3.06ms
75% 3.12ms
90% 3.41ms
99% 11.81ms
7868 requests in 25.05s, 1.71MB read
Socket errors: connect 0, read 133, write 0, timeout 422
Requests/sec: 314.08
Transfer/sec: 69.93KB
Perhaps there are some platform-specific features used in libuv. My dual-core AMD E-350 CPU works faster with eventlet for some reason.
Something is very wrong if you're only getting 300 req/sec. I'm easily able to get 5,000 req/sec. What are the specs on your computer? Also, make sure you're not installing from pip; clone the repository and install that directly.
Also, can you try my bottle example above? Just do pip3 install -U --user bottle
and then copy/paste the example and test it. Bottle has much lower overhead than the other frameworks so we can see if TurboGears2 is not playing nicely with guv.
I'm going to add the epoll backend back into guv, so we can try that instead of libuv. I need to update the epoll backend to support some of the newer features, so give me a day to do that.
Also note that in your eventlet example, 422 requests timed out. In any production situation, if you have some requests going through and others timing out, that's not an acceptable tradeoff. If eventlet/gunicorn are doing some tricks to reduce latency and prioritize requests, that's worse than high latency overall.
Here are the Bottle numbers for your code:
wrk -t 2 -c 200 -d 25s http://suite:5001/ --latency
Running 25s test @ http://suite:5001/
2 threads and 200 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 341.16ms 33.51ms 573.45ms 91.33%
Req/Sec 294.48 176.72 600.00 51.28%
Latency Distribution
50% 345.02ms
75% 347.33ms
90% 349.02ms
99% 390.57ms
14374 requests in 25.06s, 1.77MB read
Requests/sec: 573.67
Transfer/sec: 72.27KB
Of course, Bottle is much lighter than TG2, but if I increase the concurrent request count to 1,000, the measurements get much worse:
2 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.70s 261.15ms 1.94s 92.37%
Req/Sec 281.36 117.98 590.00 69.85%
Latency Distribution
50% 1.77s
75% 1.78s
90% 1.81s
99% 1.86s
13841 requests in 25.05s, 1.70MB read
Socket errors: connect 0, read 0, write 0, timeout 140
Requests/sec: 552.64
Note the timed-out requests.
Yeah there's no way your little E-350 can handle 1,000 concurrent requests with CPython no matter what web framework/green thread library you use. Not even my big Core i7 5820k does well with 1,000 concurrent requests with CPython. I'm compiling pypy3 from source now because I'm curious how it does in this situation.
Also note that you're running in a single-process context. Normally, you'd use something like gunicorn (guv supports gunicorn very well) and set processes to the number of processor cores you have. That will even out the load and reduce latency significantly. The test you're going after is not really meaningful because you'll never see this kind of single-process load in production. Even if you do, the way to solve it is extremely simple - just add another frontend server or add more processing cores. At 1,000 concurrent requests, you've most certainly hit the limits of what CPython can do on your CPU. To give you some context, I get 100,000 requests/s with Scala or 350,000 req/s with Java/Undertow on the exact same computer as I did the tests on (using all processor cores, however). CPython is just slow.
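For instance, gunicorn can take these settings from a Python config file; a minimal sketch (the file name, bind address, and app module here are just placeholders):

# gunicorn.conf.py
import multiprocessing

bind = '0.0.0.0:5001'
workers = multiprocessing.cpu_count()  # one worker process per core
worker_class = 'guv.GuvWorker'         # guv's gunicorn worker

# run with: gunicorn -c gunicorn.conf.py yourapp:app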
The eventlet test is not meaningful because of the large number of socket read errors and timeouts. It's essentially measuring the latency on only the subset of requests that didn't have errors or time out, which isn't fair. The guv WSGI server tries to handle all connections and requests fairly and tries very hard not to drop any requests.
Not a problem to play with gunicorn. App code saved as test-bottle-guv.py:
import bottle


@bottle.route('/')
def index():
    return 'Hello, world!'

app = bottle.default_app()
and the server started with:
gunicorn test-bottle-guv:app -b 0.0.0.0:5001 -k guv.GuvWorker -w 2
gives me:
Running 25s test @ http://suite:5001/
2 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.28s 343.82ms 1.93s 65.33%
Req/Sec 330.44 180.06 717.00 62.78%
Latency Distribution
50% 1.36s
75% 1.42s
90% 1.69s
99% 1.77s
16217 requests in 25.08s, 2.75MB read
Socket errors: connect 0, read 0, write 0, timeout 121
Requests/sec: 646.57
Transfer/sec: 112.40KB
And again, note the timeouts. (Maybe this is normal? Caused by the end of the test window, which shows up as a timeout?)
Ok I'll try putting the epoll backend back into guv (I removed it a while ago because I thought I'd never need it again). I remember it performed very well, especially on pypy3. Perhaps it'll improve the latency.
Again, I'd suggest that you drop the concurrency down to 200 in your tests. 1,000 is very unreasonable for CPython and HTTP (it may be fine for websocket stuff though).
You are definitely right, decreasing the load will improve performance, but that is not what I'm looking for. See how eventlet handles 1,000 connections with gunicorn on the same Bottle app:
Running 25s test @ http://suite:5001/
2 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.93ms 19.94ms 1.56s 99.95%
Req/Sec 395.07 77.05 828.00 92.40%
Latency Distribution
50% 2.37ms
75% 2.57ms
90% 3.10ms
99% 7.09ms
19174 requests in 25.10s, 3.26MB read
Socket errors: connect 0, read 0, write 0, timeout 3
Requests/sec: 763.98
Transfer/sec: 132.81KB
Ok yeah something is definitely wrong with the latency, I'll take a look.
Thx!
Is there any progress with this issue? Looks like it needs very deep digging to find out where this latency comes from.
I didn't get a chance to look at it yet, but it doesn't require deep digging at all. The latency is primarily coming from the WSGI server, which is a modification of the gevent server. I'll try porting the eventlet WSGI server to guv and see what happens. If it's not that, then it can only be coming from the event loop. Guv isn't very deep and is pretty easy to understand if you look at the core event loop and WSGI server code.
"I'll try porting the eventlet wsgi server to guv and see what happens"
Any success with that? Hope you didn't forget about trying another event loop.