Throughput drops as time passes
I can't explain why: I was load testing a local server with oha and throughput progressively dropped as time passed. I thought it was a bug in my code, so I spent a few hours trying to find it, until I realized that when I swapped oha for hey, the throughput no longer dropped.
You can see how throughput drops along with CPU usage, while with hey it does not:
The declining CPU usage on the left is oha's load test, which progressively falls from 176k req/s to 800 req/s over 2 minutes. The steady CPU usage after that is the same load test with the same parameters, but using hey. Throughput does not drop with hey.
The commands are:

```
$ hey -c 1000 -m POST -z 10m http://localhost:4000/v1/challenges/5ea2a70b-c085-428b-a4d5-0df47c110900
$ oha -c 1000 -m POST -z 10m http://localhost:4000/v1/challenges/5ea2a70b-c085-428b-a4d5-0df47c110900
```
When the test has just started, it looks like this:
After 50 seconds, it gets to this:
At a minute and 15 seconds, it looks like this:
Note how the average request takes 200 microseconds to complete. With 1000 connections and no rate limits, that should be 5000 requests per second per connection in an ideal world. Throughput doesn't even get close here: it's roughly 7 requests per second per connection, but at 200 microseconds per request, where does the missing time go?
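A back-of-envelope check of those numbers (just a sketch using the figures above, nothing oha-specific):

```rust
fn main() {
    // Numbers taken from the screenshots above.
    let avg_latency_s = 0.000_2; // 200 µs average response time
    let connections = 1_000.0;

    // Ideally each connection completes 1 / latency requests per second.
    let ideal_per_conn = 1.0 / avg_latency_s; // 5_000 req/s per connection
    let ideal_total = ideal_per_conn * connections; // 5_000_000 req/s overall

    // Observed: roughly 7 req/s per connection, i.e. each connection is
    // busy for only 7 * 200 µs = 1.4 ms out of every second.
    let busy_fraction = 7.0 * avg_latency_s;

    println!("ideal: {ideal_total} req/s ({ideal_per_conn} per connection)");
    println!("each connection busy {:.2}% of the time", busy_fraction * 100.0);
}
```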
oha's CPU usage is at 2%, my server's CPU usage is at 5.5%, the system's CPU usage is at 17%, neither my server nor oha has rate limits, the system has 4.91 GB of RAM available, and network I/O is at 30-40 MiB/s when the network has ~1.5 GiB/s of capacity.
oha's output after the 10-minute load test:
```
Summary:
  Success rate: 100.00%
  Total:        600.0204 sec
  Slowest:      1.2930 sec
  Fastest:      0.0000 sec
  Average:      0.0030 sec
  Requests/sec: 15085.9999

  Total data:   3.35 GiB
  Size/request: 397 B
  Size/sec:     5.71 MiB

Response time histogram:
  0.000 sec [1]       |
  0.129 sec [9049907] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.259 sec [0]       |
  0.388 sec [1]       |
  0.517 sec [0]       |
  0.646 sec [0]       |
  0.776 sec [0]       |
  0.905 sec [0]       |
  1.034 sec [0]       |
  1.164 sec [0]       |
  1.293 sec [1000]    |

Response time distribution:
  10.00% in 0.0002 sec
  25.00% in 0.0003 sec
  50.00% in 0.0022 sec
  75.00% in 0.0045 sec
  90.00% in 0.0067 sec
  95.00% in 0.0082 sec
  99.00% in 0.0108 sec
  99.90% in 0.0153 sec
  99.99% in 1.2015 sec

Details (average, fastest, slowest):
  DNS+dialup: 0.0316 sec, 0.0001 sec, 0.0952 sec
  DNS-lookup: 0.0000 sec, 0.0000 sec, 0.0070 sec

Status code distribution:
  [200] 9050909 responses

Error distribution:
  [999] aborted due to deadline
```
hey's output after the 10-minute load test:
```
Summary:
  Total:        534.2394 secs
  Slowest:      0.3340 secs
  Fastest:      0.0001 secs
  Average:      0.3290 secs
  Requests/sec: 86074.3175

  Total data:   18252024978 bytes
  Size/request: 18252 bytes

Response time histogram:
  0.000 [1]      |
  0.033 [997119] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.067 [1789]   |
  0.100 [89]     |
  0.134 [2]      |
  0.167 [0]      |
  0.200 [1]      |
  0.234 [79]     |
  0.267 [578]    |
  0.301 [341]    |
  0.334 [1]      |

Latency distribution:
  10% in 0.0025 secs
  25% in 0.0039 secs
  50% in 0.0059 secs
  75% in 0.0091 secs
  90% in 0.0136 secs
  95% in 0.0170 secs
  99% in 0.0252 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0000 secs, 0.0252 secs
  DNS-lookup: 0.0011 secs, 0.0000 secs, 0.2526 secs
  req write:  0.0009 secs, 0.0000 secs, 0.0708 secs
  resp wait:  0.2595 secs, 0.0001 secs, 0.2372 secs
  resp read:  0.0446 secs, 0.0000 secs, 0.0705 secs

Status code distribution:
  [200] 1000000 responses

Error distribution:
  [9417] Post "http://localhost:4000/v1/challenges/5ea2a70b-c085-428b-a4d5-0df47c110900": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
Maybe related to https://github.com/hatoo/oha/issues/475#issuecomment-2084627504
What happens if you add the `--no-tui` option?
Thanks for your quick reply! `--no-tui` is even worse.
With the TUI on the left, without it on the right. RAM oscillated between 930-950 MB with the TUI; without it, RAM was at 610 MB right after the drop when I took the screenshot, but checking again now it's at 1.0 GB.
Could you tell me your OS and specs?
Arch Linux, kernel 6.12.52-1-lts, 13th Gen i7-1355U, 15.1 GB RAM available (16 GB total).
I just took a really rough look at oha's code; it seems everything is stored in an in-memory SQLite DB? If that's so, I wouldn't be surprised by throughput dropping drastically after a few tens of millions of requests.
> everything is stored in an in-memory SQLite DB?
No, a `Vec` of `RequestResult`s is stored:
https://github.com/hatoo/oha/blob/2167ab4646026b9524e9ae407d3778427a243081/src/client.rs#L42-L59
SQLite is only used for the final output.
> RAM oscillated between 930-950 MB with the TUI
I think that's strange, because oha's RAM consumption should only ever increase.
Maybe the memory was swapped out? That hypothesis would also explain why CPU utilization was dropping.
I have swap disabled, so that wouldn't be it. My uninformed guess is storage, but I don't know oha's internals. How is request data stored while the load test runs?
Request and response timings are stored in a dynamic array.
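In other words, something along these lines (a simplified sketch; the field names are illustrative rather than oha's actual struct, which is in the client.rs link above):

```rust
use std::time::Instant;

// Illustrative only -- see the linked client.rs for the real RequestResult.
// The key point: one record is pushed per completed request, so memory
// grows linearly with the request count for the whole duration of the test.
struct RequestResult {
    start: Instant,
    end: Instant,
    status: u16,
    bytes: usize,
}

fn main() {
    let mut results: Vec<RequestResult> = Vec::new();

    // The oha run above completed ~9 million requests; at a few dozen
    // bytes per record that is already hundreds of MB resident, plus
    // reallocation churn whenever the Vec doubles its capacity.
    for _ in 0..9_000_000u32 {
        let start = Instant::now();
        // ... send request, read response ...
        results.push(RequestResult {
            start,
            end: Instant::now(),
            status: 200,
            bytes: 397,
        });
    }

    println!("stored {} results", results.len());
}
```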
I couldn't reproduce such a dramatic throughput drop on my side. Could you tell me your server's language (including version) and framework?
The server is in Rust too, using axum. It happens with a hello-world server as well (sketch below), at a smaller scale, but I still have to check whether the drop shows up in CPU usage there too.
I tried to check with a longer load test, but I can't: things start getting killed before I get a chance to run it for more than two minutes.
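For reference, the kind of hello-world server I mean looks like this (a sketch, not my actual server; crate versions are assumptions):

```rust
// Cargo.toml (assumed versions):
//   axum = "0.7"
//   tokio = { version = "1", features = ["full"] }
use axum::{routing::post, Router};

#[tokio::main]
async fn main() {
    // Accept POSTs on the same path the oha/hey commands above hit.
    let app = Router::new().route("/v1/challenges/:id", post(|| async { "hello world" }));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:4000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```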
Could you test with the modified version of oha at #817? You can download binaries from https://github.com/hatoo/oha/actions/runs/19035083308?pr=817
It reduces memory consumption. If this version sustains throughput for longer, we can confirm that the root cause is likely memory consumption.
Sure! I'll check it out in a few hours when I get to my laptop.
I agree, that's exactly what happened to me. I didn't see the fight for CPU that that issue hypothesizes, though: both oha's and the server's CPU usage dropped as the throughput went down.
~I tried the binary from #817, but it doesn't even start now.~
The issue is still there: throughput drops as time passes.