
Throughput drops as time passes

Open Nekidev opened this issue 3 months ago • 15 comments

I cannot explain why, but while load testing a local server with oha, throughput progressively dropped as time passed. I thought it was a bug in my code, so I spent a few hours trying to find it, until I realized that when I switched from oha to hey, the throughput suddenly stopped dropping.

You can see how throughput drops along with CPU usage under oha, while with hey it does not:

Image

The declining CPU usage on the left is oha's load test, which falls progressively from 176k req/s to 800 req/s over 2 minutes. The steady CPU usage after that is the same load test with the same parameters, but run with hey. Throughput does not drop with hey.

The commands are

$ hey -c 1000 -m POST -z 10m http://localhost:4000/v1/challenges/5ea2a70b-c085-428b-a4d5-0df47c110900
$ oha -c 1000 -m POST -z 10m http://localhost:4000/v1/challenges/5ea2a70b-c085-428b-a4d5-0df47c110900

When the test just started it looks like this:

Image

Time passes, and after 50 seconds it gets to this:

Image

Wait until a minute and 15 seconds, and it looks like this:

Image

Note how the average request takes 200 microseconds to complete. With 1000 connections and no rate limits, that should be 5000 requests per second per connection in an ideal world. Throughput doesn't even get close here; it's roughly 7 requests per second per connection, but at 200 microseconds each, where is the missing time going?

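For reference, here is a back-of-the-envelope check of that gap (a hypothetical calculation using the approximate numbers above, not output from either tool):

```rust
fn main() {
    // Approximate numbers from the screenshots above.
    let avg_latency_s: f64 = 200e-6; // ~200 microseconds per request
    let connections: f64 = 1000.0;
    let observed_total_rps = 7.0 * connections; // ~7 req/s per connection

    // If each connection issued requests back to back, the ideal ceiling is:
    let ideal_per_conn = 1.0 / avg_latency_s;       // 5,000 req/s per connection
    let ideal_total = ideal_per_conn * connections; // 5,000,000 req/s total

    println!("ideal: {ideal_total:.0} req/s, observed: {observed_total_rps:.0} req/s");
    println!("fraction of ideal: {:.3}%", 100.0 * observed_total_rps / ideal_total);
}
```
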
Oha's CPU usage is at 2%, my server's is at 5.5%, and the whole system's is at 17%. Neither my server nor oha has rate limits, the system has 4.91 GB of RAM available, and network IO is at 30-40 MiB/s while the network has ~1.5 GiB/s of capacity.

Oha's output after 10 minutes of load test:

Summary:
  Success rate: 100.00%
  Total:        600.0204 sec
  Slowest:      1.2930 sec
  Fastest:      0.0000 sec
  Average:      0.0030 sec
  Requests/sec: 15085.9999

  Total data:   3.35 GiB
  Size/request: 397 B
  Size/sec:     5.71 MiB

Response time histogram:
  0.000 sec [1]       |
  0.129 sec [9049907] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.259 sec [0]       |
  0.388 sec [1]       |
  0.517 sec [0]       |
  0.646 sec [0]       |
  0.776 sec [0]       |
  0.905 sec [0]       |
  1.034 sec [0]       |
  1.164 sec [0]       |
  1.293 sec [1000]    |

Response time distribution:
  10.00% in 0.0002 sec
  25.00% in 0.0003 sec
  50.00% in 0.0022 sec
  75.00% in 0.0045 sec
  90.00% in 0.0067 sec
  95.00% in 0.0082 sec
  99.00% in 0.0108 sec
  99.90% in 0.0153 sec
  99.99% in 1.2015 sec


Details (average, fastest, slowest):
  DNS+dialup:   0.0316 sec, 0.0001 sec, 0.0952 sec
  DNS-lookup:   0.0000 sec, 0.0000 sec, 0.0070 sec

Status code distribution:
  [200] 9050909 responses

Error distribution:
  [999] aborted due to deadline

Hey's output after 10 minutes of load test:

Summary:
  Total:        534.2394 secs
  Slowest:      0.3340 secs
  Fastest:      0.0001 secs
  Average:      0.3290 secs
  Requests/sec: 86074.3175
  
  Total data:   18252024978 bytes
  Size/request: 18252 bytes

Response time histogram:
  0.000 [1]     |
  0.033 [997119]        |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.067 [1789]  |
  0.100 [89]    |
  0.134 [2]     |
  0.167 [0]     |
  0.200 [1]     |
  0.234 [79]    |
  0.267 [578]   |
  0.301 [341]   |
  0.334 [1]     |


Latency distribution:
  10% in 0.0025 secs
  25% in 0.0039 secs
  50% in 0.0059 secs
  75% in 0.0091 secs
  90% in 0.0136 secs
  95% in 0.0170 secs
  99% in 0.0252 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0000 secs, 0.0252 secs
  DNS-lookup:   0.0011 secs, 0.0000 secs, 0.2526 secs
  req write:    0.0009 secs, 0.0000 secs, 0.0708 secs
  resp wait:    0.2595 secs, 0.0001 secs, 0.2372 secs
  resp read:    0.0446 secs, 0.0000 secs, 0.0705 secs

Status code distribution:
  [200] 1000000 responses

Error distribution:
  [9417]        Post "http://localhost:4000/v1/challenges/5ea2a70b-c085-428b-a4d5-0df47c110900": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Nekidev avatar Oct 18 '25 03:10 Nekidev

Maybe related to https://github.com/hatoo/oha/issues/475#issuecomment-2084627504

What if you add the --no-tui option?

hatoo avatar Oct 18 '25 03:10 hatoo

Thanks for your quick reply! --no-tui is even worse.

Image

With the TUI on the left, without the TUI on the right. RAM was oscillating between 930-950 MB with the TUI. It was at 610 MB after the drop when I took the screenshot, but I checked again just now and it's at 1.0 GB.

Nekidev avatar Oct 18 '25 03:10 Nekidev

Could you tell me your OS and specs?

hatoo avatar Oct 18 '25 04:10 hatoo

Arch Linux, kernel 6.12.52-1-lts, 13th Gen i7-1355U, 15.1 GB RAM (16 GB total)

I just did a really rough pass over oha's code; it seems that everything is stored in an in-memory SQLite DB? If that's so, I wouldn't be surprised by throughput dropping drastically after a few tens of millions of requests.

Nekidev avatar Oct 18 '25 04:10 Nekidev

everything is stored in an SQLite in-memory DB?

No, a Vec of RequestResult is stored.

https://github.com/hatoo/oha/blob/2167ab4646026b9524e9ae407d3778427a243081/src/client.rs#L42-L59

SQLite is only used for the final output.
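
For illustration only, here is a minimal sketch of that pattern (made-up field names, not oha's actual struct, which is in the linked client.rs): every finished request appends one record to a Vec, so memory grows linearly with the total request count.

```rust
use std::time::{Duration, Instant};

// A simplified stand-in for a per-request record; the fields here are
// assumptions for illustration only.
struct RequestRecord {
    start: Instant,
    duration: Duration,
    status: u16,
    len_bytes: usize,
}

fn main() {
    let mut results: Vec<RequestRecord> = Vec::new();

    // Pretend each iteration is one completed request (roughly as many as
    // the 10-minute run above produced).
    for _ in 0..9_050_909u32 {
        results.push(RequestRecord {
            start: Instant::now(),
            duration: Duration::from_micros(200),
            status: 200,
            len_bytes: 397,
        });
    }

    // The Vec is only appended to while the test runs, so memory grows
    // linearly with the total number of requests.
    let approx_bytes = results.len() * std::mem::size_of::<RequestRecord>();
    println!("records: {}, approx heap use: {} MiB",
             results.len(), approx_bytes / (1024 * 1024));
}
```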

hatoo avatar Oct 18 '25 05:10 hatoo

RAM was oscillating between 930-950 MB with the TUI

I think that's strange, because oha's RAM consumption should only increase. Maybe the memory was being swapped? That hypothesis would also explain why CPU utilization was going down.

hatoo avatar Oct 19 '25 04:10 hatoo

I have swap disabled, so that wouldn't be it. My uninformed guess is storage, but I don't know about oha's internals. How is request data stored as the load test occurs?

Nekidev avatar Oct 19 '25 08:10 Nekidev

Request and response timings are stored in a dynamic array (a Vec).
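
To connect that to the memory numbers reported above, here is a rough estimate; the per-record size is a guess for illustration, not something measured from oha:

```rust
fn main() {
    // Request count from the oha summary above.
    let requests: u64 = 9_050_909;

    // Guessed per-record cost (the struct plus any owned heap data). This is
    // an assumption for illustration, not oha's measured footprint.
    let bytes_per_record: u64 = 100;

    let total_bytes = requests * bytes_per_record;
    println!("~{} MiB just for per-request records", total_bytes / (1024 * 1024));
    // Prints ~863 MiB, the same ballpark as the ~1 GB of RAM reported above.
}
```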

I couldn't reproduce such a dramatic throughput drop on my side. Could you tell me your server's language (including version) and framework?

hatoo avatar Oct 25 '25 06:10 hatoo

The server is in Rust too, using axum. It happens with a hello-world server as well, at a smaller scale, but I still have to check whether the throughput drop shows up in CPU usage too.

Nekidev avatar Oct 25 '25 20:10 Nekidev

I tried to check with a longer load test, but I can't: things start getting killed before I get a chance to test for more than two minutes.

Nekidev avatar Oct 25 '25 21:10 Nekidev

Could you test with the modified version of oha at #817? You can download binaries from https://github.com/hatoo/oha/actions/runs/19035083308?pr=817

It reduces memory consumption. If this version keeps throughput up for longer, we can confirm that the root cause is likely memory consumption.

hatoo avatar Nov 04 '25 06:11 hatoo

Sure! I'll check it out in a few hours when I get to my laptop.

Nekidev avatar Nov 04 '25 15:11 Nekidev

Maybe related to #475 (comment)

I think it's even better explained by this issue: #364, isn't it?

quintesse avatar Nov 10 '25 10:11 quintesse

I agree, that's exactly what happened to me. I didn't see the fight over CPU usage that that issue hypothesizes, though. Both oha's and the server's CPU usage dropped as the throughput went down.

Nekidev avatar Nov 10 '25 15:11 Nekidev

~I tried the binary of #817, but it doesn't even start now.~

The issue is still there: throughput drops as time passes.

Nekidev avatar Nov 12 '25 17:11 Nekidev