
Issues around max QPS

Open · howardjohn opened this issue on Nov 23, 2020 · 3 comments

I would like to send requests at the maximum RPS possible. Many other tools (hey: default, fortio: -qps 0) support this mode.
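
For comparison, this is roughly how I would invoke those tools to run unthrottled (flag spellings from memory, so treat this as a sketch):

$ fortio load -qps 0 -c 64 -t 10s http://127.0.0.1:8080   # -qps 0 means "go as fast as possible"
$ hey -z 10s -c 64 http://127.0.0.1:8080                  # unthrottled by default (no -q rate limit)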

In nighthawk, --rps is restricted: an out-of-range value yields "value must be inside range [" '\x01' ", " '\U000f4240' "] (yay for horrible error messages from envoy; this means 1 to 1 million).

If I request 1M RPS, I get an actual rate of about 10k RPS. If I request 50k RPS, I get the full 50k RPS. This indicates nighthawk is being throttled unnecessarily, since the server can clearly handle more load.

Logs for the 1M RPS run

$ nighthawk --rps 1000000 --concurrency 1 --duration 10 --connections 64 --prefetch-connections --max-active-requests 1000 http://127.0.0.1:8080
[12:25:26.352783][854812][I] Starting 1 threads / event loops. Time limit: 10 seconds.
[12:25:26.352823][854812][I] Global targets: 64 connections and 1000000 calls per second.
[12:25:26.353462][854812][W] Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
[12:25:36.902843][854818][I] Stopping after 10000 ms. Initiated: 103381 / Completed: 103380. (Completion rate was 10337.99689860093 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.

benchmark_http_client.latency_2xx (103317 samples)
  min: 0s 000ms 034us | mean: 0s 000ms 087us | max: 0s 005ms 736us | pstdev: 0s 000ms 041us

  Percentile  Count       Value
  0.5         51659       0s 000ms 084us
  0.75        77488       0s 000ms 093us
  0.8         82654       0s 000ms 096us
  0.9         92988       0s 000ms 104us
  0.95        98152       0s 000ms 115us
  0.990625    102349      0s 000ms 163us
  0.99902344  103217      0s 000ms 594us

Queueing and connection setup latency (103318 samples)
  min: 0s 000ms 001us | mean: 0s 000ms 003us | max: 0s 025ms 735us | pstdev: 0s 000ms 080us

  Percentile  Count       Value
  0.5         51675       0s 000ms 003us
  0.75        77510       0s 000ms 004us
  0.8         82656       0s 000ms 004us
  0.9         92997       0s 000ms 005us
  0.95        98156       0s 000ms 005us
  0.990625    102350      0s 000ms 011us
  0.99902344  103218      0s 000ms 016us

Request start to response end (103317 samples)
  min: 0s 000ms 034us | mean: 0s 000ms 087us | max: 0s 005ms 735us | pstdev: 0s 000ms 041us

  Percentile  Count       Value
  0.5         51669       0s 000ms 083us
  0.75        77491       0s 000ms 093us
  0.8         82659       0s 000ms 096us
  0.9         92987       0s 000ms 104us
  0.95        98153       0s 000ms 115us
  0.990625    102349      0s 000ms 162us
  0.99902344  103217      0s 000ms 593us

Response body size in bytes (103317 samples)
  min: 0 | mean: 0.0 | max: 0 | pstdev: 0.0

Response header size in bytes (103317 samples)
  min: 58 | mean: 58.0 | max: 58 | pstdev: 0.0

Blocking. Results are skewed when significant numbers are reported here. (103318 samples)
  min: 0s 000ms 018us | mean: 0s 000ms 096us | max: 0s 005ms 778us | pstdev: 0s 000ms 042us

  Percentile  Count       Value
  0.5         51659       0s 000ms 092us
  0.75        77490       0s 000ms 104us
  0.8         82656       0s 000ms 107us
  0.9         92989       0s 000ms 116us
  0.95        98154       0s 000ms 129us
  0.990625    102350      0s 000ms 180us
  0.99902344  103218      0s 000ms 607us

Initiation to completion (103380 samples)
  min: 0s 000ms 037us | mean: 0s 000ms 108us | max: 0s 026ms 479us | pstdev: 0s 000ms 635us

  Percentile  Count       Value
  0.5         51694       0s 000ms 089us
  0.75        77539       0s 000ms 099us
  0.8         82712       0s 000ms 103us
  0.9         93042       0s 000ms 112us
  0.95        98211       0s 000ms 124us
  0.990625    102411      0s 000ms 176us
  0.99902344  103280      0s 000ms 773us

Counter                                 Value       Per second
benchmark.http_2xx                      103317      10331.70
benchmark.pool_overflow                 63          6.30
cluster_manager.cluster_added           1           0.10
default.total_match_count               1           0.10
membership_change                       1           0.10
runtime.load_success                    1           0.10
runtime.override_dir_not_exists         1           0.10
upstream_cx_http1_total                 64          6.40
upstream_cx_rx_bytes_total              7748775     774877.27
upstream_cx_total                       64          6.40
upstream_cx_tx_bytes_total              6095762     609576.02
upstream_rq_pending_overflow            63          6.30
upstream_rq_pending_total               1           0.10
upstream_rq_total                       103318      10331.80

Logs for the 50k RPS run

[12:26:36.753033][855204][I] Starting 1 threads / event loops. Time limit: 10 seconds.
[12:26:36.753072][855204][I] Global targets: 64 connections and 50000 calls per second.
[12:26:36.753695][855204][W] Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
[12:26:47.303113][855210][I] Stopping after 10000 ms. Initiated: 499999 / Completed: 499996. (Completion rate was 49999.475001312494 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.

benchmark_http_client.latency_2xx (499965 samples)
  min: 0s 000ms 026us | mean: 0s 000ms 169us | max: 0s 007ms 123us | pstdev: 0s 000ms 180us

  Percentile  Count       Value
  0.5         249986      0s 000ms 086us
  0.75        374976      0s 000ms 230us
  0.8         399980      0s 000ms 282us
  0.9         449975      0s 000ms 410us
  0.95        474967      0s 000ms 495us
  0.990625    495278      0s 000ms 759us
  0.99902344  499477      0s 001ms 405us

Queueing and connection setup latency (499968 samples)
  min: 0s 000ms 000us | mean: 0s 000ms 001us | max: 0s 016ms 741us | pstdev: 0s 000ms 023us

  Percentile  Count       Value
  0.5         250156      0s 000ms 001us
  0.75        375423      0s 000ms 002us
  0.8         400613      0s 000ms 002us
  0.9         450102      0s 000ms 002us
  0.95        474994      0s 000ms 002us
  0.990625    495281      0s 000ms 003us
  0.99902344  499480      0s 000ms 008us

Request start to response end (499965 samples)
  min: 0s 000ms 026us | mean: 0s 000ms 169us | max: 0s 007ms 122us | pstdev: 0s 000ms 180us

  Percentile  Count       Value
  0.5         249992      0s 000ms 086us
  0.75        374976      0s 000ms 230us
  0.8         399975      0s 000ms 281us
  0.9         449973      0s 000ms 410us
  0.95        474968      0s 000ms 495us
  0.990625    495278      0s 000ms 759us
  0.99902344  499477      0s 001ms 405us

Response body size in bytes (499965 samples)
  min: 0 | mean: 0.0 | max: 0 | pstdev: 0.0

Response header size in bytes (499965 samples)
  min: 58 | mean: 58.0 | max: 58 | pstdev: 0.0

Blocking. Results are skewed when significant numbers are reported here. (4370 samples)
  min: 0s 000ms 014us | mean: 0s 000ms 264us | max: 0s 005ms 615us | pstdev: 0s 000ms 152us

  Percentile  Count       Value
  0.5         2185        0s 000ms 253us
  0.75        3278        0s 000ms 350us
  0.8         3496        0s 000ms 357us
  0.9         3933        0s 000ms 373us
  0.95        4152        0s 000ms 386us
  0.990625    4330        0s 000ms 472us
  0.99902344  4366        0s 001ms 388us

Initiation to completion (499996 samples)
  min: 0s 000ms 029us | mean: 0s 000ms 178us | max: 0s 017ms 698us | pstdev: 0s 000ms 187us

  Percentile  Count       Value
  0.5         250004      0s 000ms 091us
  0.75        374999      0s 000ms 243us
  0.8         400002      0s 000ms 298us
  0.9         450000      0s 000ms 435us
  0.95        474998      0s 000ms 518us
  0.990625    495309      0s 000ms 771us
  0.99902344  499508      0s 001ms 424us

Counter                                 Value       Per second
benchmark.http_2xx                      499965      49996.37
benchmark.pool_overflow                 31          3.10
cluster_manager.cluster_added           1           0.10
default.total_match_count               1           0.10
membership_change                       1           0.10
runtime.load_success                    1           0.10
runtime.override_dir_not_exists         1           0.10
upstream_cx_http1_total                 64          6.40
upstream_cx_rx_bytes_total              37497375    3749727.91
upstream_cx_total                       64          6.40
upstream_cx_tx_bytes_total              29498112    2949803.65
upstream_rq_pending_overflow            31          3.10
upstream_rq_pending_total               2           0.20
upstream_rq_total                       499968      49996.67

[12:26:47.424905][855204][I] Done.

Both of these runs used nighthawk --rps ... --concurrency 1 --duration 10 --connections 64 --prefetch-connections, against a fortio server on the same machine.

Based on this, the main thing I notice is pool_overflow=63 for the 1M RPS run, but pool_overflow=31 for the 50k RPS run.

I was a bit surprised that the 50k QPS run showed pool_overflow=31 at all, despite everything else seeming fine.

It would be great if some combination of these were addressed:

  • Add an option to request "max RPS"
  • Degrade gracefully when requested RPS > server/client max RPS.
    • I would expect the invariant that, barring test variance, actual RPS either increases or stays constant as the requested RPS increases. Today, past some breaking point, the actual QPS starts decreasing.
  • If this is expected due to some pool_overflow, add some documentation on what settings can be tuned to ensure we do not overflow. I didn't see much here, but perhaps I missed something.

Results of varying QPS settings:

DEST                   TOOL       QPS      CONS  DUR  SUCCESS  THROUGHPUT   P50     P90     P99
http://127.0.0.1:8080  nighthawk  10000    64    10   99968    9996.89qps   0.08ms  0.12ms  0.34ms
http://127.0.0.1:8080  nighthawk  50000    64    10   499966   49996.70qps  0.09ms  0.42ms  0.77ms
http://127.0.0.1:8080  nighthawk  75000    64    10   687395   68740.33qps  0.30ms  0.41ms  1.03ms
http://127.0.0.1:8080  nighthawk  100000   64    10   690584   69058.48qps  0.30ms  0.42ms  1.03ms
http://127.0.0.1:8080  nighthawk  200000   64    10   688416   68843.29qps  0.30ms  0.42ms  1.03ms
http://127.0.0.1:8080  nighthawk  500000   64    10   694161   69417.19qps  0.30ms  0.40ms  1.02ms
http://127.0.0.1:8080  nighthawk  750000   64    10   103356   10335.69qps  0.08ms  0.10ms  0.16ms
http://127.0.0.1:8080  nighthawk  1000000  64    10   103350   10335.09qps  0.08ms  0.10ms  0.16ms
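
For reference, a sweep like the one above can be reproduced with a simple loop over --rps values (illustrative sketch reusing the flags from the runs above; the latency percentiles come from the full report, not the grep):

$ for rps in 10000 50000 75000 100000 200000 500000 750000 1000000; do
    nighthawk --rps "$rps" --concurrency 1 --duration 10 --connections 64 \
      --prefetch-connections http://127.0.0.1:8080 | grep benchmark.http_2xx
  done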

howardjohn commented Nov 23 '20

With --open-loop I would expect more overflows, but in closed-loop mode we instead report the time blocked on resource shortages (pool queue size / connections) in a histogram. I would expect that histogram, which indicates how long NH is blocked, to rise when going from 50k to 1M RPS, and that does seem to be happening in the examples above.

I think throughput degrades when QPS is set extremely high because NH ends up spending most of its time processing failed attempts, to the point where this significantly interferes with i/o processing on the event loop.

For a dedicated max-QPS feature: I'm thinking that maybe we want to specialise there and have a separate flag. Instead of letting NH set the pace, we would let inbound responses directly trigger NH to issue a new request. I suspect this would emulate some of the other tooling, and it would also be compatible with configurations that influence pacing, like --jitter. This should approach the actual max QPS a server can handle, though in some cases a slightly higher QPS could perhaps be obtained, since in this flow we are not always putting the server to work on a new request while it is sending us a reply.

oschaaf commented Nov 23 '20

@oschaaf is there any way to avoid getting blocked? It seems odd that max throughput decays when the requested rate gets too high. For example, at 10331.70 RPS across 64 connections we only need about 161 RPS per connection. That leaves roughly 6ms to send each request, which is far longer than the reported latencies.
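
Spelling that math out:

  10331.70 RPS / 64 connections ≈ 161 RPS per connection
  1000 ms / 161 ≈ 6.2 ms of budget per request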

BTW, this is just something I personally stumbled upon; I don't think Istio does any throughput testing.

howardjohn commented Nov 23 '20

Circling back to this, one useful trick may be to increase the queue depth on the client by tweaking --max-pending-requests.

That allows some headroom with respect to back-pressure, as the system gets an opportunity to queue up requests when there are no free connections/streams to put them into. At the first opportunity window, which arises when a reply comes in and triggers an i/o event, a queued request is taken from the queue and assigned to a stream/connection. Effectively this gives the request-release pacing some room to ramp up and try to maintain the configured pace.
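
For example, taking the 1M RPS invocation from above and adding a deeper pending queue (the value 1000 is just an illustration, the right number depends on the setup):

$ nighthawk --rps 1000000 --concurrency 1 --duration 10 --connections 64 \
    --prefetch-connections --max-active-requests 1000 --max-pending-requests 1000 \
    http://127.0.0.1:8080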

oschaaf commented Mar 20 '21