hyper
Performance degradation in release mode compared to debug
Version
h2 v0.3.13, hyper v0.14.19 (full set of features enabled)
Platform
Darwin ip-192-168-128-190.eu-west-1.compute.internal 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000 arm64
Description
Trying a naive piece of code for benchmarking HTTP performance:
```rust
use std::convert::Infallible;

use hyper::{
    server::conn::AddrStream,
    service::{make_service_fn, service_fn},
    Body, Request, Response, Server,
};

#[tokio::main]
async fn main() {
    let app = make_service_fn(move |_stream: &AddrStream| async move {
        Ok::<_, Infallible>(service_fn(hello_world))
    });

    Server::bind(&([127, 0, 0, 1], 8888).into())
        .serve(app)
        .await
        .unwrap();
}

async fn hello_world(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::builder()
        .body(Body::from("Hello World!\n"))
        .unwrap())
}
```
Running it and benchmarking it with hey (`hey http://127.0.0.1:8888`). All running on Rust 1.61.0, on a MacBook Pro with M1 Max (16").
What I get is quite surprising: in debug mode, the results are as follows:
```
Summary:
  Total:        0.0090 secs
  Slowest:      0.0050 secs
  Fastest:      0.0010 secs
  Average:      0.0020 secs
  Requests/sec: 22206.2880

  Total data:   2600 bytes
  Size/request: 13 bytes

Response time histogram:
  0.001 [1]   |
  0.001 [137] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.002 [12]  |■■■■
  0.002 [0]   |
  0.003 [0]   |
  0.003 [0]   |
  0.003 [0]   |
  0.004 [4]   |■
  0.004 [27]  |■■■■■■■■
  0.005 [17]  |■■■■■
  0.005 [2]   |■

Latency distribution:
  10% in 0.0011 secs
  25% in 0.0012 secs
  50% in 0.0013 secs
  75% in 0.0037 secs
  90% in 0.0042 secs
  95% in 0.0043 secs
  99% in 0.0047 secs

Details (average, fastest, slowest):
  DNS+dialup:  0.0007 secs, 0.0010 secs, 0.0050 secs
  DNS-lookup:  0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:   0.0000 secs, 0.0000 secs, 0.0000 secs
  resp wait:   0.0013 secs, 0.0010 secs, 0.0025 secs
  resp read:   0.0000 secs, 0.0000 secs, 0.0000 secs

Status code distribution:
  [200] 200 responses
```
While in release mode (`cargo run --release`), I get this:
```
Summary:
  Total:        0.0122 secs
  Slowest:      0.0072 secs
  Fastest:      0.0013 secs
  Average:      0.0025 secs
  Requests/sec: 16395.8496

  Total data:   2600 bytes
  Size/request: 13 bytes

Response time histogram:
  0.001 [1]   |
  0.002 [149] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.002 [0]   |
  0.003 [0]   |
  0.004 [0]   |
  0.004 [0]   |
  0.005 [25]  |■■■■■■■
  0.005 [7]   |■■
  0.006 [7]   |■■
  0.007 [5]   |■
  0.007 [6]   |■■

Latency distribution:
  10% in 0.0014 secs
  25% in 0.0016 secs
  50% in 0.0017 secs
  75% in 0.0043 secs
  90% in 0.0053 secs
  95% in 0.0064 secs
  99% in 0.0072 secs

Details (average, fastest, slowest):
  DNS+dialup:  0.0009 secs, 0.0013 secs, 0.0072 secs
  DNS-lookup:  0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:   0.0000 secs, 0.0000 secs, 0.0001 secs
  resp wait:   0.0016 secs, 0.0013 secs, 0.0028 secs
  resp read:   0.0000 secs, 0.0000 secs, 0.0000 secs

Status code distribution:
  [200] 200 responses
```
As you can see, the average, slowest, AND fastest request times are all slower in release, as is the total time the benchmark took to run... I did not expect release performance to merely match debug, let alone be worse...
Any help figuring this out would be greatly appreciated!
I have a suspicion that this might be a macOS bug, as nextest has seen a very similar bug reported: https://github.com/nextest-rs/nextest/issues/262.
While it is most definitely a Mac-specific bug (or M1-specific at least), I honestly doubt it is related to the nextest bug. The nextest bug seems more centered around process creation and directory listing, which hyper should not be doing on any platform :-) The example above is purely network-related, so no process creation or file I/O should be present here...
EDIT: file I/O is not directly related. The nextest bug is because nextest spawns a lot of processes, which requires approval from macOS's security validation. My example is just a single process, so this is not relevant in my case
This is very surprising. I've never seen the performance get worse in release mode. I don't have a Mac, so I can't really dig in, but it seems unlikely it's because of hyper.
Could it be the rustc version combined with your macOS version?
It reproduces across multiple versions of rustc, and I can't get it to reproduce with plain computational code (e.g. building and looking up in hashmaps, loops, and numeric operations); in all those cases, release performs significantly better than debug.
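For illustration, here is a minimal compute-bound check along those lines (a sketch, not the exact code I ran) where release reliably wins:

```rust
use std::collections::HashMap;
use std::time::Instant;

// Compute-bound sanity check: build a hashmap, then hammer it with
// lookups and numeric operations. Release mode wins by a wide margin.
fn main() {
    let start = Instant::now();
    let map: HashMap<u64, u64> = (0..1_000_000u64).map(|i| (i, i * 2)).collect();
    let mut sum: u64 = 0;
    for i in 0..1_000_000u64 {
        sum = sum.wrapping_add(*map.get(&i).unwrap());
    }
    // Print the sum so the work cannot be optimized away.
    println!("sum = {sum}, elapsed = {:?}", start.elapsed());
}
```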
I might be able to take a stab at trying it with plain tokio tcp handling, seeing if it reproduces then...
I found that changing tokio::main to use the current_thread flavor causes the performance to be better in release than debug, so I'm guessing the problem is scheduling or contention here...
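Concretely, that change is just the runtime flavor attribute; the rest of the server stays the same:

```rust
// Run the same server on tokio's single-threaded runtime instead of
// the default multi-threaded work-stealing scheduler.
#[tokio::main(flavor = "current_thread")]
async fn main() {
    // ... server setup identical to the original example ...
}
```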
In any case this doesn't reproduce with a plain tokio socket server imitating HTTP...
👋
I'd love to unpack this a bit!
Please sound off with objections to how I interpreted the issues, my hypotheses, testing methodologies, or result analysis. I am very open to feedback :heart:
Issues
Hyper performance degrades when compiling with the default release profile
Conversation
> Running it and benchmarking it with hey (`hey http://127.0.0.1:8888`). All running on Rust 1.61.0, on a MacBook Pro with M1 Max (16").
>
> ...
>
> As you can see, average, slowest, AND fastest request times are slower in release, as well as the total amount of time the benchmark took to run... I did not expect release performance to be the same as debug, not to mention worse...
Hypothesis
Rustc optimizations improve hyper performance. However, application-external aspects of the full request lifecycle introduce enough variance to result in comparable performance across compiler profiles when measured as a whole. This can be exaggerated with small sample sets and powerful hardware.
Research
What is the appropriate way to describe the part of the work in a TCP socket transaction that remains unaffected (or rather, just as variable) regardless of the compiler profile being used?

> Applications that wait for work, like web servers with pools of threads waiting on a socket, present a challenge for off-CPU analysis: often most of the blocking time will be in stacks waiting for work, rather than doing work.[^1]
"off-CPU" seems like a strong candidate for what my theory suggests.
- Notice how "Network I/O" is categorized in this way.
My hypothesis is evolving to something more like, 'Rustc cannot make aspects of your machine, outside the application code, more efficient or consistent.'
hey's relevant, default configurations:
| arg | desc |
|---|---|
| -n | Number of requests to run. Default is 200. |
| -c | Number of workers to run concurrently. Total number of requests cannot be smaller than the concurrency level. Default is 50. |
| -disable-keepalive | Disable keep-alive, prevents re-use of TCP connections between different HTTP requests. default is enabled |
| -cpus | Number of used cpu cores. default for current machine is 8 cores |
My concerns about sample size are increasing, doubly so looking at the degree of concurrency.
To isolate the tool in testing, I will also be using ApacheBench with an equivalent command:
```
ab -k -n 200 -c 50 -q http://127.0.0.1:8888/ | grep 'Time taken'
```
Secondary Questions
- Hyper performance degradation is specific to macOS/Apple silicon
  > While it is most definitely a mac-specific bug (or M1-specific at least)
- Tokio's multi-thread runtime can negatively impact hyper's performance in release mode
  > I found that changing tokio::main to use the current_thread flavor causes the performance to be better in release than debug, so I'm guessing the problem is scheduling or contention here...
- A tokio TCP server equipped with HTTP/1.1 responses doesn't degrade in release mode (a minimal sketch of such a server follows this list)
  > In any case this doesn't reproduce with a plain tokio socket server imitating HTTP...
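As referenced above, here is a minimal sketch of such a tokio server (my own illustration, assuming a canned HTTP/1.1 response with no request parsing; the actual tested code is linked from the results table below):

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;

// A bare tokio TCP server imitating HTTP/1.1: read whatever the client
// sends and answer every read with a canned "Hello World!" response.
// Naive on purpose: it assumes each read carries exactly one request.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8888").await?;
    loop {
        let (mut stream, _addr) = listener.accept().await?;
        tokio::spawn(async move {
            let mut buf = [0u8; 1024];
            while let Ok(n) = stream.read(&mut buf).await {
                if n == 0 {
                    break; // connection closed by the client
                }
                let response =
                    b"HTTP/1.1 200 OK\r\nContent-Length: 13\r\n\r\nHello World!\n";
                if stream.write_all(response).await.is_err() {
                    break;
                }
            }
        });
    }
}
```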
Revised Hypothesis
A socket server implemented with any combination of OS, language, or framework will experience a degree of I/O variance. On a powerful computer, that variance can overpower the executing code's performance gains from compiler optimizations.
Testing
To isolate each component of this issue I have tested with different combinations of benchmarking tools, profiling tools, operating systems, and server implementations.
To flip this issue on its head, I also put together a wildcard test case with every component (Mac, Rust, hyper) changed, to look for a similar result.[^2]
| server | client command | MacOS (ARM) | Linux (container) | Windows 10 |
|---|---|---|---|---|
| hyper debug | `hey http://localhost:8888` | 13.0 ms | | |
| hyper release | `hey http://localhost:8888` | 14.2 ms | | |
| hyper debug | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 9.8 ms | 8.6 ms | 11.2 ms |
| hyper release | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 10.6 ms | 7 ms | 11.6 ms |
| hyper debug | `ab -k -n 50000 -c 50 -q http://127.0.0.1:8888/` | 401.8 ms | 286.6 ms | 838 ms |
| hyper release | `ab -k -n 50000 -c 50 -q http://127.0.0.1:8888/` | 358.2 ms | 230.4 ms | 582.8 ms |
| hyper debug current_thread | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 12.4 ms | | |
| hyper release current_thread | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 10.4 ms | | |
| tokio debug | `hey http://localhost:8888` | 13.2 ms | | |
| tokio release | `hey http://localhost:8888` | 12.2 ms | | |
| tokio debug | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 11.2 ms | 6.4 ms | |
| tokio release | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 11.5 ms | 6 ms | |
| dotnet debug | `ab -k -n 200 -c 50 -q http://127.0.0.1:5098/helloworld/` | 40.8 ms | | |
| dotnet release | `ab -k -n 200 -c 50 -q http://127.0.0.1:5098/helloworld/` | 41 ms | | |
- server titles link to the code
- results are the mean time-to-complete of 5 consecutive runs
- empty = N/A
Let's go over each issue in turn and compare it with the data.

- Hyper performance degrades in release: On all tested platforms there is a possibility for a release build to underperform debug with a small sample set. By shifting any of the present variables (handler processing, degree of concurrency, or number of requests) you can see release consistently outperform debug by a large margin (see the sketch after this list).
- Hyper issues are specific to Mac: See the Linux and Windows tests. The reported issue is present on all platforms tested with the supplied benchmarks.
- `#[tokio::main(flavor = "current_thread")]` doesn't have the same issue: In the included results this appears to be true. However, in larger tests I found that limiting tokio's runtime hurt overall performance with an adequately sized stress test. I believe the reasoning in point 1 holds true for this phenomenon.
- A tokio TCP server doesn't have the same problems as hyper: My tokio tests found that the same performance swings are present.
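To illustrate the first point, here is a sketch (my own illustrative variation, not code from the report) that adds per-request handler processing so compiler optimizations have something to work with:

```rust
// Variation of the original handler with CPU-bound work per request.
// In debug builds this loop dominates the request time; in release it
// is heavily optimized, so release consistently wins by a large margin.
async fn hello_world(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    let mut acc: u64 = 0;
    for i in 0..1_000_000u64 {
        acc = acc.wrapping_mul(31).wrapping_add(i);
    }
    Ok(Response::builder()
        // Echo the result in a header so the loop cannot be optimized away.
        .header("x-work", acc.to_string())
        .body(Body::from("Hello World!\n"))
        .unwrap())
}
```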
Tracing
Next, I used cargo-flamegraph to take a closer look at the executing code and identify abnormal behavior on macOS between debug and release. The following graphs describe the code executed while handling `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/`.
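(On macOS, DTrace requires root, so the invocation was presumably along the lines of `cargo flamegraph --root`; the exact flag is an assumption.)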
These graphs represent data produced with DTrace; the interactive SVGs support searching.
Debug: (flamegraph)
Release: (flamegraph)
Everything seems in order, but flamegraphs are a beginning, not an end.
> Time spent on the CPU is not the full picture, as time is spent waiting for IO to complete as well, which does not get accounted with tools like perf that only measure what's consuming time on the CPU. Check out Brendan Gregg's article on Off-Cpu Accounting for more information about this![^3]
Let's take that next step by attaching tracing instruments to our hyper server! (Integration Reference | supplemental reading)
```rust
use tracing::instrument;
use tracing_subscriber::fmt::format::FmtSpan;

// In main(), before starting the server:
tracing_subscriber::fmt()
    .with_span_events(FmtSpan::FULL)
    .with_target(false)
    .with_level(false)
    .init();

#[instrument(skip_all)]
async fn hello_world(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::builder()
        .body(Body::from("Hello World!\n"))
        .unwrap())
}
```
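With `FmtSpan::FULL`, the subscriber logs an event at each span's creation, enter, exit, and close; the close events report busy versus idle time, separating time spent executing the handler from time spent waiting.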
Our server received a huge performance boost from compiling in release mode.
Conclusion
I believe hyper is in the clear. My testing did not demonstrate any profile-related performance shortcomings from hyper's codebase on Mac.
[^1]: Brendan Gregg, "Off-CPU Analysis": https://www.brendangregg.com/offcpuanalysis.html
[^2]: C# is a compiled (and then interpreted) language with compiler profiles that claim to produce more efficient code in release mode.
[^3]: https://github.com/flamegraph-rs/flamegraph#flamegraphs-are-the-beginning-not-the-end
Excellent write-up. I think we can close up here.