
Why is (concurrent) HTTP client request/response handling faster in Go than Rust?

Open · Jaltaire opened this issue 1 year ago

I posted this on reddit earlier, but wanted to mention it on GitHub as well just to make sure it gets the right eyes on it --

I'm seeing a curious result when attempting to implement seemingly equivalent HTTP client request/response handling in Rust and Go...

Consider the following configuration:

  • A server returns large, gzip-compressed HTML pages (~100KB compressed) for any endpoint https://127.0.0.1:3000/k, where k is an unsigned 32-bit integer.
  • A variety of clients (hyper- and reqwest-based in Rust, net/http-based in Go), each configured to concurrently request n pages from the server (using tokio::spawn in Rust, goroutines in Go), where 0 <= k < n, and to gzip-decompress each response body.

(For convenience, I uploaded a full repository containing this configuration here, including axum-based server code with self-signed for-dev-only certificates, and the client implementations as described.)

Regardless of the client configuration I have attempted, I find that the Go client implementation consistently beats any Rust client implementation in throughput/speed, and the throughput disparity scales as the number of pages requested increases. For n = 1000, on my machine (M1 Max MacBook Pro, running macOS Sequoia 15.0, with Rust stable 1.82.0), Go runs in ~7.5s, hyper ~8.3s, reqwest ~8.4s (just behind hyper). Of course, there is some variance here, but the overall theme is that Go is around ~1s faster on average for n = 1000. As I note, this difference scales with n. If we change to n=4000, for example, Go runs in ~30.5s, hyper ~35s, reqwest ~36s. Hence, there isn't some constant difference in performance; the performance difference between the two languages becomes exacerbated as n increases.

With that in mind, I'm raising this issue to ask whether anyone has hypotheses as to why this might be, or better yet, a solution, if something is missing from the Rust configuration that would allow the Rust implementation to reach parity with (or surpass) the Go implementation. My layman's knowledge of the two languages suggests that Rust should be able to perform at least on par with Go. A couple of questions/hypotheses of my own, plus some sourced via reddit comments:

  • Are there possible inefficiencies with the hyper or hyper-backed reqwest client implementations (where Go has a "better" implementation for handling/processing responses)? => This comment suggests there is some known issue with a "global lock" in hyper that is causing efficiency issues (though I was unable to find a GitHub Issue discussing this).
  • Is the gzip decompression algorithm more efficiently implemented in Go? => I tried switching to zlib-ng for decompression in the hyper client implementation (per this suggestion), though it didn't seem to offer more than modest performance gains (if any).
  • Do goroutines offer some performance benefit here compared to Rust's async/await or tokio::spawn?
  • Is there some (presumably default) client configuration in Go that is available for hyper/reqwest and is not being leveraged?
  • Would using an alternative global allocator more closely mimic Go's memory-pooling scheme, and thereby lead to similar performance benefits? => My testing suggests this has little to no effect.
  • Is printing to stdout faster/more efficient in Go? => My testing suggests this is not the case (or at least not the main source of a bottleneck).

Note that there are also some configuration changes we can make that improve Rust's performance (e.g., using HTTP/2 over HTTP/1.1) and should offer equivalent benefits in Go, but this is out of scope for the problem identified; Rust and Go implementations are making equivalent requests (both use HTTP/1.1, for example), but Go executes the full program in less time.

As I noted in my reddit comment here, I very much agree with others that the next reasonable step (barring some clear solution) is to profile both Go and Rust implementations in order to identify differences and bottlenecks. As I explain, however, I do not have the bandwidth to continue exploring this myself, but wanted to raise the issue in case (a) there is some easy performance win (e.g., a non-default hyper or reqwest client builder option that should be enabled in most scenarios) and/or some other prior research/investigation on this topic or (b) some true performance bottleneck with hyper/reqwest (or Rust typical concurrency patterns) that has not yet been identified and merits investigation.

Jaltaire avatar Oct 25 '24 23:10 Jaltaire

I try not to involve myself much in these kinds of comparisons. I know they come from wanting to confirm something is faster, but I've helped too many production environments scale massively using Rust to want to spend time identifying what is wrong with a micro benchmark.

The one thing that stuck out as obvious on a quick look was that in Golang, the body is discarded, whereas in Rust it's being copied into a single flat buffer, which can mean wasted copies and reallocs, if the body is sufficiently large.

seanmonstar avatar Oct 25 '24 23:10 seanmonstar

@seanmonstar, thank you for weighing in here.

I very much understand your point about micro benchmarking; my concern was more that I observed this pattern more broadly across Go versus Rust web scraping (e.g., where page body size varies), so figured there might be something there. Granted, this is still from a relatively limited amount of testing (i.e., I have not been web scraping with reqwest for many months), so perhaps with more assessment it will turn out that differences are negligible or due to some non-hyper/reqwest limitation.

With regards to handling the response body, my goal here was simply to ensure that (a) the body was read (and in the case of the hyper client, also manually gzip-decompressed to match internal reqwest and Go implementations) and (b) the request/response was completed such that the session could be reused. Is there a better way to achieve this with hyper/reqwest in the case that the body can be discarded (or am I mistaken that reading the body is necessary to enabling re-use of the session)?

Jaltaire avatar Oct 26 '24 10:10 Jaltaire

Yes, you do need to consume it, just like in Golang. But in Golang, it's copied to Discard (which does no actual copying), whereas in Rust, it's copied into a Vec. You could do something like:

while let Some(_) = resp.chunk().await? {
}

Otherwise, Rust is doing a bunch of extra copies and reallocations as it grows the vec, in the call to Response::bytes(), compared to Golang.

seanmonstar avatar Oct 26 '24 11:10 seanmonstar

@seanmonstar, I tried draining the response body via .chunk() as you suggest, but I'm not seeing any meaningful performance difference.

I did, however, stumble upon a somewhat unexpected result... I'm finding that if I avoid draining the body in Rust:

 // let _body = response.bytes().await?;

and equivalently in Go:

// _, err = io.Copy(io.Discard, resp.Body)

then with these changes, I get equivalent performance between Rust and Go. And surprisingly, with lower total execution time compared to draining the response body as we were doing previously. These two observations hold for n up to at least 10,000 (highest I tested), which I find notable given I was seeing very clear Rust/Go performance differences (i.e., Go's total execution time was multiple seconds faster) as low as n = 4,000 previously.

I'm finding this result surprising because I suspect these changes leak the connection resource and prevent the connection from being re-used. I should note, however, that in Go, I retain the call to resp.Body.Close(); if we remove it, we can only scrape precisely 246 pages (presumably hitting a file descriptor limit?) before getting read: connection reset by peer errors.

Perhaps I'm missing something here, but these are my top-of-mind questions:

  1. Why would these changes result in better performance if we are indeed leaking connections?
  2. Since these changes yield symmetric Rust/Go performance, does this speak to some difference in how the response body is being handled internally with hyper/reqwest in Rust versus http in Go? e.g.,
    • Why does removing resp.Body.Close() seem to create a clear problem, and yet we don't encounter that same limitation with hyper/reqwest? (Maybe hyper/reqwest perform a similar "body.close()" internally via a destructor or destructor-style mechanism that I'm not seeing.)
    • By removing the copying of the response body and obtaining equivalent cross-language performance, does this signify some inefficiency with how we are copying/streaming the body in Rust compared to Go?

Thanks again Sean for the convo and your expertise. I know you have myriad projects under your belt to tend to so I don't want to waste your time if you don't think there's a takeaway from any of this. Just let me know if and when you'd prefer I close the issue and I'm happy to explore this more on my own when the time allows. :)

Jaltaire avatar Oct 28 '24 11:10 Jaltaire

A difference here is that the Go sample (as it is in the repo linked from your reddit post) does not appear to be doing gzip decompression. Per the documentation:

        // DisableCompression, if true, prevents the Transport from
        // requesting compression with an "Accept-Encoding: gzip"
        // request header when the Request contains no existing
        // Accept-Encoding value. If the Transport requests gzip on
        // its own and gets a gzipped response, it's transparently
        // decoded in the Response.Body. However, if the user
        // explicitly requested gzip it is not automatically
        // uncompressed.

By specifying the Accept-Encoding: gzip header explicitly (l.59 in the Go client sample), the last line of this documentation comes into effect and the data is not automatically decompressed. This can be seen by simply dumping the content of the returned body in the Go example: it is not decoded. Similarly, a profile of the run shows no time spent in flate-related code. [profile screenshot]

Removing the explicit add of the header, the server still reports accept-encoding: gzip (the Transport has automatically added it), but now the automatic decoding does take place. Inspecting the body shows decoded content, and we see flate in the flamegraph. [flamegraph screenshot]

With this change, the clients have similar behavior on my machine (Rust was actually slightly faster when also doing the chunk() modification).

pfernie avatar Oct 28 '24 14:10 pfernie

@pfernie, thank you for a more careful reading of the Go documentation. :) I updated the http-client-compare repo with these changes.

This definitely narrows the gap between reqwest_client and go_client performance, though I still see that Go is faster than Rust on average. When n = 1,000, it is frankly hard to observe a performance difference. Increasing n, however, makes this more visible. On my machine, if we set n = 4,000, then go_client is ~1 second faster. If we set n = 10,000, then go_client is ~2.5-3 seconds faster. (I changed the default n to 4,000 in the repo.)

Hence, there still seems to be some non-constant performance difference between the two implementations (whether that is baked into hyper/reqwest, the Rust concurrency model, or something else entirely).

Jaltaire avatar Oct 30 '24 08:10 Jaltaire