
jemalloc benchmarks

Open fwsGonzo opened this issue 3 years ago • 6 comments

Hey, I recently benchmarked using jemalloc on my NUMA servers, and the difference was quite large. Using the system allocator I got 710k req/s, while with jemalloc I got 830k req/s. That's 17% faster just by dynamically linking with a library.

Anyone else tried this?

fwsGonzo avatar Nov 25 '20 08:11 fwsGonzo

Hello and welcome! I have no experience as NUMA is a pretty niche architecture and I don’t have access to such hardware, but we certainly would welcome a write-up if you have the time and are willing to share your experience.

rbugajewski avatar Nov 25 '20 08:11 rbugajewski

We have two servers connected with a dual-port 200 Gbit Mellanox interface. One is a dual-socket Intel Xeon Gold with 26 cores per socket, the other a dual-socket AMD EPYC with 48 cores per socket. So with hyper-threading/SMT you see 104 logical CPUs on the Intel machine and 192 on the AMD machine.

Both servers are NUMA machines: each socket has its own memory bank, locally attached to that CPU but still globally accessible from the other socket.

https://gist.github.com/fwsGonzo/dc5706ad8002211bad7dd122cbd20e16

I've added two benchmarks with jemalloc disabled for comparison. In both cases the performance was lower. jemalloc was added by just linking to the system installed one on CentOS 8. I also updated the main post with the benchmark for jemalloc disabled. I was going from memory last time, but it was clearly not right. Still, jemalloc provided a significant gain.
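A quick sanity check for whether jemalloc actually ended up in the process is to probe for one of its public symbols at runtime. This is a minimal sketch of my own, assuming the distro build exports the unprefixed API (some builds use a je_ prefix instead):

// Detect at runtime whether jemalloc is the active allocator by looking
// for one of its public symbols. Build with: g++ check_jemalloc.cpp -ldl
#include <dlfcn.h>
#include <cstdio>

int main()
{
    void *sym = dlsym(RTLD_DEFAULT, "mallctl");      // unprefixed build
    if (!sym)
        sym = dlsym(RTLD_DEFAULT, "je_mallctl");     // je_-prefixed build
    std::printf(sym ? "jemalloc appears to be loaded\n"
                    : "jemalloc not found, using the default allocator\n");
    return 0;
}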

The benchmarks are done by homing in on the highest stable req/s, which usually means 1-2 ms average latency. If you push the load too far, one or more CPU cores start to struggle and the results show extreme spikes. I call that the breaking point, and I've been using it as a reference for a while now. wrk here is wrk2 compiled from source. It's not the most scientific approach, but it's enough to observe a difference between jemalloc and (I'm assuming) glibc's malloc.

The current bottleneck is most likely the kernel-side TCP acceptors. I think it may be possible to get even more speed just by creating more listeners on the same port using SO_REUSEPORT, but we will see. Another potential benefit is being able to pin a listener to a CPU and keep the packet on-CPU, and possibly in-cache, all the way. Regardless, the worker threads are under-utilized while the listeners look like they are always overloaded.
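For context, the mechanism I mean is SO_REUSEPORT: several sockets can bind the same address and port, and the kernel distributes incoming connections across them. A bare-bones sketch with plain sockets, just to illustrate the idea (not Drogon's internals):

// Two independent listening sockets on the same port; the kernel
// load-balances new connections between them. Illustration only.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

static int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0)
    {
        std::perror("bind/listen");
        close(fd);
        return -1;
    }
    return fd;
}

int main()
{
    // Normally each socket would be owned by its own accept/IO thread,
    // ideally pinned to a CPU close to the NIC queue feeding it.
    int a = make_listener(8080);
    int b = make_listener(8080);
    std::printf("listeners on fds %d and %d\n", a, b);
    close(a);
    close(b);
}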

fwsGonzo avatar Nov 29 '20 18:11 fwsGonzo

@fwsGonzo Thanks for the write-up. That's a detailed and very informative summary. Would it be OK with you if I reused your text and included it in the official documentation?

@an-tao As this is performance related, and there are also measurable improvements >10%, I think it would be good information for people who want to further improve Drogon’s performance on their specific bare metal. Or do you think this shouldn’t be part of the docs, because it’s not primarily framework related? What do you think?

rbugajewski avatar Nov 29 '20 18:11 rbugajewski

I made a feeble attempt at allowing the same port to be bound twice within Drogon, but I have a suspicion that threads are somehow tied to ports as a key, because this is failing:

trantor::EventLoop::isInLoopThread (this=0x7ffff4731d88) at ../drogon/trantor/trantor/net/EventLoop.h:105
105	        return threadId_ == std::this_thread::get_id();

I ended up just running Drogon twice, and I saw no performance gains at all. So I have no idea what those 4 threads that seem to be the bottleneck are. Since I'm running two instances of Drogon, and I assume Linux round-robins connections between them because they listen on the same port, I can only conclude that something else is the bottleneck right now.

Image of the 4 threads that bottleneck the system: https://cloud.nwcs.no/index.php/s/NT7spdpRybY2Ha9

fwsGonzo avatar Nov 30 '20 11:11 fwsGonzo

@fwsGonzo Thanks so much for sharing your benchmark details. Drogon enables the SO_REUSEPORT option on Linux, which means every IO thread in Drogon listens on the same port, so you don't need to run multiple processes for performance. What did you set the threads_num option to in the configuration file?

Actually, I used the mimalloc library in the TFB benchmark and it helped performance a lot, so I'm very interested in comparing the effect of mimalloc and jemalloc on performance.

@rbugajewski I agree with you. Performance is always a consideration for Drogon users, and it's a good idea to add this content to the Drogon documentation.
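For reference, the same settings can also be made in code instead of config.json. A minimal sketch (check the documentation for your Drogon version):

// Equivalent setup in code; with SO_REUSEPORT on Linux every IO thread
// gets its own listening socket on this port.
#include <drogon/drogon.h>

int main()
{
    drogon::app()
        .addListener("0.0.0.0", 8080)
        .setThreadNum(0)   // 0 => one IO thread per logical CPU
        .run();
}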

an-tao avatar Nov 30 '20 12:11 an-tao

We have a new Intel server, and this is a repeatable synthetic benchmark:

$ ./wrk -c 520 -t 520 -d 15s http://192.168.0.10:8080/
Running 15s test @ http://192.168.0.10:8080/
  520 threads and 520 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   133.59us  129.44us  16.21ms   96.06%
    Req/Sec     7.75k     1.60k   18.59k    75.09%
  60776112 requests in 15.10s, 12.11GB read
Requests/sec: 4024610.36
Transfer/sec:    821.37MB

The URL returns a simple string:

$ curl -D - http://192.168.1.10:8080/
HTTP/1.1 200 OK
Content-Length: 76
Content-Type: text/html; charset=utf-8
Server: drogon/1.1.0
Date: Thu, 26 Aug 2021 13:05:31 GMT

	this is a very
	long string if I had the
	energy to type more and more ...

It's using all the CPUs and the kernel is working very hard! And yes, that is indeed 4M req/s. I was not using jemalloc at the time. With jemalloc enabled I get a very slight performance reduction and land repeatedly at 3.9M req/s, for reasons unknown. jemalloc is very tunable, so I guess this could be fixed and maybe even turned into a performance boost.
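To illustrate what I mean by tunable (a sketch of my own, using real jemalloc knob names, but untested on this workload): jemalloc can be configured through the MALLOC_CONF environment variable at startup and through the mallctl() API at runtime:

// Example jemalloc tuning hooks; whether any of this actually helps has
// to be measured. Link with -ljemalloc.
#include <jemalloc/jemalloc.h>
#include <cstddef>
#include <cstdio>

int main()
{
    // Move purging of unused pages off the request path.
    bool enable = true;
    if (mallctl("background_thread", nullptr, nullptr, &enable, sizeof(enable)) != 0)
        std::fprintf(stderr, "could not enable background_thread\n");

    // opt.narenas is read-only at runtime; it can be set at startup with
    // MALLOC_CONF=narenas:N. Here we just read the current value.
    unsigned narenas = 0;
    size_t len = sizeof(narenas);
    if (mallctl("opt.narenas", &narenas, &len, nullptr, 0) == 0)
        std::printf("jemalloc arenas: %u\n", narenas);
    return 0;
}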

fwsGonzo avatar Aug 26 '21 13:08 fwsGonzo