perf-measures: re-introduce httpz
In the 0.12.0 branch, httpz was added to the perf measurements.
Somehow this got lost along the way, which is a pity, because httpz is super promising.
Given the perf benchmarks in this PR comment, I would have expected httpz to be on par with or better than zap in our measure.sh tests.
However, on my M3 Max Mac, I get the following:
ZAP:
➜ zap git:(reintroduce_httpz_perf) ✗ ./wrk/measure.sh zig-zap
INFO: Listening on port 3000
Listening on 0.0.0.0:3000
INFO: Server is running 4 workers X 4 threads with facil.io 0.7.4 (kqueue)
* Detected capacity: 131056 open file limit
* Root pid: 73099
* Press ^C to stop
INFO: 73110 is running.
INFO: 73111 is running.
INFO: 73112 is running.
INFO: 73113 is running.
========================================================================
zig-zap
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.31ms 533.41us 18.77ms 90.57%
Req/Sec 76.67k 9.04k 86.46k 84.25%
Latency Distribution
50% 1.15ms
75% 1.17ms
90% 1.75ms
99% 2.94ms
3052064 requests in 10.02s, 462.80MB read
Socket errors: connect 0, read 135, write 0, timeout 0
Requests/sec: 304601.19
Transfer/sec: 46.19MB
httpz:
➜ zap git:(reintroduce_httpz_perf) ✗ ./wrk/measure.sh httpz
========================================================================
httpz
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.26ms 528.72us 18.84ms 84.61%
Req/Sec 44.46k 7.35k 85.50k 88.00%
Latency Distribution
50% 2.35ms
75% 2.39ms
90% 2.43ms
99% 3.26ms
1768925 requests in 10.01s, 91.10MB read
Socket errors: connect 0, read 230, write 0, timeout 0
Requests/sec: 176712.50
Transfer/sec: 9.10MB
This looks way off. I must admit I might have done a bad httpz implementation.
Seeking help from @karlseguin. My motivation: to route people away from zap to alternatives like httpz or even zzz, as those are pure Zig and seem to perform really well. I want the dream of a world in which we don't have to resort to C frameworks to write good, Zig-worthy servers :smile: to come true.
BTW: I am aware that taking perf measurements on a Mac is not what I usually do. I just don't have access to that Linux box ATM.
Super interesting!
I'm getting similar httpz numbers on both an M2 Pro and a Ryzen 5/Linux box,
but nowhere near 300k for zap - it's more in the ballpark of the others. Impressive!
You are making me want to get an M3 Max!
It is weird how different my results are.
On an M2:
./wrk/measure.sh httpz
========================================================================
httpz
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.10ms 5.27ms 106.64ms 96.99%
Req/Sec 61.13k 28.18k 241.88k 83.84%
Latency Distribution
50% 1.56ms
75% 1.85ms
90% 2.01ms
99% 25.35ms
2426670 requests in 10.10s, 124.97MB read
Socket errors: connect 0, read 386, write 0, timeout 0
Requests/sec: 240254.03
Transfer/sec: 12.37MB
./wrk/measure.sh zig-zap
========================================================================
zig-zap
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.57ms 8.69ms 160.88ms 95.37%
Req/Sec 58.60k 35.92k 250.13k 74.94%
Latency Distribution
50% 665.00us
75% 1.13ms
90% 3.79ms
99% 45.87ms
2327928 requests in 10.09s, 352.99MB read
Socket errors: connect 0, read 387, write 0, timeout 0
Requests/sec: 230727.51
Transfer/sec: 34.99MB
On an E3-1275 v6:
========================================================================
httpz
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.28ms 377.20us 21.02ms 98.00%
Req/Sec 78.78k 3.40k 103.86k 74.00%
Latency Distribution
50% 1.26ms
75% 1.29ms
90% 1.37ms
99% 1.53ms
3136631 requests in 10.03s, 161.53MB read
Requests/sec: 312587.49
========================================================================
zig-zap
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.34ms 1.92ms 54.45ms 95.66%
Req/Sec 93.75k 3.88k 111.99k 91.00%
Latency Distribution
50% 1.00ms
75% 1.09ms
90% 1.41ms
99% 10.63ms
3732001 requests in 10.05s, 565.90MB read
Requests/sec: 371458.89
Transfer/sec: 56.33MB
For a "hello world" example, the main thing you can do is tweak the worker count and thread pool size. But if you're running wrk on the same machine as the server, I don't think you have any cores to spare. I tried various settings for both, and they largely just hurt performance:
var server = try httpz.Server(void).init(allocator, .{
    .port = 3000,
    .workers = .{ .count = 2 },
    .thread_pool = .{ .count = 6 },
}, {});
I'm starting to suspect/fear that httpz has some scaling issues. I tested it on a 32 vCPU cloud instance and couldn't get it to scale linearly (or close to it) with the number of threads. Although, for the life of me, I can't figure out where the bottleneck is. Profiling shows writev as being the largest bottleneck. Gonna wait to see Anton's newest video, to see if he runs into the same thing... since his testing setup is better than mine.
Super interesting!
I'm getting similar httpz numbers on both an M2 Pro and a Ryzen 5/Linux box,
but nowhere near 300k for zap - it's more in the ballpark of the others. Impressive!
You are making me want to get an M3 Max!
Thanks for sharing! Interesting to see the M2 pro numbers!
It is weird how different my results are.
On an m2:
[...]
Awesome! Thanks for trying it with your configurations! The httpz Linux Transfer/sec reading would have been interesting; it got cut off, but never mind.
Looking at the differences on a Linux machine, httpz and zap don't seem that far off. Those are the only numbers that really matter IMHO, because if you are serious about a server, you don't run it on a Mac - might be a hot take, IDK.
For a "hello world" example, the main thing you can do is tweak the worker count and thread pool size. But if you're running wrk on the same machine as the server, I don't think you have any cores to spare. I tried various settings for both, and they largely just hurt performance:
var server = try httpz.Server(void).init(allocator, .{
    .port = 3000,
    .workers = .{ .count = 2 },
    .thread_pool = .{ .count = 6 },
}, {});
Yeah, you have to be careful not to allocate more cores than you have - and cores are not created equal, esp. on new Macs.
I'm starting to suspect/fear that httpz has some scaling issues. I tested it on a 32 vcpu cloud instance, and couldn't get it to scale linearly (or close to) with # of threads. Although, for the life of me, I can't figure out where the bottleneck is. Profiling shows writev as being the largest bottleneck. Gonna wait to see Anton's newest video, to see if he runs into the same thing... since his testing setup is better than mine.
Very interesting! I have no clue wrt httpz either. Do you mean writev being a (syscall) contention bottleneck, or is it time actually spent inside writev?
Hypotheticals that come to mind: ... wait.
Actually, as food for thought, here are some ideas from ChatGPT :-)
- Thread Contention on Shared Resources:
• File Descriptor Contention: If the worker threads are contending for access to shared file descriptors or other shared resources (e.g., logging, connection state), this can create bottlenecks. Even if epoll is only used for accepting connections, contention on these resources can slow down the overall processing.
- CPU Cache Contention:
• Cache Line Contention: As the number of threads increases, the worker threads might start contending for CPU cache lines, especially if they are working on shared data structures or frequently accessing similar memory addresses. This can reduce performance and prevent linear scaling.
• False Sharing: If threads are working on variables that are close together in memory but are supposed to be independent, they could cause false sharing, where updates to these variables cause unnecessary cache invalidations.
- I/O Subsystem Bottlenecks:
• Network or Disk I/O Saturation: The worker threads are likely performing readv and writev on network sockets or disk files. If the I/O subsystem (network or disk) is saturated, adding more threads won’t increase throughput because the underlying hardware has reached its limit.
• TCP/IP Stack Limits: On a heavily loaded server, the TCP/IP stack itself might become a bottleneck, especially if it’s handling a large number of connections. This can happen even if there are plenty of CPU cores available.
- NUMA Effects:
• NUMA Node Misalignment: If your server is running on a NUMA (Non-Uniform Memory Access) architecture, threads might be running on different NUMA nodes and accessing memory that is not local to their node. This can significantly increase memory access latency and reduce scalability. Ensuring that threads are properly pinned to cores and that memory is allocated on the same NUMA node as the thread can help.
- Load Imbalance:
• Uneven Workload Distribution: If some worker threads are processing more data or more complex requests than others, it can lead to load imbalance. This imbalance can cause some threads to become bottlenecks while others are underutilized, reducing overall scalability.
- Thread Pool Overhead:
• Thread Management Overhead: The overhead of managing a large number of threads (e.g., context switching, task queue management) might increase as more threads are added. This overhead can limit scalability, especially if the thread pool implementation is not optimized for high concurrency.
- Suboptimal Use of readv and writev:
• Inefficient Buffer Management: If the buffers used in readv and writev are not optimally sized or aligned, the performance benefits of these system calls may not be fully realized. This could lead to suboptimal I/O performance as the number of threads increases.
• Partial Reads/Writes: If the worker threads are not handling partial reads/writes efficiently, this can lead to increased syscall overhead or I/O blocking, which can degrade performance as more threads are added.
Potential Solutions:
• Optimize Resource Access: Minimize contention on shared resources by using thread-local storage, lock-free data structures, or reducing the critical section of code that needs to be synchronized.
• NUMA Awareness: Ensure that threads are properly pinned to cores and that memory is allocated on the same NUMA node to reduce latency.
• Balance Load: Implement or improve load balancing mechanisms to ensure that work is evenly distributed among worker threads.
• Profile I/O Operations: Profile your I/O subsystem to identify any bottlenecks in the network or disk I/O, and optimize readv and writev usage by fine-tuning buffer sizes and ensuring efficient handling of partial reads/writes.
• Reduce Thread Pool Overhead: Consider tuning the thread pool size and task management to reduce overhead. Sometimes, fewer, more efficiently managed threads can outperform a larger pool with more overhead.
Sorry, it's a bit verbose, but I find it did a great job anyway :-)
Closing this, as I am getting rid of micro-benchmarks.