perf-measures: re-introduce httpz
In the 0.12.0 branch, httpz was added to the perf measurements.
Somehow this got lost along the way, which is a pity, because httpz is super promising.
Given the perf benchmarks in this PR comment, I would have expected httpz to be on par with or better than zap in our measure.sh tests.
However, on my M3 Max Mac, I get the following:
ZAP:
➜ zap git:(reintroduce_httpz_perf) ✗ ./wrk/measure.sh zig-zap
INFO: Listening on port 3000
Listening on 0.0.0.0:3000
INFO: Server is running 4 workers X 4 threads with facil.io 0.7.4 (kqueue)
* Detected capacity: 131056 open file limit
* Root pid: 73099
* Press ^C to stop
INFO: 73110 is running.
INFO: 73111 is running.
INFO: 73112 is running.
INFO: 73113 is running.
========================================================================
zig-zap
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.31ms 533.41us 18.77ms 90.57%
Req/Sec 76.67k 9.04k 86.46k 84.25%
Latency Distribution
50% 1.15ms
75% 1.17ms
90% 1.75ms
99% 2.94ms
3052064 requests in 10.02s, 462.80MB read
Socket errors: connect 0, read 135, write 0, timeout 0
Requests/sec: 304601.19
Transfer/sec: 46.19MB
httpz:
➜ zap git:(reintroduce_httpz_perf) ✗ ./wrk/measure.sh httpz
========================================================================
httpz
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.26ms 528.72us 18.84ms 84.61%
Req/Sec 44.46k 7.35k 85.50k 88.00%
Latency Distribution
50% 2.35ms
75% 2.39ms
90% 2.43ms
99% 3.26ms
1768925 requests in 10.01s, 91.10MB read
Socket errors: connect 0, read 230, write 0, timeout 0
Requests/sec: 176712.50
Transfer/sec: 9.10MB
This looks way off. I must admit I might have done a bad httpz implementation.
Seeking help from @karlseguin. My motivation: to route people away from zap to alternatives like httpz or even zzz, as those are pure Zig and seem to perform really well. I want the dream of a world in which we don't have to resort to C frameworks to write good, Zig-worthy servers :smile: to come true.
BTW: I am aware that taking perf measurements on a Mac is not what I usually do. I just don't have access to that Linux box ATM.
Super interesting!
I'm getting similar httpz numbers on both an M2 Pro and a Ryzen 5/Linux box,
but nowhere near 300k for zap - it's more in the ballpark of the others. Impressive!
You are making me want to get an M3 Max!
It is weird how different my results are.
On an M2:
./wrk/measure.sh httpz
========================================================================
httpz
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.10ms 5.27ms 106.64ms 96.99%
Req/Sec 61.13k 28.18k 241.88k 83.84%
Latency Distribution
50% 1.56ms
75% 1.85ms
90% 2.01ms
99% 25.35ms
2426670 requests in 10.10s, 124.97MB read
Socket errors: connect 0, read 386, write 0, timeout 0
Requests/sec: 240254.03
Transfer/sec: 12.37MB
./wrk/measure.sh zig-zap
========================================================================
zig-zap
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.57ms 8.69ms 160.88ms 95.37%
Req/Sec 58.60k 35.92k 250.13k 74.94%
Latency Distribution
50% 665.00us
75% 1.13ms
90% 3.79ms
99% 45.87ms
2327928 requests in 10.09s, 352.99MB read
Socket errors: connect 0, read 387, write 0, timeout 0
Requests/sec: 230727.51
Transfer/sec: 34.99MB
On an E3-1275 v6:
========================================================================
httpz
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.28ms 377.20us 21.02ms 98.00%
Req/Sec 78.78k 3.40k 103.86k 74.00%
Latency Distribution
50% 1.26ms
75% 1.29ms
90% 1.37ms
99% 1.53ms
3136631 requests in 10.03s, 161.53MB read
Requests/sec: 312587.49
========================================================================
zig-zap
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.34ms 1.92ms 54.45ms 95.66%
Req/Sec 93.75k 3.88k 111.99k 91.00%
Latency Distribution
50% 1.00ms
75% 1.09ms
90% 1.41ms
99% 10.63ms
3732001 requests in 10.05s, 565.90MB read
Requests/sec: 371458.89
Transfer/sec: 56.33MB
For a "hello world" example, the main thing you can do is tweak the worker count and thread pool size. But if you're running wrk on the same machine as the server, I don't think you have any cores to spare. I tried various settings for both, and they largely just hurt performance:
var server = try httpz.Server(void).init(allocator, .{
    .port = 3000,
    .workers = .{ .count = 2 },
    .thread_pool = .{ .count = 6 },
}, {});
I'm starting to suspect/fear that httpz has some scaling issues. I tested it on a 32 vCPU cloud instance and couldn't get it to scale linearly (or close to it) with the number of threads. Although, for the life of me, I can't figure out where the bottleneck is. Profiling shows writev as being the largest bottleneck. Gonna wait to see Anton's newest video, to see if he runs into the same thing... since his testing setup is better than mine.
Super interesting!
I'm getting similar httpz numbers on both an M2 Pro and a Ryzen 5/Linux box,
but nowhere near 300k for zap - it's more in the ballpark of the others. Impressive!
You are making me want to get an M3 Max!
Thanks for sharing! Interesting to see the M2 pro numbers!
It is weird how different my results are.
On an m2:
[...]
Awesome! Thanks for trying it with your configurations! The httpz Linux Transfer/sec reading would have been interesting; it got cut off, but never mind.
Looking at the differences on a Linux machine, httpz and zap don't seem that far off. Those are the only numbers that really matter IMHO, because if you are serious about a server, you don't run it on a Mac - might be a hot take, IDK.
For a "hello world" example, the main thing you can do is tweak the worker count and thread pool size. But if you're running wrk on the same machine as the server, I don't think you have any cores to spare. I tried various settings for both, and they largely just hurt performance:
var server = try httpz.Server(void).init(allocator, .{
    .port = 3000,
    .workers = .{ .count = 2 },
    .thread_pool = .{ .count = 6 },
}, {});
Yeah, you have to be careful not to allocate more cores than you have - and cores are not created equal, esp. on new Macs.
I'm starting to suspect/fear that httpz has some scaling issues. I tested it on a 32 vcpu cloud instance, and couldn't get it to scale linearly (or close to) with # of threads. Although, for the life of me, I can't figure out where the bottleneck is. Profiling shows writev as being the largest bottleneck. Gonna wait to see Anton's newest video, to see if he runs into the same thing... since his testing setup is better than mine.
Very interesting! I have no clue wrt httpz either. Do you mean writev being a (syscall) contention bottleneck, or is it time actually spent inside writev?
Hypotheticals that come to mind: ... wait.
Actually, as food for thought, here are some ideas from ChatGPT :-)
- Thread Contention on Shared Resources:
• File Descriptor Contention: If the worker threads are contending for access to shared file descriptors or other shared resources (e.g., logging, connection state), this can create bottlenecks. Even if epoll is only used for accepting connections, contention on these resources can slow down the overall processing.
- CPU Cache Contention:
• Cache Line Contention: As the number of threads increases, the worker threads might start contending for CPU cache lines, especially if they are working on shared data structures or frequently accessing similar memory addresses. This can reduce performance and prevent linear scaling.
• False Sharing: If threads are working on variables that are close together in memory but are supposed to be independent, they could cause false sharing, where updates to these variables cause unnecessary cache invalidations.
- I/O Subsystem Bottlenecks:
• Network or Disk I/O Saturation: The worker threads are likely performing readv and writev on network sockets or disk files. If the I/O subsystem (network or disk) is saturated, adding more threads won’t increase throughput because the underlying hardware has reached its limit.
• TCP/IP Stack Limits: On a heavily loaded server, the TCP/IP stack itself might become a bottleneck, especially if it’s handling a large number of connections. This can happen even if there are plenty of CPU cores available.
- NUMA Effects:
• NUMA Node Misalignment: If your server is running on a NUMA (Non-Uniform Memory Access) architecture, threads might be running on different NUMA nodes and accessing memory that is not local to their node. This can significantly increase memory access latency and reduce scalability. Ensuring that threads are properly pinned to cores and that memory is allocated on the same NUMA node as the thread can help.
- Load Imbalance:
• Uneven Workload Distribution: If some worker threads are processing more data or more complex requests than others, it can lead to load imbalance. This imbalance can cause some threads to become bottlenecks while others are underutilized, reducing overall scalability.
- Thread Pool Overhead:
• Thread Management Overhead: The overhead of managing a large number of threads (e.g., context switching, task queue management) might increase as more threads are added. This overhead can limit scalability, especially if the thread pool implementation is not optimized for high concurrency.
- Suboptimal Use of readv and writev:
• Inefficient Buffer Management: If the buffers used in readv and writev are not optimally sized or aligned, the performance benefits of these system calls may not be fully realized. This could lead to suboptimal I/O performance as the number of threads increases.
• Partial Reads/Writes: If the worker threads are not handling partial reads/writes efficiently, this can lead to increased syscall overhead or I/O blocking, which can degrade performance as more threads are added.
Potential Solutions:
• Optimize Resource Access: Minimize contention on shared resources by using thread-local storage, lock-free data structures, or reducing the critical section of code that needs to be synchronized.
• NUMA Awareness: Ensure that threads are properly pinned to cores and that memory is allocated on the same NUMA node to reduce latency.
• Balance Load: Implement or improve load balancing mechanisms to ensure that work is evenly distributed among worker threads.
• Profile I/O Operations: Profile your I/O subsystem to identify any bottlenecks in the network or disk I/O, and optimize readv and writev usage by fine-tuning buffer sizes and ensuring efficient handling of partial reads/writes.
• Reduce Thread Pool Overhead: Consider tuning the thread pool size and task management to reduce overhead. Sometimes, fewer, more efficiently managed threads can outperform a larger pool with more overhead.
Sorry, it's a bit verbose, but I find it did a great job anyway :-)
Closing this, as I am getting rid of micro-benchmarks.