Performance Explanation
Hello,
Why is zap's performance measured so poorly on this site? https://www.techempower.com/benchmarks/#section=data-r23
I don't know. But when briefly checking out the source code, I noticed:
They use a GeneralPurposeAllocator (now named DebugAllocator) as their main allocator, with .{ .thread_safe = true }, and then wrap that already thread-safe allocator in a ThreadSafeAllocator, so every allocation pays for two layers of locking. The GPA isn't the quickest allocator to begin with; std.heap.smp_allocator is supposed to be a faster thread-safe alternative.
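To illustrate what I mean, here's a minimal sketch (assuming Zig 0.14 std names; I haven't run their exact setup) of the double wrapping versus just using smp_allocator:

```zig
const std = @import("std");

pub fn main() !void {
    // Roughly what the benchmark code does: a thread-safe DebugAllocator,
    // wrapped again in ThreadSafeAllocator -> two layers of locking per
    // allocation.
    var gpa: std.heap.DebugAllocator(.{ .thread_safe = true }) = .init;
    defer _ = gpa.deinit();
    var tsa = std.heap.ThreadSafeAllocator{ .child_allocator = gpa.allocator() };
    const slow_allocator = tsa.allocator();
    _ = slow_allocator;

    // Already thread-safe, and meant to be much faster in release builds:
    const fast_allocator = std.heap.smp_allocator;
    const buf = try fast_allocator.alloc(u8, 64);
    defer fast_allocator.free(buf);
}
```

(smp_allocator trades the GPA's leak/double-free diagnostics for speed, which is the right trade for a benchmark.)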
Then they hardcode 128 worker threads into the process, on (if I looked it up correctly) a 56-core machine. But that's not all: they then start one such 128-worker-thread process per CPU core, for a total of at least 56 × 128 = 7168 worker threads on a 56-core machine.
Next, they wrap all endpoints in a middleware that just adds a Server: Zap header to the response and provides a random number generator and a Postgres connection pool. For the plaintext/JSON endpoints, I think that's overkill.
All in all, I think performance could be improved by running just one process with roughly n_cores * 1.0 .. 1.25 threads, at least for the non-Postgres endpoints. Going higher probably helps the Postgres endpoints, since threads blocked waiting on DB responses give the OS something to switch away from. Skipping the middleware for endpoints that don't use Postgres would save a few indirections per request, and using a faster allocator (like the smp_allocator) would also help.
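Deriving the thread count could be as simple as this sketch (the 1.25 factor is my guess for covering threads that occasionally block on I/O, not something I've benchmarked):

```zig
const std = @import("std");

pub fn main() !void {
    // One process; worker count derived from the actual core count
    // instead of a hardcoded 128 per process.
    const cores = try std.Thread.getCpuCount();
    const workers = @max(1, (cores * 5) / 4); // ~cores * 1.25, in integer math
    std.debug.print("cores={d} workers={d}\n", .{ cores, workers });
}
```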
Oh, I just noticed that they use a mutex for mustache.build and for the random number generator. For the RNG, just instantiating one per request where it's needed should be faster than all threads fighting over a single instance and blocking each other on the mutex. Why they mutex-protect the call that builds the mustache template, I don't get; it just makes all the threads serialize on that mutex.
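For the RNG part, a per-request PRNG needs no locking at all. A sketch of what I have in mind (handleRequest is a hypothetical handler, not zap's API; the 1..10000 range matches the "random world id" the DB test asks for):

```zig
const std = @import("std");

// Instead of all threads contending on one mutex-protected RNG,
// each request seeds its own cheap stack-local PRNG.
fn handleRequest() u32 {
    var prng = std.Random.DefaultPrng.init(std.crypto.random.int(u64));
    const random = prng.random();
    return random.intRangeAtMost(u32, 1, 10_000);
}

pub fn main() void {
    std.debug.print("world id: {d}\n", .{handleRequest()});
}
```

(A thread-local PRNG seeded once per thread would avoid even the per-request seeding cost; either way, no shared mutex.)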
So there are a few ways to make it faster.