zig implement a fast general purpose allocator

trafficstars

Requirements:

faster than glibc malloc
faster than musl malloc
faster than mingw-w64 malloc
faster than jemalloc
faster than mimalloc

Once it has been verified for correctness, fuzz tested, and tested with various third party applications such as web browsers using LD_PRELOAD, it should become the default implementation used by std.heap.GeneralPurposeAllocator when optimization mode is ReleaseFast.

Good luck, have fun

Aug 19 '22 06:08 andrewrk

I love this idea

WebKit's memory allocator is interesting https://github.com/WebKit/WebKit/blob/main/Source/bmalloc/libpas/Documentation.md

I think Zig could do something especially cool here since it knows the type being allocated at comptime

Aug 19 '22 09:08 Jarred-Sumner

faster than [all other mallocs]

This may be tough to do 😉

Aug 19 '22 15:08 Techcable

I decided to take a look at recent publications on memory allocators and I stumbled upon this masters thesis: https://uwspace.uwaterloo.ca/handle/10012/18329

The author experimented with their own memory allocator, but besides that they did a pretty thorough study on performance of other popular and state of the art memory allocators. In the conclusion the author states that rpmalloc often landed as the fastest memory allocator.

The rpmalloc project seems pretty sensible, so it might be a good idea to bring it's design over to Zig.

https://github.com/mjansson/rpmalloc

Aug 20 '22 23:08 cipharius

I think starting with a port of rpmalloc would be an excellent strategy.

Aug 20 '22 23:08 andrewrk

I've been further investigating state of allocators and snmalloc caught my attention. It's up there with rpmalloc performance, but it's message passing design seems solid for multi-threaded allocation/deallocation pipelines and unlike rpmalloc it has a paper describing it's design details, so don't have to rely only on the source code.

Anyone got opinion on snmalloc?

Sep 02 '22 16:09 cipharius

I feel like mesh would also be an interesting contender for this. It manages to reduce the memory footprint significantly, and feels like it would fit zig.

https://github.com/plasma-umass/Mesh https://www.youtube.com/watch?v=xb0mVfnvkp0&ab_channel=MicrosoftResearch

Edit: Probably disqualified because it heavily relies on virtual memory, which makes it unsuitable for use in wasm.

Sep 08 '22 10:09 somethingelseentirely

It's ok to use virtual memory. Wasm can use a different implementation

Sep 19 '22 20:09 andrewrk

What is wrong with porting mimalloc? Why it can't be used?

Oct 26 '22 22:10 darowny

mimalloc looks fine. Porting it would be a great start.

Oct 26 '22 22:10 andrewrk

If it my belief that good engineering is finding the best tradeoff. In this spirit, we should always optimize for something. In most cases we should optimize for simplicity.

Optimizing for speed, as suggested here, is in my opinion contradictory with the nature of a general-purpose allocator.

Maybe the goal should be more clearly defined, like optimizing for reasonable speed in the worst case, or the average case, but given the wide range of possible workloads for a general-purpose allocator, even defining what could be the worst case might be difficult or impossible.

If I had to write a general-purpose allocator the design goals would be simple, slow, no surprises.

Nov 11 '22 11:11 stephc-int13

The existing GeneralPurposeAllocator implementation basically checks those boxes: it's fairly simple, and has debugging advantages, but it's slow. But my understanding is that this implementation wouldn't be going anywhere - it'd still be used in safe release modes. The fast one, as the issue says, would specifically be for ReleaseFast builds. That means you get the best of both worlds - safety and simplicity while developing and debugging, and good runtime performance if and when you need it.

I agree the goal of speed could be better defined, but imo it's clear enough that we do need some faster GPA implementation, as the current one has performance issues which I myself have hit before (which have caused me to have to fallback to c_allocator sometimes). Having an allocator in the standard library with good performance characteristics for normal use cases seems like a pretty reasonable ask.

Nov 12 '22 23:11 mlugg

Keeping security in mind while developing the new one would be nice. There have been plenty of CVE's related to flawed heap allocators.

Some example attack vectors for inspiration:

how2heap: https://github.com/shellphish/how2heap

Owning Firefox's heap by exploiting jemalloc: https://speakerdeck.com/argp/exploiting-the-jemalloc-memory-allocator-owning-firefoxs-heap?slide=16

How to Exploit Dlmalloc Unlink: https://www.davidxia.com/2020/04/how-to-exploit-dlmalloc-unlink/

tcmalloc attacks: https://blog.infosectcbr.com.au/2019/12/attacks-on-tcmalloc-heap-allocator.html

mimalloc attacks: https://github.com/microsoft/mimalloc/issues/372

Nov 14 '22 09:11 cryptocode

If you are interested in adding security related features, you might want to look at some of the research we have done on snmalloc: https://github.com/microsoft/snmalloc/blob/main/docs/security/README.md

Most of the ideas are portable, but certain design choices make that easier than others.

Nov 14 '22 09:11 mjp41

OpenBSD's malloc may be a good starting point until we implement something more complicated.

A deep dive into OpenBSD malloc internals

It's clean and short, fits in a single file, and doesn't seem to do anything that would be difficult to port to Zig: https://github.com/openbsd/src/blob/master/lib/libc/stdlib/malloc.c

Of course, it comes with security features, separates data and metadata, and leverages randomization as much as possible.

Single-thread performance is good. Mutexes will hurt in highly concurrent scenarios, but Zig makes it easy to have per-thread allocators.

It can be replaced with something more complex later on.

/cc @omoerbeek

Nov 14 '22 12:11 jedisct1

https://blog.chromium.org/2021/04/efficient-and-safe-allocations-everywhere.html

PartitionAlloc is Chromium’s memory allocator, designed for lower fragmentation, higher speed, and stronger security and has been used extensively within Blink (Chromium’s rendering engine). In Chrome 89 the entire Chromium codebase transitioned to using PartitionAlloc everywhere (by intercepting and replacing malloc() and new) on Windows 64-bit and Android. Data from the field demonstrates up to 22% memory savings, and up to 9% improvement in responsiveness and scroll latency of Chrome.

It does not have a lot of security features but it seems simple

Nov 21 '22 19:11 ghost

mimalloc looks fine. Porting it would be a great start.

I had a question on this. Do we absolutely need to port mimalloc to zig as opposed to a Zig wrapper? The reason I ask is mimalloc is still maintained and looks like well written C code. Any improvements and/or bug fixes will go into the upstream version and a Zig port will need to track and port over the changes.

On the flip side it doesn't look like a lot of changes are going in, so a port wouldn't be hard to maintain and has a better chance of landing as a GPA in the zig source base.

Nov 24 '22 07:11 rganesan

I had a question on this. Do we absolutely need to port mimalloc to zig as opposed to a Zig wrapper?

The new allocator is supposed to be a fast default allocator for ReleaseFast, and I assume wrapping mimalloc would require libc which sounds like a no-go.

Nov 24 '22 10:11 cryptocode

On Thu, Nov 24, 2022 at 3:35 PM cryptocode @.***> wrote:

I had a question on this. Do we absolutely need to port mimalloc to zig as opposed to a Zig wrapper?

The new allocator is supposed to be a fast default allocator for ReleaseFast, and I assume wrapping mimalloc would require libc which sounds like a no-go.

Oh, good point. I didn't think of that. I think mimalloc and rpmalloc seem to be the best candidates for an initial port for two reasons. They're written in C and the codebases are relatively small. I'd like to give this a shot but not duplicate the effort if anyone is already working on a port of either of these two.

Ganesan

-- Ganesan Rajagopal

Nov 24 '22 14:11 rganesan

Most allocators depend very little on libc. Even though snmalloc is written in C++ it doesn't need a C++ library at runtime. If you want to LD_PRELOAD, then you can't touch very much libc or libcxx and not run into trouble.

With snmalloc, almost all the libc dependencies are in confined to two places, a platform abstraction, and the thread local state:

https://github.com/microsoft/snmalloc/tree/main/src/snmalloc/pal
https://github.com/microsoft/snmalloc/blob/main/src/snmalloc/global/threadalloc.h

This means we can run as the allocator inside a (Uni)kernel, inside SGX enclaves, and on other non-standard platforms where we can't depend on libc existing, or very much libc existing.

Nov 24 '22 14:11 mjp41

I've come up with what is basically a working port of rpmalloc: https://github.com/InKryption/rpmalloc-zig-port

I still have yet to benchmark it - hopefully it should yield similar results to the source material.

The only unfortunate thing about it is that the design seems to rely pretty heavily on global state, and uses a threadlocal variable, so I'm not 100% certain on how this could integrate into GeneralPurposeAllocator (whose state exists independently per instantiation). Any feedback is welcome.

Dec 01 '22 22:12 InKryption

On Fri, Dec 2, 2022 at 4:07 AM InKryption @.***> wrote:

I've come up with what is basically a working port of rpmalloc: https://github.com/InKryption/rpmalloc-zig-port

I still have yet to benchmark it - hopefully it should yield similar results to the source material.

This is nice. Thank you for doing this. I thought mimalloc would be a better choice for GPA compared to rpmalloc if only because it has additional debug capabilities (like guard pages) which would be nice for debug builds.

However mimalloc has a lot more moving parts and I thought it would be worth attempting an rpmalloc port myself. I think it would make an excellent choice for release builds.

The only unfortunate thing about it is that the design seems to rely pretty heavily on global state, and uses a threadlocal variable, so I'm not 100% certain on how this could integrate into GeneralPurposeAllocator (whose state exists independently per instantiation).

Wouldn't RPMALLOC_FIRST_CLASS_HEAPS in the original C implementation address this?

Any feedback is welcome.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

-- Ganesan Rajagopal

Dec 02 '22 14:12 rganesan

Wouldn't RPMALLOC_FIRST_CLASS_HEAPS in the original C implementation address this?

After having spent a reasonable amount of time sleuthing the code structure, it unfortunately doesn't seem like it. First class heaps are still tied into the same general structure involving the thread local state, so ultimately I decided to omit porting that feature, as I don't see a viable way that it could be used in the Allocator interface, so it ends up just being extra overhead.

Dec 02 '22 16:12 InKryption

On Fri, Dec 2, 2022 at 9:37 PM InKryption @.***> wrote:

Wouldn't RPMALLOC_FIRST_CLASS_HEAPS in the original C implementation address this?

After having spent a reasonable amount of time sleuthing the code structure, it unfortunately doesn't seem like it. First class heaps are still tied into the same general structure involving the thread local state, so ultimately I decided to omit porting that feature, as I don't see a viable way that it could be used in the Allocator interface, so it ends up just being extra overhead.

I see. Another reason first class heaps is a no go is because of the assumption that only one thread can access the heap functions at a time. We'll have to lock and that defeats the purpose of a lockfree implementation.

However, I didn't understand what's the concern with thread local state. If I understand your implementation correctly, you have already made thread_heap non-global and return it as part of the Allocator struct. I'm new to Zig, so please excuse me if I'm missing something obvious.

Ganesan

-- Ganesan Rajagopal

Dec 03 '22 05:12 rganesan

On Fri, Dec 2, 2022 at 4:07 AM InKryption @.***> wrote:

I've come up with what is basically a working port of rpmalloc: https://github.com/InKryption/rpmalloc-zig-port

I still have yet to benchmark it - hopefully it should yield similar results to the source material.

The only unfortunate thing about it is that the design seems to rely pretty heavily on global state, and uses a threadlocal variable, so I'm not 100% certain on how this could integrate into GeneralPurposeAllocator (whose state exists independently per instantiation). Any feedback is welcome.

I want to revisit this question as I'm looking at wrapping mimalloc and it also relies pretty heavily on global slate. Both rpmalloc and mimalloc have dedicated page pools for different bucket sizes. They also rely on per-thread heaps for performance reasons. Maintaining a per-instance state would be quite expensive IMHO.

Is there a good argument to maintain state independently per instantiation? I can see that being useful and necessary for something like an arena allocator. But rpmalloc/mimalloc are meant to be general purpose allocators, can't they be treated as singletons?

Ganesan

Ganesan Rajagopal

Dec 09 '22 16:12 rganesan

As far as I know, the best allocator is the one that best suits your particular application's needs. Hence why there are so many projects that have a custom allocator that defeats every alternative, for their project. Here's one that has yet to be mentioned: Roblox's custom Lua(u) allocator (I think the source files are here and here) that beats rpmalloc, mimalloc, jemalloc, ptmalloc and tcmalloc through domain-specific tuning (source). Thus, I think there should be an allocator which decides which strategies to use based on how it is actually used by the application.

I don't know much about allocators, but I imagine the Zig allocator should be able to determine at compile-time what the allocation workload is like by looking at allocation logs from previous runs and generating a tuned allocator specifically to meet those needs. That way, everyone gets the best possible allocator for their use-case?

Jan 20 '23 14:01 Validark

As far as I know, the best allocator is the one that best suits your particular application's needs. Hence why there are so many projects that have a custom allocator that defeats every alternative, for their project. Here's one that has yet to be mentioned: Roblox's custom Lua(u) allocator (I think the source files are here and here) that beats rpmalloc, mimalloc, jemalloc, ptmalloc and tcmalloc through domain-specific tuning (source). Thus, I think there should be an allocator which decides which strategies to use based on how it is actually used by the application.

I don't know much about allocators, but I imagine the Zig allocator should be able to determine at compile-time what the allocation workload is like by looking at allocation logs from previous runs and generating a tuned allocator specifically to meet those needs. That way, everyone gets the best possible allocator for their use-case?

This doesn't make too much sense since Zig is already designed around idea of customizable allocators. If you need a specialized allocator you can already write one and easily use it in some parts or in whole Zig project.

This thread is specifically dedicated to GeneralPurposeAllocator ReleaseFast mode, so it's supposed to be generic solution for maximum performance target, which may neglect safety features for sake of performance and work well in most cases.

Jan 20 '23 16:01 cipharius

Note: Not enough baggage to argue about.

I recently viewed a benchmark of memory allocations apparently performed in the Windows environment.

I chose five allocators to compare.

dlmalloc – previously popular

jemaloc – industrial strength, tuned for long running servers

mimalloc – general purpose allocator by Microsoft

rpmalloc – lockfree GPA by a now Epic Games employee

tlsf – pool allocator used in some games

https://www.forrestthewoods.com/blog/benchmarking-malloc-with-doom3/

Feb 01 '23 00:02 kassane

I'd like to note that it's very easy to write an allocator that's much, much faster than the current GeneralPurposeAllocator. My BinnedAllocator (<200 lines, excluding tests) is an order of magnitude faster, and can trade blows with glibc malloc and even mimalloc on some benchmarks.

I'd like to do more benchmarking to get better real-world performance numbers but haven't gotten around to it. If anyone can recommend a nice allocator benchmark to translate into Zig that'd be great!

Given that it's very easy to outspeed GPA, I'd suggest adding something to the standard library that's at least somewhat usable for performance-conscious applications, even if it's not as fast as more complex options. Those more advanced allocators can be discussed and implemented later, and can replace the simpler implementation. I'm happy to PR my BinnedAllocator if that's wanted :)

Feb 01 '23 00:02 silversquirl

I'd also like to suggest renaming the current GeneralPurposeAllocator to DebugAllocator and introducing a new FastAllocator (or maybe DebugGeneralAllocator and FastGeneralAllocator?). std.heap.GeneralPurposeAllocator can then refer to either DebugAllocator or FastAllocator depending on the build mode.

This allows people to choose which one to use if they want, and separates the code for the two allocators (as I doubt much of it will be shared).

Feb 01 '23 00:02 silversquirl

Andrew has previously mentioned that use of the term fast is to be discouraged (https://github.com/ziglang/zig/pull/13002#issuecomment-1264287533), and I agree with him; it's not a helpful term in this case, as it tells you nothing about the functionality or intended usage of the allocator (don't I ideally want all my allocations to be fast?). Renaming the current allocator to DebugAllocator probably makes sense (making it fast in release builds seems like it might be a lost cause without a lot of release-specific logic), but I think the alternative(s) should have more descriptive name(s), like your BinnedAllocator.

On that note, I think that allocator would probably be a good fit for std, and as a temporary replacement for GPA in release modes until we have a better alternative. The trivial pathological cases space-wise present in binned allocators mean that, in my opinion, they're not a great fit for std's general purpose allocator long-term, but release-mode GPA is borderline unusable in any application with a lot of allocation at the minute, so BinnedAllocator would certainly be a good temporary solution.

Feb 01 '23 03:02 mlugg

zig zig copied to clipboard

implement a fast general purpose allocator

Ganesan

zig
zig copied to clipboard