
FR: investigate a new memory allocator

Open thoughtpolice opened this issue 2 years ago • 13 comments

glibc has a notoriously "middle of the road" malloc(3) implementation. The cost is hard to see in short-lived programs that do all their work and exit, but in reality, today, there is essentially no benchmark or workload in which it will ever be superior to an alternative like mimalloc, jemalloc, or (modern, non-gperftools) tcmalloc; long running, short running, big working set, small working set, it doesn't matter. Not to mention it doesn't return memory to the system very gracefully, and the default configuration actually wastes tons of memory, which will eventually become extremely problematic if we ever get long running components like file watching tools, daemons, etc. "Replace glibc malloc" is typically one of the easiest low-hanging wins you can get, and one I reach for in most of my own programs.

I made a small experimental patch to HEAD that switched the default global allocator to mimalloc (through the mimalloc crate), and saw anywhere from a 2-9% performance improvement on some very tiny microbenchmarks when playing with linux.jj and nixpkgs.jj, but it might be noisy (gecko-dev TBD). I think mimalloc is probably the best choice for its extreme ease of integration (~10 .c files, so we could just include it manually), cross platform support, a design simple and elegant enough that we could understand it fully, and generally excellent performance everywhere. However, the mimalloc crate is stuck on an older version of the upstream C library, while I'd like us to use the 2.x branch with secure freelists enabled, so I need to play around with it sometime this week. (I personally think vendoring things like this is OK, but I understand I'm probably alone on this.)

thoughtpolice avatar Nov 01 '23 12:11 thoughtpolice

Also, it might very well improve performance on Windows and macOS, too. Ideally we would just enable it everywhere I guess; that makes support and QA a lot easier when we don't have to remember weird platform-specific inconsistencies like that.

thoughtpolice avatar Nov 01 '23 12:11 thoughtpolice

And actually, I suspect the situation with musl is vastly improved as well; musl's allocator is notoriously bad for anything with multiple threads. I think wasmtime suffers something like a 20x performance loss with the built-in allocator in threaded workloads, compared to mimalloc, when statically linked with musl: https://www.tweag.io/blog/2023-08-10-rust-static-link-with-mimalloc/

thoughtpolice avatar Nov 01 '23 12:11 thoughtpolice

I wonder why the various authors don't submit an update for glibc malloc, and improve everything that uses it.

joyously avatar Nov 01 '23 13:11 joyously

I think the reality is that it remains the way it is due to a mixture of politics, willingness, and the overall trajectory of the project(s) in question; not just glibc, but also the memory allocators themselves and the goals of their authors. For example, I hardly think glibc is concerned with designing a cross-platform memory allocator for Windows/macOS. Realistically everything can always be improved; it's a matter of time. But memory allocators are also deeply, notoriously workload-dependent, and it isn't always clear which optimizations are the right ones. Lock contention (multithreading) isn't always the most important consideration; sometimes it's how efficient free() is for a size class or a remote thread, or how cache-efficient the design is, or how well it handles fragmentation over long runtimes across a good spread of size classes.

For example, Facebook relies heavily on jemalloc because, for their workloads, it is simply the best allocator there is, where sized delete operators are a critical path to optimize for. As their workloads change, they can continue to monitor and adapt the design to handle those things. Likewise for Google, who authored tcmalloc, which is NUMA and Transparent-HugePage aware. Similarly, mimalloc aims to be cross platform, scale well with threads, have excellent cache locality and ultimately be simple in design and practice.

glibc's malloc optimizes for... well, I don't really know. Desktop Linux? As an example: by default it will create up to N*8 memory "arenas", where N is the number of CPUs you have. When a thread hits contention allocating from its current arena, glibc spins up a new arena for it, until that N*8 limit is reached, and each arena reserves its own chunk of address space. This means an application with few threads but a lot of allocations can end up holding a huge amount of memory for nothing. The goal is to reduce lock contention by "sharding", giving threads private pools to take free slots from. But this is actually an anti-pattern in many ways: in a world where everyone uses containers, you're often allocated very small amounts of memory and very little CPU. glibc's design is a waste if you're running in a docker container with 128MiB of memory and 1 vCPU, but on a machine with 12 CPU cores! And a lot of those "detect number of CPUs" algorithms aren't cgroup-aware, so you have to tune all of this manually (e.g. via MALLOC_ARENA_MAX).

This also isn't the first time glibc's primitives have been beaten handily; I seem to remember that, for many years, there was lots of low-hanging fruit in its string processing routines, memcpy/memset, etc.

And, many workloads just need different APIs for performance, correctness, and observability concerns, that the standard malloc cannot provide in any meaningful way. mimalloc and jemalloc for example provide many non-standardized APIs for thread-specific heaps, statistics gathering, arenas, and more. These may be critical for applications in production. Putting all these non-standard features inside glibc is potentially a bad idea because they will have to be supported for ~eternity (glibc has extremely strict ABI compatibility), whereas third party memory allocators can be bundled with the application. So, if glibc ever wanted to rewrite their allocator, they may be stuck with the old one ~forever, anyway, because applications are built against those non-standard interfaces. That would be a huge maintenance burden.

Finally, there's an easy and practical reason to like this setup: because you don't want memory allocation performance to be tied to libc! For example, your application might perform totally differently with musl, because of malloc. Having a third party memory allocator means you can use it with either musl or glibc, achieving more consistent performance and behavior.

This isn't to say glibc's allocator will never be good or that the maintainers don't want it to be better. But malloc implementations are often extremely policy and workload dependent. While the ones I listed are general, overall improvements over the status quo on Linux — for almost all workloads — I can assure you the last memory allocator has not yet been written.

thoughtpolice avatar Nov 01 '23 13:11 thoughtpolice

Tangentially related, using SmallVec/CompactString can save allocation cost.

For example, with a CompactString-backed RepoPath, working-copy snapshotting appears to be faster (no watchman enabled):

yuya ~/mirrors/linux % hyperfine --warmup 3 --runs 10 "/tmp/jj-baseline files no-match" "/tmp/jj-cstr files no-match"
Benchmark 1: /tmp/jj-baseline files no-match
  Time (mean ± σ):     700.3 ms ±  23.6 ms    [User: 1342.2 ms, System: 765.2 ms]
  Range (min … max):   665.0 ms … 741.3 ms    10 runs
 
Benchmark 2: /tmp/jj-cstr files no-match
  Time (mean ± σ):     573.9 ms ±  32.4 ms    [User: 1231.6 ms, System: 765.2 ms]
  Range (min … max):   533.1 ms … 625.1 ms    10 runs
 
Summary
  /tmp/jj-cstr files no-match ran
    1.22 ± 0.08 times faster than /tmp/jj-baseline files no-match
@@ -25,7 +26,7 @@
 content_hash! {
     #[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Debug, Hash)]
     pub struct RepoPathComponent {
-        value: String,
+        value: CompactString,
     }
 }

A borrowed RepoPath<'a> type might also help.

yuja avatar Nov 02 '23 01:11 yuja

Speaking of that, I've been wondering if we should replace Merge::{adds,removes} by a single list to reduce memory use and allocations. Most of the time it's a single element. Using SmallVec there is probably also a good idea.

martinvonz avatar Nov 02 '23 01:11 martinvonz

I've been wondering if we should replace Merge::{adds,removes} by a single list to reduce memory use and allocations.

I considered that too (and have unfinished patches to store an interleaved list instead of adds/removes). That's doable, but we'll need to adjust the trivial_merge() interface.

yuja avatar Nov 02 '23 01:11 yuja

FWIW, the snapshotting suffers badly enough from lock contention in the allocator that using RAYON_NUM_THREADS=1 makes almost no difference in wall time, but saves a lot in user and sys time:

$ time jj show > /dev/null

real	0m0.233s
user	0m0.459s
sys	0m0.676s

$ time RAYON_NUM_THREADS=16 jj show > /dev/null

real	0m0.240s
user	0m0.554s
sys	0m0.621s

$ time RAYON_NUM_THREADS=8 jj show > /dev/null

real	0m0.223s
user	0m0.335s
sys	0m0.219s

$ time RAYON_NUM_THREADS=2 jj show > /dev/null

real	0m0.234s
user	0m0.187s
sys	0m0.102s

$ time RAYON_NUM_THREADS=1 jj show > /dev/null

real	0m0.259s
user	0m0.165s
sys	0m0.093s

Perf data (top 5 entries):

Default:
  32.57%  jj       [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
  14.86%  jj       jj                 [.] __lock
   3.46%  jj       [kernel.kallsyms]  [k] futex_wake
   3.42%  jj       [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
   3.39%  jj       jj                 [.] __unlock

Single threaded:
   6.11%  jj       [kernel.kallsyms]  [k] __lruvec_stat_mod_folio
   6.02%  jj       [kernel.kallsyms]  [k] syscall_exit_to_user_mode
   5.63%  jj       jj                 [.] memcpy
   5.50%  jj       jj                 [.] std::path::compare_components
   5.49%  jj       jj                 [.] globset::GlobSet::matches_candidate_into

dotdash avatar Jan 16 '25 11:01 dotdash

FWIW, this contention does not show up if I build jj locally and link against glibc.

dotdash avatar Jan 22 '25 11:01 dotdash

From kurt in the lazyjj channel on discord: https://nickb.dev/blog/default-musl-allocator-considered-harmful-to-performance/

In lazyjj having a locally built jj (with glibc) is so much more usable than the upstream build using musl in even medium-sized repositories.

Is there anything that prevents jj from just using jemalloc? Even if it's a stopgap solution?

dotdash avatar Feb 21 '25 00:02 dotdash

Jemalloc is officially abandoned now, so I don't think we should move to it for that reason alone.

steveklabnik avatar Dec 02 '25 23:12 steveklabnik

Jemalloc is officially abandoned now, so I don't think we should move to it for that reason alone.

I mean the upstream jemalloc is frozen; Facebook has a downstream fork which is still developed. Also, there aren't that many good allocators left.

PhilipMetzger avatar Dec 02 '25 23:12 PhilipMetzger

Ah, I hadn't heard that they kept development going, given that they were the ones working on it at the time it was archived! That is a little different, then, at least.

steveklabnik avatar Dec 03 '25 02:12 steveklabnik