Slowdown when using mimalloc to allocate (many) buffers
Hello,
I am playing with Apache Arrow, which uses mimalloc by default, and noticed some weird performance behavior that I could not explain and that goes away when switching to the system allocator. I extracted the core of the problem into the following reproducer:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

static constexpr std::size_t NumPages = 1000;
static constexpr std::size_t NumElements = 64 * 1024 + 1;

int main(int argc, char *argv[]) {
  // Allocate NumPages independent buffers of NumElements floats each.
  std::vector<std::unique_ptr<float[]>> pages;
  for (std::size_t i = 0; i < NumPages; i++) {
    pages.emplace_back(new float[NumElements]);
  }

  auto start = std::chrono::steady_clock::now();
  // Strided access: touch element e of every page before moving on to e + 1.
  for (std::size_t e = 0; e < NumElements; e++) {
    for (std::size_t i = 0; i < NumPages; i++) {
      pages[i][e] = 0;
    }
  }
  auto end = std::chrono::steady_clock::now();

  const std::chrono::duration<double> duration = end - start;
  std::cout << "run time: " << duration.count() << "\n";
  return 0;
}
Compiling with g++ -O2 pages.cxx -o pages and running gives:
$ taskset -c 2 ./pages
run time: 0.674548
However, with libmimalloc.so preloaded it is 10x slower:
$ LD_PRELOAD=install/lib64/libmimalloc.so taskset -c 2 ./pages
run time: 7.09013
This is on Alma Linux 8 with a somewhat dated glibc 2.28 and Linux kernel 4.18.0-553.51.1.el8_10.x86_64. However, I can also reproduce a 3x slowdown on Arch Linux with a recent kernel:
$ taskset -c 2 ./pages
run time: 0.66908
$ LD_PRELOAD=/usr/lib64/libmimalloc.so taskset -c 2 ./pages
run time: 1.91931
The numbers above are from the current main (or v2.2.4 when using the Arch Linux package), but I can equally reproduce with v1.9.4 and latest dev3.
@hahnjo
Hmm, interestingly enough, /usr/lib/libjemalloc.so seems to outperform both mimalloc and the system allocator (tho do note that I didn't pin the test to a single CPU, because I'm lazy):
$ LD_PRELOAD=/usr/lib/libjemalloc.so ./test
pages.capacity() = 1024
run time: 0.111928
$ LD_PRELOAD=/usr/lib/libmimalloc.so ./test
pages.capacity() = 1024
run time: 0.819127
$ ./test
pages.capacity() = 1024
run time: 0.253087
Tho it's also interesting that, for me, mimalloc is only about 3x slower than the system allocator.
Interesting observation, I can confirm. I'm interested in an explanation for the difference; however, I wanted to offer the following: the system allocator's advantage goes away when increasing the number of pages 10x. They both take on the order of 15 seconds for me at that point. This is a very rotten access pattern, as I'm sure you know.
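For reference (assuming the same reproducer as above), the only change needed is the constant, which brings the total allocation to roughly 2.6 GB:

static constexpr std::size_t NumPages = 10000; // 10x: ~10000 * 65537 * 4 B ≈ 2.6 GB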
@albi-k Also, do note that doing a manual loop interchange makes the performance differences negligible and reduces the runtime to below 100 ms:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

static constexpr std::size_t NumPages = 1000;
static constexpr std::size_t NumElements = 64 * 1024 + 1;

int main(int argc, char *argv[]) {
  std::vector<std::unique_ptr<float[]>> pages;
  for (std::size_t i = 0; i < NumPages; i++) {
    pages.emplace_back(new float[NumElements]);
  }

  auto start = std::chrono::steady_clock::now();
  // Interchanged loops: each page is now written sequentially, front to back.
  for (std::size_t i = 0; i < NumPages; i++) {
    for (std::size_t e = 0; e < NumElements; e++) {
      pages[i][e] = 0;
    }
  }
  auto end = std::chrono::steady_clock::now();

  const std::chrono::duration<double> duration = end - start;
  std::cout << "run time: " << duration.count() << "\n";
  return 0;
}
Interchanging the loops removes all performance differences because it makes the memory-access pattern more cache-friendly.
I'm now thinking that the difference in performance between malloc implementations is actually caused by CPU cache conflict misses; see: https://en.algorithmica.org/hpc/cpu-cache/associativity/
Interesting observation, I can confirm. I'm interested in an explanation for the difference; however, I wanted to offer the following: the system allocator's advantage goes away when increasing the number of pages 10x. They both take on the order of 15 seconds for me at that point.
Indeed, something in the parameter space seems to be hitting a "sweet spot" where mimalloc performs badly.
This is a very rotten access pattern, as I'm sure you know.
Yes I know, but it's somewhat representative of what we see when serializing data into a columnar format.
I'm now thinking that the difference in performance between malloc implementations is actually caused by CPU cache conflict misses; see: https://en.algorithmica.org/hpc/cpu-cache/associativity/
Likely; however, I would like to point out that neither malloc implementation is involved in the timing loop. Is it related to how the memory is allocated from the kernel and to memory policies/advice that influence how first touch works?
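To rule out the first-touch / page-fault side of things, one could pre-fault the buffers before taking the start timestamp in the reproducer above, e.g. by writing one float per OS page (a quick sketch, assuming 4 KiB pages):

// Insert right before the start timestamp: touch one float per 4 KiB OS page
// so that all page faults happen outside the timed loop.
constexpr std::size_t FloatsPerOsPage = 4096 / sizeof(float);
for (std::size_t i = 0; i < NumPages; i++) {
  for (std::size_t e = 0; e < NumElements; e += FloatsPerOsPage) {
    pages[i][e] = 0;
  }
}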
@hahnjo
I'm now thinking that the difference in performance between malloc implementations is actually caused by CPU cache conflict misses; see: https://en.algorithmica.org/hpc/cpu-cache/associativity/
Likely; however, I would like to point out that neither malloc implementation is involved in the timing loop. Is it related to how the memory is allocated from the kernel and to memory policies/advice that influence how first touch works?
Basically, what I'm saying is that the way different malloc implementations place the float arrays in memory is what's affecting CPU cache performance, since that dictates the physical and virtual memory addresses of said floats as seen by the CPU cache.
Basically, each pass of the benchmarked/timed inner loop only touches on the order of 1/8 to 1/64 of a CPU cache's worth of cache entries (one cache line per page). What makes the difference in performance in this case is how many of those cache entries remain in the CPU cache at the end of the inner loop, because the following passes of the outer loop revisit the same cache lines. And because modern CPU caches are set-associative, the I-th cache entry from pass N of the inner loop may be evicted during pass N+1 by the (I-x)-th cache entry, if the two cache entries resolve to the same set.
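To make that concrete, here is a minimal sketch that maps the address of element e of every page to a cache set and counts how many distinct sets are actually used. The cache geometry (64-byte lines, 1024 sets) is just an example, not a measurement of any particular CPU, and real L2/L3 caches index by physical rather than virtual address, so this is only an approximation:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <memory>
#include <vector>

// Example cache geometry (assumed values): 64-byte lines, 1024 sets.
static constexpr std::uintptr_t LineSize = 64;
static constexpr std::uintptr_t NumSets = 1024;

static constexpr std::size_t NumPages = 1000;
static constexpr std::size_t NumElements = 64 * 1024 + 1;

int main() {
  std::vector<std::unique_ptr<float[]>> pages;
  for (std::size_t i = 0; i < NumPages; i++)
    pages.emplace_back(new float[NumElements]);

  // For a fixed element index e, count how many buffers fall into each set.
  // If the allocator places every buffer at the same offset modulo the set
  // stride (LineSize * NumSets), they all compete for the same few ways.
  const std::size_t e = 0;
  std::map<std::uintptr_t, unsigned> perSet;
  for (std::size_t i = 0; i < NumPages; i++) {
    auto addr = reinterpret_cast<std::uintptr_t>(&pages[i][e]);
    perSet[(addr / LineSize) % NumSets]++;
  }
  std::printf("distinct sets used: %zu of %zu\n", perSet.size(),
              static_cast<std::size_t>(NumSets));
  return 0;
}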
The reason why I think this is what's making the performance difference is that running the test under perf stat showed higher cache misses (especially for L1 & L2), and I also found that the difference between the lowest and highest allocated float address was actually bigger under jemalloc compared to both mimalloc and glibc malloc.
I’m guessing that there’s some difference in something like alignment or some inserted padding somewhere w/ jemalloc that’s at play I’m also not sure whether jemalloc is intentionally designed to try to avoid placing same-sized allocations at powers-of-two distances apart from each other, in order to avoid cache conflict misses. Either way, if this is indeed due to cache conflict misses, this will definitely be tricky to mitigate without consuming a non-negligible amount of extra memory
Hmmm, I might also want to test with rpmalloc as well, since I think rpmalloc likes to over-align normal allocations.
EDIT: I think I found what jemalloc is doing to avoid cache misses: https://github.com/jemalloc/jemalloc/commit/8a03cf039cd06f9fa6972711195055d865673966
Note that the juicy part that does the index randomization was moved to jemalloc/include/jemalloc/internal/arena_inlines_b.h in commit https://github.com/jemalloc/jemalloc/commit/585f92505521136157aad8ac2e9288609127f863
It definitely seems that jemalloc's "cache index" randomization has the downside of causing some memory to be wasted: it seems that it can cause a large size-class allocation to consume up to an entire extra page.
For mimalloc, I'm thinking that you could adopt a more memory-efficient version of "cache index randomization" by, if this doesn't already exist, adding one bitfield variable to mimalloc's thread-local struct which holds the size class of the previous allocation. "Cache index randomization" would then only be applied if the size class of the currently requested malloc size is equal to the size class of the previous allocation (for the given thread). The idea is that allocations of the same size (class) that are requested consecutively are more likely to belong to the same vector/array/group/list, and that this pattern of cache misses probably occurs more frequently when iterating over allocations of similar sizes.
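A rough sketch of that idea, just to make it concrete (all of the names and the per-thread struct here are made up for illustration and are not actual mimalloc internals):

#include <cstddef>
#include <cstdlib>

// Hypothetical per-thread state; the real thread-local heap struct would
// gain one small field holding the previous allocation's size class.
struct ThreadHeap {
  unsigned last_size_class = ~0u;
  unsigned rng = 1; // cheap PRNG state for the line offset
};

// Stand-in for the allocator's real per-size-class block allocation.
static void *allocate_block(std::size_t bytes) { return std::malloc(bytes); }

void *alloc_with_conditional_randomization(ThreadHeap &heap,
                                           unsigned size_class,
                                           std::size_t size) {
  std::size_t offset = 0;
  if (size_class == heap.last_size_class) {
    // Consecutive allocations of the same size class likely belong to the
    // same vector/array and will be iterated together, so stagger their
    // start addresses by a random multiple of the cache-line size.
    heap.rng = heap.rng * 1103515245u + 12345u;
    offset = ((heap.rng >> 16) % 8) * 64; // shift by 0..7 cache lines
  }
  heap.last_size_class = size_class;
  char *block = static_cast<char *>(allocate_block(size + offset));
  return block ? block + offset : nullptr;
}

A real implementation would of course have to recover the original block pointer on free() and decide how much extra memory per block the shift is allowed to waste, which is exactly the trade-off mentioned above.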
Note: it seems that the URL in the jemalloc GitHub issue about "cache index randomization" linking to the original paper is dead. Here's a working link: https://dl.acm.org/doi/10.1145/1993478.1993486
Yes, I was partially aware of a potential interplay with set associativity. That's why I made it NumElements = 64 * 1024 + 1, in the hope that the additional element shifts the start offset for every allocation.
However, I just found that mimalloc heavily overaligns allocations: if I print the allocated pointers, they are all aligned on 0x(1)0000 boundaries (equivalent to 16384 elements). For system malloc (on EL8), the alignment is 0x(1)000 (with a shift of 0x10), which is equivalent to 1024 elements.
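For reference, something along these lines shows the alignment (a minimal sketch):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <memory>
#include <vector>

int main() {
  static constexpr std::size_t NumPages = 1000;
  static constexpr std::size_t NumElements = 64 * 1024 + 1;
  std::vector<std::unique_ptr<float[]>> pages;
  for (std::size_t i = 0; i < NumPages; i++) {
    pages.emplace_back(new float[NumElements]);
    auto addr = reinterpret_cast<std::uintptr_t>(pages.back().get());
    // Print the pointer and its offset within a 64 KiB window; with mimalloc
    // preloaded the low 16 bits are all zero for these ~256 KiB allocations.
    std::printf("%4zu: %p (addr %% 0x10000 = 0x%llx)\n", i, (void *)addr,
                (unsigned long long)(addr & 0xFFFF));
  }
  return 0;
}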
If I have some time, I will try overallocating and then manually shifting the offset within the returned memory region. I would expect this to "resolve" the difference in performance.
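Roughly what I have in mind (a quick sketch; the per-buffer shift of i cache lines is arbitrary, the point is only that the start addresses no longer share the same offset modulo a large power of two):

#include <cstddef>
#include <memory>
#include <vector>

static constexpr std::size_t NumPages = 1000;
static constexpr std::size_t NumElements = 64 * 1024 + 1;
static constexpr std::size_t LineSize = 64; // bytes

int main() {
  // Keep the real allocations alive, but hand out staggered pointers so that
  // element e of page i does not map to the same cache set for every i.
  std::vector<std::unique_ptr<char[]>> storage;
  std::vector<float *> pages;
  for (std::size_t i = 0; i < NumPages; i++) {
    const std::size_t shift = (i % 64) * LineSize; // up to ~4 KiB of stagger
    storage.emplace_back(new char[NumElements * sizeof(float) + shift]);
    pages.push_back(reinterpret_cast<float *>(storage.back().get() + shift));
  }
  // ... run the same timed loops over pages[i][e] as in the reproducer ...
  return 0;
}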
@hahnjo It appears that the most active mimalloc development branch, dev3-cdb, has recently merged some commits to reduce memory waste, which I would assume may mean that some allocations would be less overaligned:
https://github.com/microsoft/mimalloc/commits/dev3-cdb/
haven't checked it myself tho...