snmalloc
metrics of snmalloc
Hi, I am implementing snmalloc support for an analytical database engine. Everything works fine and the performance is really good, but I have a problem with producing proper statistics from snmalloc:
Basically, I want something like resident memory and (de)commit information. Details like the allocation size distribution would also be helpful, but they are not essential.
So I mimicked the way snmalloc prints out its own stats and wrote some code:
{
    snmalloc::Stats stats;
    snmalloc::current_alloc_pool()->aggregate_stats(stats);
    using namespace snmalloc;
    size_t current = 0;
    size_t total = 0;
    size_t max = 0;
    static size_t large_alloc_max[NUM_LARGE_CLASSES]{0};
    // Per-sizeclass counters for small allocations.
    for (sizeclass_t i = 0; i < NUM_SIZECLASSES; i++)
    {
        if (stats.sizeclass[i].count.is_unused())
            continue;
        stats.sizeclass[i].addToRunningAverage();
        auto size = sizeclass_to_size(i);
        set(fmt::format("snmalloc.bucketed_stat_size_{}_current", size), stats.sizeclass[i].count.current);
        set(fmt::format("snmalloc.bucketed_stat_size_{}_max", size), stats.sizeclass[i].count.max);
        set(fmt::format("snmalloc.bucketed_stat_size_{}_total", size), stats.sizeclass[i].count.used);
        set(fmt::format("snmalloc.bucketed_stat_size_{}_average_slab_usage", size), stats.sizeclass[i].online_average);
        set(fmt::format("snmalloc.bucketed_stat_size_{}_average_wasted_space", size),
            (1.0 - stats.sizeclass[i].online_average) * stats.sizeclass[i].slab_count.max);
        current += stats.sizeclass[i].count.current * size;
        total += stats.sizeclass[i].count.used * size;
        max += stats.sizeclass[i].count.max * size;
    }
    // Large allocations: derive live bytes from the pop/push delta.
    for (uint8_t i = 0; i < NUM_LARGE_CLASSES; i++)
    {
        if ((stats.large_push_count[i] == 0) && (stats.large_pop_count[i] == 0))
            continue;
        auto size = large_sizeclass_to_size(i);
        set(fmt::format("snmalloc.large_bucketed_stat_size_{}_push_count", size), stats.large_push_count[i]);
        set(fmt::format("snmalloc.large_bucketed_stat_size_{}_pop_count", size), stats.large_pop_count[i]);
        auto large_alloc = (stats.large_pop_count[i] - stats.large_push_count[i]) * size;
        large_alloc_max[i] = std::max(large_alloc_max[i], large_alloc);
        current += large_alloc;
        total += stats.large_push_count[i] * size;
        max += large_alloc_max[i];
    }
    // Global counters.
    set("snmalloc.global_stat_remote_freed", stats.remote_freed);
    set("snmalloc.global_stat_remote_posted", stats.remote_posted);
    set("snmalloc.global_stat_remote_received", stats.remote_received);
    set("snmalloc.global_stat_superslab_pop_count", stats.superslab_pop_count);
    set("snmalloc.global_stat_superslab_push_count", stats.superslab_push_count);
    set("snmalloc.global_stat_segment_count", stats.segment_count);
    set("snmalloc.global_stat_current_size", current);
    set("snmalloc.global_stat_total_size", total);
    set("snmalloc.global_stat_max_size", max);
}
I don't know, but maybe the above method creates too many entries in the summary? Any suggestions for creating more concise async metrics for the allocator?
Another thing is that it is probably not a good idea to print out the statistics after thread exit in this situation.
Another interesting part is that, if I enable stats, the dealloc routine gets dominated by the costly running-average calculation. (I cannot provide further stack traces since the product has not been released as open source yet; sorry about that.)
So those statistics are pretty heavyweight, and were not designed for production; they are more for working out what snmalloc is doing wrong, and they have not really been maintained. There are very coarse statistics available from
https://github.com/microsoft/snmalloc/blob/6e638742e3c66549174d4c264bd05c9435938ac1/src/override/malloc-extensions.cc#L7-L12
This might be sufficient for what you are after. It is tracked all the time and is very cheap; it was considered the bare minimum for some other services.
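As a sketch of how those coarse statistics can be consumed (field and header names taken from the linked malloc-extensions file; they may differ between snmalloc versions, so treat this as an assumption):

```cpp
// Sketch: reading snmalloc's cheap, always-on counters.
// Assumes the malloc-extensions header from the snmalloc repo is on the
// include path; not runnable without linking against snmalloc.
#include "malloc-extensions.h"
#include <cstdio>

void report_allocator_usage()
{
    malloc_info_v1 info;
    get_malloc_info_v1(&info);
    // Two coarse counters: current and peak memory usage of the allocator.
    std::printf("snmalloc current=%zu peak=%zu\n",
                info.current_memory_usage, info.peak_memory_usage);
}
```

This is far cheaper than walking every sizeclass on each metrics tick, and avoids touching the per-sizeclass running averages that showed up in the dealloc path.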
With the rewrite on the snmalloc2 branch, I am about to investigate statistics tracking. So if you have requirements, I will try to work them into what I build.
Let me provide some records from my side. This is from an analytical database engine (a single node in this case). It took up almost all the system memory on Linux (as snmalloc won't madvise the pages back).
The problem is that the server itself uses mmap/mremap for large allocations to get a potential speedup from OS paging, so I am very concerned about having this no-decommit pattern in a production environment.
@schrodingerZhu are you able to try #404 for your use case? This is getting pretty stable now, and should address your concern about holding on to OS memory.
What is the green line showing in the graph? RSS or Virtual memory usage?
According to the name of the metric, it should be RSS. htop also shows memory usage for my program similar to the green line.
Since all of my work is experimental right now, I would like to give snmalloc 2 and #404 a try. I will report back with the changes in performance and the metrics.
Thanks for the suggestions! The above results were still on snmalloc 1, and there was a tens-of-seconds performance improvement on some TPC-H workloads when switching from jemalloc to snmalloc, which really astonished me. Let's see what we can get with snmalloc 2.
I believe #404 is working, since we can now see drops in the RSS curve.
However, in this case: as you can see, after some peaks in the memory curve (it tried to acquire more than 169GiB!), the stats suddenly went to zero with snmalloc 2. This means the engine was killed for OOM. Ouch, this is bad; with snmalloc 1, though the space is not decommitted, I didn't experience OOMs.
The performance degradation of snmalloc 2 was still there: for the successful trials, I could see a 100% slowdown (from 30s to 1min) for some particular queries. I may provide some flamegraphs of the snmalloc stacks when they are ready.
Oops, since I was running this on kernel 3.10, I guess madvise with MADV_DONTNEED was much heavier than I had expected.
I think it is madvise that took all the extra running time (up to 30s for that query) in this case. I am going to look into consolidating calls to madvise, which will hopefully reduce this cost.
So did it work in terms of reducing the memory usage, or did it regress the memory usage and hit OOM? It wasn't clear from your message.
- I can see the memory usage decreasing now, so madvise is working.
- Even with that decrease, I got a regression to OOM compared with snmalloc 1.
@schrodingerZhu would you be able to run this experiment again with the latest main branch? I have done a lot of work on bringing down the footprint, most examples are very close to snmalloc 1 now, so would be interested to know if I have fixed this.