
metrics of snmalloc

Open SchrodingerZhu opened this issue 2 years ago • 12 comments

Hi, I am implementing snmalloc support for an analytical database engine. Everything works fine and the performance is really cool, but I have a problem with creating proper statistics for snmalloc:

Basically, I want something like resident memory and (de)commit information. Details like the allocation size distribution could also be helpful, but they are not essential.

So I mimicked the way snmalloc prints out its own stats and wrote some code:

    {
        snmalloc::Stats stats;
        snmalloc::current_alloc_pool()->aggregate_stats(stats);

        using namespace snmalloc;

        size_t current = 0;
        size_t total = 0;
        size_t max = 0;
        // High-water mark of live large allocations, kept across calls.
        static size_t large_alloc_max[NUM_LARGE_CLASSES]{0};

        // Per-sizeclass counters: live/max/total object counts and slab usage.
        for (sizeclass_t i = 0; i < NUM_SIZECLASSES; i++)
        {
            if (stats.sizeclass[i].count.is_unused())
                continue;

            stats.sizeclass[i].addToRunningAverage();

            auto size = sizeclass_to_size(i);
            set(fmt::format("snmalloc.bucketed_stat_size_{}_current", size), stats.sizeclass[i].count.current);
            set(fmt::format("snmalloc.bucketed_stat_size_{}_max", size), stats.sizeclass[i].count.max);
            set(fmt::format("snmalloc.bucketed_stat_size_{}_total", size), stats.sizeclass[i].count.used);
            set(fmt::format("snmalloc.bucketed_stat_size_{}_average_slab_usage", size), stats.sizeclass[i].online_average);
            set(fmt::format("snmalloc.bucketed_stat_size_{}_average_wasted_space", size),
                (1.0 - stats.sizeclass[i].online_average) * stats.sizeclass[i].slab_count.max);
            current += stats.sizeclass[i].count.current * size;
            total += stats.sizeclass[i].count.used * size;
            max += stats.sizeclass[i].count.max * size;
        }

        // Large allocations are tracked via push/pop counts per large size class.
        for (uint8_t i = 0; i < NUM_LARGE_CLASSES; i++)
        {
            if ((stats.large_push_count[i] == 0) && (stats.large_pop_count[i] == 0))
                continue;

            auto size = large_sizeclass_to_size(i);
            set(fmt::format("snmalloc.large_bucketed_stat_size_{}_push_count", size), stats.large_push_count[i]);
            set(fmt::format("snmalloc.large_bucketed_stat_size_{}_pop_count", size), stats.large_pop_count[i]);
            auto large_alloc = (stats.large_pop_count[i] - stats.large_push_count[i]) * size;
            large_alloc_max[i] = std::max(large_alloc_max[i], large_alloc);
            current += large_alloc;
            total += stats.large_push_count[i] * size;
            max += large_alloc_max[i];
        }

        set("snmalloc.global_stat_remote_freed", stats.remote_freed);
        set("snmalloc.global_stat_remote_posted", stats.remote_posted);
        set("snmalloc.global_stat_remote_received", stats.remote_received);
        set("snmalloc.global_stat_superslab_pop_count", stats.superslab_pop_count);
        set("snmalloc.global_stat_superslab_push_count", stats.superslab_push_count);
        set("snmalloc.global_stat_segment_count", stats.segment_count);
        set("snmalloc.global_stat_current_size", current);
        set("snmalloc.global_stat_total_size", total);
        set("snmalloc.global_stat_max_size", max);
    }

I don't know, but maybe the above method creates too many entries in the summary?

And any suggestions on creating more concise async metrics for the allocator?
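
To make the question concrete, this is roughly the shape of export I have in mind: a background thread that only publishes a few allocator-wide counters instead of one entry per size class. It reuses the `Stats`/`aggregate_stats` calls and my own `set()` helper from the snippet above, so treat it as an illustration rather than working code:

    #include <atomic>
    #include <chrono>
    #include <thread>

    // Periodically publish a handful of allocator-wide counters.
    // Reuses snmalloc::Stats and the set() metric helper from the snippet above.
    void metrics_loop(std::atomic<bool>& stop)
    {
        while (!stop.load(std::memory_order_relaxed))
        {
            snmalloc::Stats stats;
            snmalloc::current_alloc_pool()->aggregate_stats(stats);

            // Coarse counters only; skip the per-sizeclass breakdown to keep
            // the number of metric entries small.
            set("snmalloc.global_stat_remote_freed", stats.remote_freed);
            set("snmalloc.global_stat_superslab_pop_count", stats.superslab_pop_count);
            set("snmalloc.global_stat_superslab_push_count", stats.superslab_push_count);
            set("snmalloc.global_stat_segment_count", stats.segment_count);

            std::this_thread::sleep_for(std::chrono::seconds(10));
        }
    }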

SchrodingerZhu avatar Oct 24 '21 13:10 SchrodingerZhu

Another thing: it is probably not a good idea to print out the statistics after thread exit in this situation.

SchrodingerZhu avatar Oct 24 '21 14:10 SchrodingerZhu

[image] Another interesting finding is that, if I enable stats, the dealloc routine is dominated by the costly average calculation. (I cannot provide further stack traces since the product has not been released as open source yet; sorry about that.)

SchrodingerZhu avatar Oct 25 '21 06:10 SchrodingerZhu

So those statistics are pretty heavyweight and were not designed for production use; they are more for working out what snmalloc is doing wrong, and they have not really been maintained. There are very coarse statistics available from

https://github.com/microsoft/snmalloc/blob/6e638742e3c66549174d4c264bd05c9435938ac1/src/override/malloc-extensions.cc#L7-L12

This might be sufficient for what you are after. It is tracked all the time and is very cheap; it was considered the bare minimum for some other services.
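
Rough usage sketch from memory (please check `malloc-extensions.h` for the exact struct and field names):

    #include <cstdio>

    #include "malloc-extensions.h"  // from src/override/; adjust the path to however you consume snmalloc

    void report_coarse_usage()
    {
        // malloc_info_v1 carries two counters: bytes currently in use by the
        // allocator and the peak since process start.
        malloc_info_v1 info;
        get_malloc_info_v1(&info);
        printf("snmalloc current: %zu bytes, peak: %zu bytes\n",
               info.current_memory_usage, info.peak_memory_usage);
    }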

With the rewrite on the snmalloc2 branch, I am about to investigate statistics tracking. So if you have requirements, I will try to work them into what I build.

mjp41 avatar Oct 25 '21 08:10 mjp41

[image] I can provide some records from my side. This is from an analytical database engine (single node in this case). It took up almost all the system memory on Linux (as it won't madvise it back). The problem is that the server itself uses mmap/mremap for large allocations to get a potential speedup from OS paging, so I am quite concerned about having this de-commit pattern in a production environment.

SchrodingerZhu avatar Oct 25 '21 11:10 SchrodingerZhu

@schrodingerZhu are you able to try #404 for your use case? It is getting pretty stable now, and should address your concern about holding on to OS memory.

What is the green line showing in the graph? RSS or Virtual memory usage?

mjp41 avatar Oct 25 '21 14:10 mjp41

According to the name of the metric, it should be RSS. I can also see htop showing memory usage for my program similar to the green line.
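
(In case it is useful, a quick way to double-check is reading resident pages straight from procfs; a small Linux-only sketch:)

    #include <cstdio>
    #include <unistd.h>

    // Current RSS of this process in bytes, or 0 on failure.
    // Parses the second field (resident pages) of /proc/self/statm.
    size_t read_rss_bytes()
    {
        long total = 0, resident = 0;
        FILE* f = fopen("/proc/self/statm", "r");
        if (f == nullptr)
            return 0;
        if (fscanf(f, "%ld %ld", &total, &resident) != 2)
            resident = 0;
        fclose(f);
        return static_cast<size_t>(resident) * static_cast<size_t>(sysconf(_SC_PAGESIZE));
    }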

SchrodingerZhu avatar Oct 25 '21 15:10 SchrodingerZhu

Since all of my work right now is experimental, I would like to give snmalloc 2 and #404 a try. I can also report back the changes in performance and in the metrics.

Thanks for the suggestions! The above results were still on snmalloc 1, and there was something like a tens-of-seconds improvement on some TPC-H workloads when switching from jemalloc to snmalloc, which really astonished me. Let's see what we can get with snmalloc 2.

SchrodingerZhu avatar Oct 25 '21 16:10 SchrodingerZhu

I believe #404 is working, since we can now see drops in the RSS curve.

However, in this case: [image]

As you can see, after some peaks in the memory curve (it tried to acquire more than 169 GiB!), the stats suddenly went to zero with snmalloc 2, which means the engine was killed by the OOM killer. Ouch, this is bad; with snmalloc 1, even though the space was not de-committed, I did not experience OOMs.

The performance degradation with snmalloc 2 was also still there: for the successful trials, I could see a 100% slowdown (from 30s to 1min) for some particular queries. I may provide some flamegraphs of the snmalloc stacks when they are ready.

SchrodingerZhu avatar Oct 26 '21 07:10 SchrodingerZhu

[image] [image]

Oops, since I was running this on kernel 3.10, I guess madvise with MADV_DONTNEED is much heavier than I had expected.

I think it is madvise that took all the extra running time (up to 30s for that query) in this case.
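
(If it helps, this is the kind of micro-benchmark I would run on that kernel to confirm the cost; just a sketch, unrelated to the engine's real code:)

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <sys/mman.h>

    int main()
    {
        constexpr size_t len = size_t(1) << 30;  // 1 GiB
        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        // Touch every page so the range is resident before we de-commit it.
        memset(p, 1, len);

        auto begin = std::chrono::steady_clock::now();
        madvise(p, len, MADV_DONTNEED);
        auto end = std::chrono::steady_clock::now();

        auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
        printf("madvise(MADV_DONTNEED) over 1 GiB took %lld us\n", static_cast<long long>(us));

        munmap(p, len);
        return 0;
    }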

SchrodingerZhu avatar Oct 26 '21 07:10 SchrodingerZhu

I am going to look into consolidating calls to madvise, which will hopefully reduce this cost.
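
Roughly, the idea is to sort the ranges to be de-committed and merge adjacent ones, so each contiguous run costs one syscall instead of many; an illustrative sketch (not the actual snmalloc code):

    #include <algorithm>
    #include <sys/mman.h>
    #include <utility>
    #include <vector>

    // De-commit a batch of [start, end) ranges with as few madvise calls as
    // possible by sorting them and merging ranges that touch or overlap.
    void decommit_ranges(std::vector<std::pair<char*, char*>> ranges)
    {
        std::sort(ranges.begin(), ranges.end());
        size_t i = 0;
        while (i < ranges.size())
        {
            char* start = ranges[i].first;
            char* end = ranges[i].second;
            // Extend the current run while the next range is contiguous or overlapping.
            while (i + 1 < ranges.size() && ranges[i + 1].first <= end)
            {
                end = std::max(end, ranges[i + 1].second);
                ++i;
            }
            madvise(start, static_cast<size_t>(end - start), MADV_DONTNEED);
            ++i;
        }
    }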

So did it work in terms of reducing memory usage, or did it regress memory usage and get OOM-killed? It wasn't clear from your message.

mjp41 avatar Oct 26 '21 08:10 mjp41

  • I can see the memory usage decreasing now, so madvise is working.
  • Even with that decrease, I still got an OOM regression compared with snmalloc 1.

SchrodingerZhu avatar Oct 26 '21 08:10 SchrodingerZhu

@schrodingerZhu would you be able to run this experiment again with the latest main branch? I have done a lot of work on bringing down the footprint; most examples are very close to snmalloc 1 now, so I would be interested to know if I have fixed this.

mjp41 avatar Mar 16 '22 11:03 mjp41