Fixed range relying on lazy commit?
For limiting some subsystems to a maximum allocation size, I was experimenting with the fixed range configuration, as it seemed particularly suited to my use case. I ran into some problems during initialization though, as it seems to access a memory range that has not been committed yet.
Looking at the tests for it, it seems that it relies on everything being committed beforehand.
From the test:
```cpp
auto size = bits::one_at_bit(28);
auto oe_base = DefaultPal::reserve(size);
DefaultPal::notify_using<NoZero>(oe_base, size); // <--- Committing the entire range
auto oe_end = pointer_offset(oe_base, size);
std::cout << "Allocated region " << oe_base << " - "
          << pointer_offset(oe_base, size) << std::endl;

CustomGlobals::init(nullptr, oe_base, size);
```
I'm trying to run this on a platform that does not have lazy commit, so calling `notify_using` on the entire range would fully commit all pages in the range, which is undesirable.
Main questions:
- Is this behavior (i.e. the need for the range to be committed beforehand) intentional and desired?
- Is there currently a supported way around this, so that the allocator will actively call `notify_using` on the bits that it uses for the pagemap?
I've only just started looking into snmalloc and I'm impressed by the clean code and high level of configurability, so I am kind of hoping this can be set up in the Config without having to change the backend too much :)
> Is this behavior (i.e. the need for the range to be committed beforehand) intentional and desired?
So that particular configuration was added to support SGX enclaves. They don't have any notion of commit, so all the `notify_using`/`notify_not_using` calls are basically no-ops, or just zeroing if required. There is no requirement for the range to be committed beforehand, and the layers in front will call `notify_using`/`notify_not_using` in all the places that are required on platforms without lazy commit, like Windows.
> Is there currently a supported way around this so that the allocator will actively call `notify_using` on the bits that it uses for the pagemap?
So in the fixed range config, it will calculate the space required for the pagemap:
https://github.com/microsoft/snmalloc/blob/ccc03ce0fc1f6ab6f8c2fdb613572a87e97ec7e1/src/snmalloc/backend/fixedglobalconfig.h#L91-L92
and then call `notify_using` on the whole range used for the pagemap on initialisation:
https://github.com/microsoft/snmalloc/blob/ccc03ce0fc1f6ab6f8c2fdb613572a87e97ec7e1/src/snmalloc/backend/fixedglobalconfig.h#L102
The rest of the supplied range will have notify_using called as required for allocation of objects.
We need the pagemap to exist for the whole range, as we use it as the physical memory for the buddy allocators/red-black trees that track the unused but snmalloc-owned memory.
> so I am kind of hoping this can be set up in the Config without having to change the backend too much :)
Based on your description, you probably need a slightly modified Pal for your platform, one that calls what you need for `notify_using`/`notify_not_using`, and that always fails on `reserve`. This doesn't exist yet, but it is pretty easy to extract from the existing Pals.
So I would start with: https://github.com/microsoft/snmalloc/blob/ccc03ce0fc1f6ab6f8c2fdb613572a87e97ec7e1/src/snmalloc/pal/pal_noalloc.h
This forwards some operations to the underlying Pal for the system:
https://github.com/microsoft/snmalloc/blob/ccc03ce0fc1f6ab6f8c2fdb613572a87e97ec7e1/src/snmalloc/pal/pal_noalloc.h#L54-L57
but, importantly (and not what you want), it doesn't forward `notify_using` and friends:
https://github.com/microsoft/snmalloc/blob/ccc03ce0fc1f6ab6f8c2fdb613572a87e97ec7e1/src/snmalloc/pal/pal_noalloc.h#L67-L90
If you changed these to also call into the BasePal, then I think it would work.
If you instantiate a FixedConfig with that Pal everything "should just work" in principle. However, doing something new normally throws up issues, so please let me know how you get on.
> I've only just started looking into snmalloc and I'm impressed by the clean code and high level of configurability,
Thanks. Writing it in C++ has helped a lot in making the configurability much cleaner than the #ifdef mess that typically ends up in C code bases.
Thanks for the quick reply
I tried a few things in the fixed_region.cc test code, just running on normal Windows for now.
One of them was indeed using DefaultPal directly instead of PalNoAlloc<DefaultPal>, as it would need to make those commit/release calls. Second, the `notify_using` in the top-level code needs to go to avoid committing everything, so I removed it:
```cpp
auto size = bits::one_at_bit(28);
auto oe_base = DefaultPal::reserve(size);
// DefaultPal::notify_using<NoZero>(oe_base, size);
auto oe_end = pointer_offset(oe_base, size);
```
Adding some debug info in the PAL gives me this:
```
reserve      0x25a97cc0000 - 0x25aa7cc0000 : 268435456 bytes  <-- the initial reserve for the entire range, called from fixed_region.cc
notify_using 0x25b2e71f000 - 0x25b2e760000 : 266240 bytes     <-- the register_range for the concrete pagemap
```
The pagemap `register_range` seems to shift things out of the reserved range, causing a commit on an address that hasn't been reserved -> error.
From `ds\pagemap.h`:
```cpp
void register_range(address_t p, size_t length)
{
  SNMALLOC_ASSERT(is_initialised());

  // Calculate range in pagemap that is associated to this space.
  auto first = &body[p >> SHIFT];
  auto last = &body[(p + length + bits::one_at_bit(SHIFT) - 1) >> SHIFT];

  // Commit OS pages associated to the range.
  auto page_start = pointer_align_down<OS_PAGE_SIZE, char>(first);
  auto page_end = pointer_align_up<OS_PAGE_SIZE, char>(last);
  size_t using_size = pointer_diff(page_start, page_end);
  PAL::template notify_using<NoZero>(page_start, using_size);
```
When hacking `notify_using` to do a RESERVE | COMMIT, just to check whether it's simply a forgotten reserve, it still crashed right in the next bit of the calling code, on line 106.
From `backend_helpers\pagemap.h`:
```cpp
static void register_range(capptr::Arena<void> p, size_t sz)
{
  concretePagemap.register_range(address_cast(p), sz);

  if constexpr (!CONSOLIDATE_PAL_ALLOCS)
  {
    // Mark start of allocation in pagemap.
    auto& entry = get_metaentry_mut(address_cast(p));
    entry.set_boundary(); // <-- **SEGFAULT**
  }
}
```
Here `entry` is pointing to the beginning of the range, which has been reserved but not committed.
Example output:
```
reserve      0x25a97cc0000 - 0x25aa7cc0000 : 268435456 bytes
notify_using 0x25b2e71f000 - 0x25b2e760000 : 266240 bytes
Exception thrown: read access violation. this was 0x25A97CC0100.
```
I'm not entirely sure what the `body` really is or should be, and why it would be outside of the reserved arena region. Any ideas?
This is precisely the kind of thing I expected with my "However, doing something new normally throws up issues, so please let me know how you get on.".
The pagemap `register_range` code needs a second path for `has_bounds == true`. The following bit of code assumes `has_bounds == false`:
https://github.com/microsoft/snmalloc/blob/ccc03ce0fc1f6ab6f8c2fdb613572a87e97ec7e1/src/snmalloc/ds/pagemap.h#L62-L64
Other code paths have something like: https://github.com/microsoft/snmalloc/blob/ccc03ce0fc1f6ab6f8c2fdb613572a87e97ec7e1/src/snmalloc/ds/pagemap.h#L300-L318
I believe adding something like:
```cpp
if constexpr (has_bounds)
{
  if ((p - base >= size) || (p + length - base >= size))
  {
    error("Attempting to commit outside of fixed range");
  }
  // Correct for index into pagemap with bounds.
  p = p - base;
}
```
to that function should work. If this fixes your issue, I'd love to receive a PR. Thanks!
Yup, that works!
Amazing, thanks. I'll have a PR on the way later today.
Some follow-up questions, by the way, if you have the time:
- Now that this fixed-range allocation works, I would need multiple instances of it. But of course, there are a lot of statics in the types that store the actual data. What I've been doing is parameterizing my Config, making a separate type for each heap, which also parameterizes any underlying type; I think that should cover this use case. Is this assumption correct? Or are there some cases that would need to be separated somehow, to avoid different heaps being mixed up? Also, would there be a way to create heaps at runtime, i.e. have the metadata somehow not be a global for a type?
- I would also need to customize my Pal for each heap, as the mapping/unmapping is slightly different for each heap. I've been feeding in a custom Pal that reads from its Config. Is this the intended way? I made sure the DefaultPal was an empty stub, just to cover cases that use DefaultPal instead of a passed-in type.
- And finally, I still bumped into some OS_PAGE_SIZE usage (which gets the page size from the DefaultPal) instead of the heap-specific Pal. I'm guessing this might also just be something that could be changed, as the custom Pal should be available at all call sites as far as I can see?
Thanks
> Now that this fixed-range allocation works, I would need multiple instances of it. But of course, there's a lot of statics in the types that store the actual data. What I've been doing is parameterize my Config, making a separate type for each heap, which also parameterizes any underlying type, which I think should cover this use-case. Is this assumption correct? Or are there some cases that would need to be separate somehow as to avoid different heaps being mixed up?
A separate type for each heap should work. That is what I have been thinking about for building a partition-alloc-like feature.
> Also, would there be a way to create heaps at runtime, i.e. have the metadata somehow not be a global for a type?
Making it dynamic is within reach of the design, but not something I have done yet. We could create an allocator pool where all allocators use the same backend range, but this would require some complex changes that I haven't got time for at the moment. It is on our work list, though.
It would also be interesting to share the pagemap across all the different heaps, rather than having a page map for each one.
> I would also need to customize my PAL for each heap as the mapping/unmapping is slightly different for each heap. I've been feeding a custom PAL that reads from its Config. Is this the intended way? I made sure the DefaultPAL was an empty stub, just to cover cases that use DefaultPal instead of a passed in type.
That seems reasonable. I haven't considered having many different Pals. Could you elaborate on what the differences are between each heap that makes the Pal differ?
> And finally, I still bumped into some OS_PAGE_SIZE usage (which gets the page-size from the DefaultPal), instead of using the heap-specific Pal. I'm guessing this might also just be something that could be done, as it should have the custom PAL available on all call-sites as far as I can see?
I would like to move this to a more dynamic value obtained from the current Pal. Some platforms have a variable page size that can be configured by environment variables. Ultimately, I would like that as a feature, but it hasn't been a priority yet.
> It would also be interesting to share the pagemap across all the different heaps, rather than having a page map for each one.
That could be a useful saving indeed, but for my use case at least, we would need the option to keep them separate as well.
> Could you elaborate on what the differences are between each heap that makes the Pal differ?
We're running on unified memory platforms where there are different memory types, and we'd need to specify the type in the map/unmap calls. These types have different properties, such as bandwidth and which physical chips they are stored in. Alignment requirements might also vary per type, for example. So we would have one heap type with "Type A" memory and a 16K page size, and one heap type with "Type B" memory and a 64K or 2M page size.
This is also why sharing a pagemap would not work for this specific case: the memory is actually physically different, and we cannot mix allocations made for a heap of Type A with allocations for Type B in the same mapped page.
In other cases, though, where we would have multiple runtime instances of the same heap type to limit and localize allocations, sharing a pagemap would be fine.
One thing I'm noticing, by the way, is that using the FixedRange configuration costs us considerably more memory. In one application we're hitting ~900MB instead of ~600MB; in another we're at 7GB instead of just over 2GB.
Is that expected or can it at least be explained?
It looks like empty blocks aren't being merged with what appear to be their buddy counterparts, nor being unmapped. Is there still some assumption that everything is pre-committed, for the SGX enclave perhaps, and thus wouldn't need uncommitting?
> Is that expected or can it at least be explained?
Not expected.
> It looks like empty blocks aren't being merged with what appears to be their buddy counter-part, nor being unmapped.
If you could provide me a repro of that I would be very interested.
> Is there still some assumption that everything is pre-committed for the SGX enclave perhaps and thus wouldn't need uncommitting?
It should be decommitting properly; it is only the Pal that controls whether decommit happens or not. Obviously, there could be errors in untested combinations.
Finding a repro that I can share might take a bit of effort, but I'll see what I can do. In the meantime, I can share these screenshots from our heap visualizer.
The diagonally striped areas mean the memory is mapped. The red areas are active allocations. You can still see the lines around the blocks that have been mapped.
First, the version with the standard configuration, no fixed range. Some black gaps indicate unmapped memory
The same test case with a fixed range configured. No black gaps anywhere, even though there are plenty of blocks that have no active allocations and could be released. This is using about 300MB more.
Another test case went from 2GB to 7GB when using a fixed-range arena, so it's quite substantial. It looks like things are just not being released, nor even re-used.
I haven't traced deallocations yet, as I had to work on something else for a bit. Somewhere the path taken must deviate somehow.
So this is really interesting. I previously had an issue with a service that seemed to experience fragmentation over time on SGX. I thought it was due to thread caching just using more memory than the limited SGX machines had, so I disabled the thread caching of "chunks" in the backend, and the problem went away. But I wonder if there is something else going on, as this looks similar.
How many threads are you using? What is the thread usage pattern: N long-term threads, or creating and removing lots of threads?
Matt
Also, if you have a chance, would you be able to try #751? That has a simpler thread-local state model.
We have about a dozen threads, all long-living.
There shouldn't be any teardown of the thread-allocators happening, as all threads are basically permanent (created at startup of the application, destroyed at shutdown).
I can try and give that patch a spin early next week, thanks
I tried the patch, and it doesn't really seem to matter much, I'm afraid.
One thing I noticed, though: as I'm still testing things, I was just throwing everything into snmalloc, including a single 3GB allocation. Due to its buddy behavior, that alone eats up an extra 1GB.
It's not related to the bigger footprint, though. Even in another test where it's mostly small allocations, things are scattered around a lot more compared to the global config, while everything still stays mapped.
I'll see if I can find some time to make an isolated repro case that doesn't involve running our entire application.
Thanks. I have been looking at resurrecting an old PR I never landed: https://github.com/microsoft/snmalloc/pull/616
If I get this working nicely again, then we might be able to trace your memory usage to get some ideas of where things are going wrong.
@Trithek I have updated #616 to be usable again. It has a function
`snmalloc::print_alloc_stats()`
which should dump to stderr some CSV of various stats that could be used to debug the issue. If you were able to dump that periodically during the application run, it would help work out where the problem might be.
Ah, great, much appreciated. I won't have too much time this week, but next week I should be able to give this a try, I think. Thanks!