ginkgo
Cache-line aligned allocation for OpenMP
Currently, the OpenMP raw_alloc just uses malloc. We would usually want allocated memory blocks to be aligned to cache-line boundaries, typically 64 bytes. Two things would be needed:
- [ ] Query the cache line length during configuration. I believe most modern CPUs use 64 bytes, but this would be good to have.
- [ ] A simple (non-interface) class to implement the alignment logic (or simply use std::align), hold the pointers involved, and delete the memory properly in the end. Adding alignment to OmpExecutor::raw_alloc instead is not a nice option because it's protected, and Executor::alloc has logging that is not thread-safe. Another option to implement the alignment logic and proper deletion is to use the C11 aligned_alloc, which is easier.
In particular, the semantics we need are the allocation of a group of memory blocks, each of which begins on a 64-byte boundary. Typically, this can be done by aligning the beginning of the large block and making the stride a multiple of 64.
The current use case I have in mind is dynamic "shared memory" for batched OpenMP solvers.
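The stride arithmetic could look roughly like the sketch below. The names `cache_line`, `padded_stride` and `sub_block` are hypothetical (not existing Ginkgo functions), and the 64-byte line size is an assumption. It presumes the base pointer of the large block is already aligned.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache-line length in bytes; hypothetical constant, not a Ginkgo symbol.
constexpr std::size_t cache_line = 64;

// Round the per-block size up to the next multiple of the alignment,
// so consecutive sub-blocks all start on an aligned address.
constexpr std::size_t padded_stride(std::size_t block_bytes,
                                    std::size_t alignment = cache_line)
{
    return (block_bytes + alignment - 1) / alignment * alignment;
}

// Start of the i-th sub-block, given an already aligned base pointer
// and a stride that is a multiple of the cache-line length.
inline unsigned char* sub_block(unsigned char* aligned_base,
                                std::size_t stride, std::size_t i)
{
    assert(stride % cache_line == 0);
    return aligned_base + i * stride;
}
```

For example, a block of 100 bytes gets a stride of 128 bytes, so block 3 starts 384 bytes after the aligned base.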
Our system is not able to do that portably (and may not until C++ supports it natively) since you can't reconstruct the original pointer returned from malloc after it has been aligned.
Yes, that's why I propose to keep the original pointer in a struct/class and delete using it once we are done. For now, I'm thinking of a static scope-based object that can return the aligned pointer when needed, and it would have a destructor that would use the cached original pointer to delete the entire memory block. Maybe this is best done following the allocator interface, and the allocator can be passed to std::vector, for example, too. Boost already has the exact thing we need, and so does C++17, but I guess we don't want to use either of them, at least not for now.
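A minimal sketch of such a scope-based holder, using only malloc and std::align (both available in C++14). The class name `aligned_block` is hypothetical, and this is only an illustration of the idea, not a proposed final interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical RAII holder: over-allocates with malloc, hands out an aligned
// pointer, and frees the block through the cached original pointer.
class aligned_block {
public:
    aligned_block(std::size_t bytes, std::size_t alignment)
    {
        // std::align requires a power-of-two alignment.
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        // Over-allocate so that an aligned sub-range of `bytes` always fits.
        std::size_t space = bytes + alignment;
        original_ = std::malloc(space);
        if (original_ == nullptr) {
            throw std::bad_alloc{};
        }
        void* p = original_;
        // Moves p up to the next aligned address inside the buffer.
        aligned_ = std::align(alignment, bytes, p, space);
        assert(aligned_ != nullptr);
        assert(reinterpret_cast<std::uintptr_t>(aligned_) % alignment == 0);
    }

    // Free with the pointer malloc returned, not the aligned one.
    ~aligned_block() { std::free(original_); }

    aligned_block(const aligned_block&) = delete;
    aligned_block& operator=(const aligned_block&) = delete;

    void* get() const noexcept { return aligned_; }

private:
    void* original_ = nullptr;
    void* aligned_ = nullptr;
};
```

Copying is disabled so that the original pointer is freed exactly once. Wrapping the same logic in an allocator for std::vector would need extra bookkeeping, since deallocate only receives the aligned pointer.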
Wait, actually, aligned_alloc is in the C11 standard, and C++17 just "inherits" it from there. Maybe we can just use the C11 version while still compiling with C++14. I think this works for the alignment logic and for properly freeing the memory, and it could be used in OmpExecutor::raw_alloc. aligned_alloc has a slightly greater overhead than plain malloc, though I'm positive the benefits outweigh that. In addition, I would add a small function for allocating a large block of memory containing many sub-blocks, each aligned to a cache line.
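A rough sketch of the aligned_alloc route. The helper names `alloc_cache_aligned`, `free_cache_aligned` and `is_aligned` are hypothetical, and it assumes the C library exposes the C11 `aligned_alloc` through `<stdlib.h>` in C++14 mode, which is the case for glibc but not for MSVC. The size is rounded up because C11 originally required it to be a multiple of the alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // C11 aligned_alloc; assumed to be provided by the C library

// Allocate `bytes` bytes starting on an `alignment`-byte boundary.
// Returns nullptr on failure, like malloc.
inline void* alloc_cache_aligned(std::size_t bytes, std::size_t alignment = 64)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // Round the size up to a multiple of the alignment.
    const std::size_t rounded =
        (bytes + alignment - 1) / alignment * alignment;
    return ::aligned_alloc(alignment, rounded);
}

// Memory from aligned_alloc is released with plain free, so no original
// pointer needs to be cached.
inline void free_cache_aligned(void* ptr) { ::free(ptr); }

// Helper to check the alignment of a pointer.
inline bool is_aligned(const void* ptr, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

Since the result can be passed straight to free, this would fit into raw_alloc/raw_free without a holder object.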
With C++17, we could extend #1315 to also enable aligned allocation.
We can query the L1 cache line size at compile time using the "appropriately" named hardware_destructive_interference_size.
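A small sketch of how that could be used, guarded by the library feature-test macro because not every C++17 standard library provides the constant. The names `cache_line_size` and `padded_counter` are hypothetical, and the 64-byte fallback is an assumption.

```cpp
#include <cstddef>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
// Compile-time value chosen by the standard library for the target.
constexpr std::size_t cache_line_size =
    std::hardware_destructive_interference_size;
#else
// Assumed fallback when the library does not provide the constant.
constexpr std::size_t cache_line_size = 64;
#endif

// Example use: pad a per-thread value to its own cache line
// to avoid false sharing between threads.
struct alignas(cache_line_size) padded_counter {
    long value;
};
```

Note that this is a compile-time constant for the target architecture, not a runtime query of the machine the code ends up running on.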