hardened_malloc
use popcount + PDEP for uniform random slot selection on x86 for Haswell and later
PDEP allows selecting the nth unset bit efficiently (a couple of cycles), so it's a fantastic way of implementing this. There's no clear way to do it efficiently on other architectures, which is why the current portable implementation only randomizes the search start index and then uses the ffs intrinsic.
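To make the trick concrete, here is a minimal sketch (not the project's code) assuming GCC/Clang with BMI1/BMI2 available (`-mbmi -mbmi2`): depositing a single bit at position n into the mask and counting trailing zeros yields the index of the n-th set bit, and the n-th unset bit follows by complementing the mask first.

```c
#include <stdint.h>
#include <immintrin.h> /* _pdep_u64 (BMI2), _tzcnt_u64 (BMI1); build with -mbmi -mbmi2 */

/* Index of the n-th (0-based) set bit of mask: PDEP deposits the single
 * source bit (1 << n) into the position of the n-th set bit of mask, and
 * TZCNT recovers that position. Returns 64 if n >= popcount(mask). */
static inline unsigned select_nth_set(uint64_t mask, unsigned n) {
    return (unsigned)_tzcnt_u64(_pdep_u64(UINT64_C(1) << n, mask));
}

/* The n-th unset bit is just the n-th set bit of the complement. */
static inline unsigned select_nth_unset(uint64_t mask, unsigned n) {
    return select_nth_set(~mask, n);
}
```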
I don't have much context about your overall problem, but you can do select reasonably efficiently without PDEP too. See folly for example (https://github.com/facebook/folly/blob/master/folly/experimental/Select64.h).
The current code is here:
https://github.com/GrapheneOS/hardened_malloc/blob/master/h_malloc.c#L353-L372
The non-random implementation is quite fast already, and the only reason we don't manually unroll the loop is that the compiler generates code that's as good as or better than manual unrolling. I do have an unrolled implementation around somewhere.
The random implementation could be faster, and ideally it would make a uniform random choice rather than simply starting the search at a random location as we currently do. I don't want to pay any additional performance cost to make it better, though.
Ideally, we would have an implementation that is faster and does uniform random choice everywhere, not only modern x86_64.
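For reference, the randomized-start approach described above looks roughly like the following simplified sketch (it omits the partial-final-word masking done by get_mask and uses GCC/Clang builtins; the linked h_malloc.c is the authoritative version). The bias is visible here: a free slot preceded by a long run of used slots is chosen more often than one in a sparse region.

```c
#include <stdint.h>

/* Simplified sketch of the randomized-start search (free slots are the 0
 * bits here). random_slot is a random index in [0, words * 64). */
static int find_free_slot_random_start(const uint64_t bitmap[4], unsigned words,
                                       unsigned random_slot) {
    unsigned start_word = random_slot / 64;
    unsigned start_bit = random_slot % 64;

    for (unsigned n = 0; n < words; n++) {
        unsigned i = (start_word + n) % words;
        uint64_t avail = ~bitmap[i]; /* 1 bits = free slots */
        if (n == 0) {
            avail &= ~((UINT64_C(1) << start_bit) - 1); /* skip bits before start */
        }
        if (avail != 0) {
            return (int)(i * 64 + (unsigned)__builtin_ctzll(avail));
        }
    }

    /* Wrap around: recheck the part of the start word before the offset. */
    uint64_t avail = ~bitmap[start_word] & ((UINT64_C(1) << start_bit) - 1);
    if (avail != 0) {
        return (int)(start_word * 64 + (unsigned)__builtin_ctzll(avail));
    }
    return -1; /* slab full */
}
```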
As explained on Matrix by @thestinger:
it could potentially be improved another way: you can see where that's done in `get_free_slot`. There's a bitmap consisting of 4 x `u64`. Depending on the size class, 1-4 `u64` may be used, so it has to iterate only over the ones that are used, and for many size classes the final `u64` is only partially used, so it has a function (`get_mask`) to mask out the bits it shouldn't look at. Each bit represents a slot in the slab; the most is 256 slots (for size 0 and 16), and then it drops to 128 (so only 2 of the `u64` are used) for 32 byte allocations.
Anyway, ideally this would be made faster. The way it currently randomizes slot selection is by picking a random index in `[0..number_of_slots)` and then starting the search from there, wrapping back around to check the portion before that if needed, so it's not a uniform random choice. If we used `PDEP`, we could probably make it both faster and uniform, or even faster and more uniform than otherwise but not fully.
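Putting the pieces together, a hedged sketch of the popcount + PDEP scheme described above might look like this (the function names and the `rand_below` RNG helper are hypothetical, not hardened_malloc API): count the free bits per word with popcount, draw one uniform index over the total, then select that bit within its word with PDEP + TZCNT.

```c
#include <stdint.h>
#include <immintrin.h> /* _pdep_u64, _tzcnt_u64; build with -mbmi -mbmi2 */

unsigned rand_below(unsigned bound); /* hypothetical: uniform random in [0, bound) */

/* Uniform random free-slot selection over a 4 x u64 bitmap (set bit = used
 * slot). masks[i] plays the role of get_mask(): it clears the unused tail
 * of the final word so those bits are never considered. */
static int get_free_slot_uniform(const uint64_t bitmap[4],
                                 const uint64_t masks[4], unsigned words) {
    uint64_t avail[4];
    unsigned total = 0;
    for (unsigned i = 0; i < words; i++) {
        avail[i] = ~bitmap[i] & masks[i]; /* 1 bits = selectable free slots */
        total += (unsigned)__builtin_popcountll(avail[i]);
    }
    if (total == 0) {
        return -1; /* slab full */
    }

    unsigned r = rand_below(total); /* every free slot is equally likely */
    for (unsigned i = 0; i < words; i++) {
        unsigned count = (unsigned)__builtin_popcountll(avail[i]);
        if (r < count) {
            /* Select the r-th set bit of this word with PDEP + TZCNT. */
            uint64_t bit = _pdep_u64(UINT64_C(1) << r, avail[i]);
            return (int)(i * 64 + (unsigned)_tzcnt_u64(bit));
        }
        r -= count;
    }
    __builtin_unreachable();
}
```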
Just noting that PDEP is terrible on AMD processors before Zen 3. It is essentially emulated in microcode and gets slower (8 uops) for every bit set in the PDEP mask operand. Performance of PDEP is terrible even reg-to-reg, and it's even worse with a memory operand, so the fast path would need runtime dispatch (see the sketch after the links below).
More discussion/info about it here:
- https://twitter.com/trav_downs/status/1202793097962364928
- https://twitter.com/InstLatX64/status/1209095219087585281
- uops.info: PDEP Skylake vs. Zen 2 vs. Zen 3
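A minimal sketch of such a runtime gate, assuming GCC/Clang, is below. Note that `__builtin_cpu_supports("bmi2")` only proves the instruction exists, not that it is fast, so distinguishing pre-Zen 3 AMD would additionally require a CPUID vendor/family check.

```c
#include <stdbool.h>

/* GCC/Clang runtime feature check. This only proves BMI2 (and thus PDEP)
 * exists; it cannot detect the slow microcoded PDEP on pre-Zen 3 AMD, so a
 * production check would also need a CPUID vendor/family test. */
static bool pdep_is_usable(void) {
    return __builtin_cpu_supports("bmi2");
}
```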
There are similar optimisations for Arm NEON, as used by isoalloc.