
use popcount + PDEP for uniform random slot selection on x86 for Haswell and later

Open thestinger opened this issue 6 years ago • 7 comments

PDEP allows selecting the nth unset bit efficiently (a couple cycles) so it's a fantastic way of implementing this. There's no clear way to do it at all efficiently elsewhere, which is why the current portable implementation only randomizes the search start index and then uses the ffs intrinsic.

thestinger avatar Aug 30 '18 00:08 thestinger

I don't have much context about your overall problem, but you can do select reasonably efficiently without PDEP too. See folly for example (https://github.com/facebook/folly/blob/master/folly/experimental/Select64.h).

pgera avatar Apr 01 '21 10:04 pgera

The current code is here:

https://github.com/GrapheneOS/hardened_malloc/blob/master/h_malloc.c#L353-L372

thestinger avatar Apr 01 '21 19:04 thestinger

The non-random implementation is quite fast already, and the only reason we don't manually unroll the loop is that the compiler already generates code that's as good as or better than a manual unroll. I do have an implementation of unrolling it around somewhere.

The random implementation could be faster and ideally it would be a uniform random choice, not simply starting the search in a random location as we're currently doing. I don't want to pay any additional performance cost to make it better though.

thestinger avatar Apr 01 '21 19:04 thestinger

Ideally, we would have an implementation that is faster and does uniform random choice everywhere, not only modern x86_64.

thestinger avatar Apr 01 '21 19:04 thestinger

As explained on matrix by @thestinger :

It could potentially be improved another way. You can see where that's done in get_free_slot: there's a bitmap consisting of 4 x u64. Depending on the size class, 1-4 of the u64 may be used, so it only has to iterate over the ones that are used, and for many size classes the final u64 is only partially used, so there's a function (get_mask) to mask out the bits it shouldn't look at. Each bit represents a slot in the slab; the most is 256 slots (for sizes 0 and 16), which drops to 128 (so only 2 of the u64 used) for 32 byte allocations.

Anyway, ideally this would be made faster. The way it currently randomizes slot selection is by picking a random index in [0..number_of_slots) and starting the search from there, wrapping back around to check the portion before that if needed, so that's not a uniform random choice. If we used PDEP we could probably make it both faster and uniform, or even faster and more uniform than otherwise but not fully.

jvoisin avatar Jan 04 '22 10:01 jvoisin

Just noting that PDEP is terrible on AMD processors before Zen 3. It is essentially emulated in microcode, getting slower (roughly 8 uops) for every bit set in the PDEP mask operand. Performance of PDEP is terrible even reg-to-reg, and it's even worse with a memory operand.

More discussion/info about it here:

  • https://twitter.com/trav_downs/status/1202793097962364928
  • https://twitter.com/InstLatX64/status/1209095219087585281
  • uops.info: PDEP Skylake vs. Zen 2 vs. Zen 3

PaulusParssinen avatar Oct 18 '22 20:10 PaulusParssinen

There are similar optimisations for Arm NEON, as used by isoalloc.

jvoisin avatar Jan 02 '24 00:01 jvoisin