cperl icon indicating copy to clipboard operation
cperl copied to clipboard

small hash

Open rurban opened this issue 9 years ago • 5 comments

optimize hashes with <= 3-5 keys to a simple array of keys and values with linear lookup.

HvSMALL(hv) / XHvSMALL(xhv) is either checking HvMAX < 7, or a flag. If a flag the very first HE* entry needs to be a non-ptr tag (& 0x1). We'd need a flag with inlined HEs and overlong keys, to omit HvSMALL optims with such long keys. We cannot the hv_aux based HvFLAGS with normal HvSMALL hashes, esp. when inlined.

The best would be a he-array alike inlined len/char*/flags/val array to be cache concious. (as in #24 feature/gh24-he-array). The len really should be run-length encoded, then the flags needed for hash cmp need to come first. However at first we start with simple HE* arrays. (array of ptrs, not values) The last array element needs to have an NULL sentinel, so we cannot use all 7 HE*, only 6.

But there are many more simple hash optims, which we do first.

  • extract uncommon magical code from hv_common
  • add __builtin_ctz support (count trailing zeros) and use it instead of division on DO_HSPLIT (done with the builtin-ctz branch)
  • pre-extend hashes as in aassign with av_extend, when the number of keys is on the stack. This speeds up all the big hash inits (e.g. warnings.pm), needing no costly series of splits during initialization.

rurban avatar Dec 30 '15 23:12 rurban

I estimate you can use linear or serial (unsorted) lookup up to 100 keys or even more, depending on benchmarks.

In my port of LCS::BV from Perl to C I began with Bob Jenkins hash and ended the tuning using VLAs (variable length arrays) on the stack, the array serially filled (\0 terminated). See llcs_seq_a() and the used hash_setpos() and hash_getpos(). With Bob Jenkins I get 250 kHz (cases per second) on i5@1500, with serial VLAs 7.5 MHz, thus factor 30x. The calloc variant llcs_seq() comes at 4 MHz.

Of course in my example I can benefit from the known restrictions: maximum size, keys strings immutable, typed values (uint_64).

wollmers avatar Apr 27 '16 18:04 wollmers

So many? I thought I only want to fill one cache line, so just very few keys. But I'll benchmark it soon, when I got more time. Other langs tested 3-5, if I remember.

On Wed, Apr 27, 2016, 20:59 Helmut Wollmersdorfer [email protected] wrote:

I estimate you can use linear or serial (unsorted) lookup up to 100 keys or even more, depending on benchmarks.

In my port of LCS::BV https://github.com/wollmers/LCS-BV/tree/master/lib/LCS from Perl to C I began with Bob Jenkins hash and ended the tuning using VLAs (variable length arrays) on the stack, the array serially filled (\0 terminated). See llcs_seq_a() https://github.com/wollmers/c-lcs-bv/blob/master/lcstest.c#L105 and the used hash_setpos() and hash_getpos(). With Bob Jenkins I get 250 kHz (cases per second) on i5@1500, with serial VLAs 7.5 MHz, thus factor 30x. The calloc variant llcs_seq() https://github.com/wollmers/c-lcs-bv/blob/master/lcstest.c#L144 comes at 4 MHz.

Of course in my example I can benefit from the known restrictions: maximum size, keys strings immutable, typed values (uint_64).

— You are receiving this because you were assigned. Reply to this email directly or view it on GitHub https://github.com/perl11/cperl/issues/102#issuecomment-215193606

rurban avatar Apr 27 '16 22:04 rurban

You should trust only numbers you benchmarked yourself;-)

Hash is said to have complexity O(1). But as always it is O(1*k), where k is the implementation factor.

Serial has O((n/2)*k). A break even point of n=4 between hash and serial would need k_hash = 2 * k_serial. I.e. the hash algorithm executes only the double amount of instructions compared to one iteration of the loop of serial. My serial has 3 instructions (C operators) in the loop including conditions. So for a break even n=4 it would need a hash function (locating the entry in the array) to only use 6 instructions.

I didn't optimize for cache friendlyness directly. Serial just maps a nearly indefinite (sparse) alphabet to a minimal one (none sparse) and keeps nearly the order of filling, which is memory and cache friendly. Hash algorithms (if not perfect hashes) map sparse to not so sparse, but still sparse.

wollmers avatar Apr 28 '16 09:04 wollmers

I went with 7 because this is the initial calloced size. But it doesn't work yet, so I cannot benchmark it.

rurban avatar Jul 17 '16 17:07 rurban

Merged the 3 first parts (ctz, hv_common_magical, pre-extend). In the end it was much faster than expected. On my linux gcc-6.1/i5-2300 it was 8-14% faster in perlbench-run, on my darwin gcc-6-lto/i7-4650U 6% faster.

Having the magical code seperated and abstracted away will also help in the future hash rewrites.

rurban avatar Aug 19 '16 06:08 rurban