starrocks icon indicating copy to clipboard operation
starrocks copied to clipboard

[Enhancement] improve crc64 for hashset performance

Open murphyatwork opened this issue 5 months ago • 4 comments

Why I'm doing:

What I'm doing:

Add a mix for crc64 to improve the hash function quality.

The current crc64 function is utilized for phmap::flat_hash_set, which uses the lowest 7 bits as a fingerprint to enhance lookup performance. However, crc64 is not an ideal hash function for this purpose due to several reasons:

  1. Mismatch with SwissTable Assumptions: SwissTable uses the upper 57 bits for group indexing and the lower 7 bits as a fingerprint. CRC-32 only provides 32 bits, and its lower bits exhibit poor correlation (the higher bits are more active in polynomial feedback). Truncating directly results in the "poor 7 bits" being used as fingerprints, which increases collision rates within the same probe group.
  2. Predictable Linear Properties: CRC is a linear codec, meaning it follows the property crc(a ⊕ b) = crc(a) ⊕ crc(b). For structured keys like log IDs or sequential numbers, this predictability can lead to patterned collisions, resulting in longer probe chains and more frequent rehashing.

Experiment:

  1. In the Insert scenario, Mixed performs significantly faster than Unmixed due to the grouped hashing mechanism. This approach requires iterating through all 16 slots in a group to ensure no equivalent element exists within the group, if the fingerprint cannot effectively filter the unequal element.
  2. In the Lookup scenario, the performance difference is negligible, as comparing 32-byte strings is not computationally expensive.
// -----------------------------------------------------------------------------------------------
// Benchmark                                                     Time             CPU   Iterations
// -----------------------------------------------------------------------------------------------
// BM_PhmapFlatHashSet_CrcHash64_Insert/10000              1098692 ns      1098642 ns          642
// BM_PhmapFlatHashSet_CrcHash64_Insert/100000            13848007 ns     13846984 ns           48
// BM_PhmapFlatHashSet_CrcHash64_Insert/1000000          552114016 ns    552078649 ns            1
// BM_PhmapFlatHashSet_CrcHash64_Insert_Unmixed/10000      1290259 ns      1290163 ns          541
// BM_PhmapFlatHashSet_CrcHash64_Insert_Unmixed/100000    18784369 ns     18783587 ns           38
// BM_PhmapFlatHashSet_CrcHash64_Insert_Unmixed/1000000  626093501 ns    626056004 ns            1
// BM_PhmapFlatHashSet_CrcHash64_Lookup/10000               605826 ns       605798 ns         1161
// BM_PhmapFlatHashSet_CrcHash64_Lookup/100000            11105350 ns     11104843 ns           63
// BM_PhmapFlatHashSet_CrcHash64_Lookup/1000000          254053820 ns    254043024 ns            3
// BM_PhmapFlatHashSet_CrcHash64_Lookup_Unmixed/10000       709504 ns       709473 ns          987
// BM_PhmapFlatHashSet_CrcHash64_Lookup_Unmixed/100000    10992399 ns     10991650 ns           63
// BM_PhmapFlatHashSet_CrcHash64_Lookup_Unmixed/1000000  254696960 ns    254686158 ns            3

TODO: Consider if other hash functions might be more suitable for flat_hash_set.

Fixes #issue

What type of PR is this:

  • [ ] BugFix
  • [ ] Feature
  • [x] Enhancement
  • [ ] Refactor
  • [ ] UT
  • [ ] Doc
  • [ ] Tool

Does this PR entail a change in behavior?

  • [ ] Yes, this PR will result in a change in behavior.
  • [x] No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • [ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • [ ] Parameter changes: default values, similar parameters but with different default values
  • [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
  • [ ] Feature removed
  • [ ] Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • [ ] I have added test cases for my bug fix or my new feature
  • [ ] This pr needs user documentation (for new or modified features or behaviors)
    • [ ] I have added documentation for my new feature or new function
  • [ ] This is a backport pr

Bugfix cherry-pick branch check:

  • [x] I have checked the version labels which the pr will be auto-backported to the target branch
    • [ ] 3.5
    • [ ] 3.4
    • [ ] 3.3

murphyatwork avatar Jun 19 '25 08:06 murphyatwork

I think maybe you can check collision count, instead of just comparing performance.

dirtysalt avatar Jun 20 '25 07:06 dirtysalt

@Mergifyio rebase

murphyatwork avatar Jun 30 '25 02:06 murphyatwork

rebase

✅ Branch has been successfully rebased

mergify[bot] avatar Jun 30 '25 02:06 mergify[bot]

@Mergifyio rebase

murphyatwork avatar Jul 03 '25 10:07 murphyatwork

rebase

✅ Branch has been successfully rebased

mergify[bot] avatar Jul 03 '25 10:07 mergify[bot]

[Java-Extensions Incremental Coverage Report]

:white_check_mark: pass : 0 / 0 (0%)

github-actions[bot] avatar Jul 03 '25 13:07 github-actions[bot]

[FE Incremental Coverage Report]

:white_check_mark: pass : 0 / 0 (0%)

github-actions[bot] avatar Jul 03 '25 13:07 github-actions[bot]

[BE Incremental Coverage Report]

:white_check_mark: pass : 4 / 4 (100.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
:large_blue_circle: be/src/util/hash.h 4 4 100.00% []

github-actions[bot] avatar Jul 03 '25 14:07 github-actions[bot]

@Mergifyio backport branch-3.5

github-actions[bot] avatar Jul 04 '25 02:07 github-actions[bot]

backport branch-3.5

✅ Backports have been created

mergify[bot] avatar Jul 04 '25 02:07 mergify[bot]