starrocks
starrocks copied to clipboard
[Enhancement] improve crc64 for hashset performance
Why I'm doing:
What I'm doing:
Add a mix for crc64 to improve the hash function quality.
The current crc64 function is utilized for phmap::flat_hash_set, which uses the lowest 7 bits as a fingerprint to enhance lookup performance. However, crc64 is not an ideal hash function for this purpose due to several reasons:
- Mismatch with SwissTable Assumptions: SwissTable uses the upper 57 bits for group indexing and the lower 7 bits as a fingerprint. CRC-32 only provides 32 bits, and its lower bits exhibit poor correlation (the higher bits are more active in polynomial feedback). Truncating directly results in the "poor 7 bits" being used as fingerprints, which increases collision rates within the same probe group.
- Predictable Linear Properties: CRC is a linear codec, meaning it follows the property crc(a ⊕ b) = crc(a) ⊕ crc(b). For structured keys like log IDs or sequential numbers, this predictability can lead to patterned collisions, resulting in longer probe chains and more frequent rehashing.
Experiment:
- In the Insert scenario,
Mixedperforms significantly faster thanUnmixeddue to the grouped hashing mechanism. This approach requires iterating through all 16 slots in a group to ensure no equivalent element exists within the group, if the fingerprint cannot effectively filter the unequal element. - In the Lookup scenario, the performance difference is negligible, as comparing 32-byte strings is not computationally expensive.
// -----------------------------------------------------------------------------------------------
// Benchmark Time CPU Iterations
// -----------------------------------------------------------------------------------------------
// BM_PhmapFlatHashSet_CrcHash64_Insert/10000 1098692 ns 1098642 ns 642
// BM_PhmapFlatHashSet_CrcHash64_Insert/100000 13848007 ns 13846984 ns 48
// BM_PhmapFlatHashSet_CrcHash64_Insert/1000000 552114016 ns 552078649 ns 1
// BM_PhmapFlatHashSet_CrcHash64_Insert_Unmixed/10000 1290259 ns 1290163 ns 541
// BM_PhmapFlatHashSet_CrcHash64_Insert_Unmixed/100000 18784369 ns 18783587 ns 38
// BM_PhmapFlatHashSet_CrcHash64_Insert_Unmixed/1000000 626093501 ns 626056004 ns 1
// BM_PhmapFlatHashSet_CrcHash64_Lookup/10000 605826 ns 605798 ns 1161
// BM_PhmapFlatHashSet_CrcHash64_Lookup/100000 11105350 ns 11104843 ns 63
// BM_PhmapFlatHashSet_CrcHash64_Lookup/1000000 254053820 ns 254043024 ns 3
// BM_PhmapFlatHashSet_CrcHash64_Lookup_Unmixed/10000 709504 ns 709473 ns 987
// BM_PhmapFlatHashSet_CrcHash64_Lookup_Unmixed/100000 10992399 ns 10991650 ns 63
// BM_PhmapFlatHashSet_CrcHash64_Lookup_Unmixed/1000000 254696960 ns 254686158 ns 3
TODO: Consider if other hash functions might be more suitable for flat_hash_set.
Fixes #issue
What type of PR is this:
- [ ] BugFix
- [ ] Feature
- [x] Enhancement
- [ ] Refactor
- [ ] UT
- [ ] Doc
- [ ] Tool
Does this PR entail a change in behavior?
- [ ] Yes, this PR will result in a change in behavior.
- [x] No, this PR will not result in a change in behavior.
If yes, please specify the type of change:
- [ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
- [ ] Parameter changes: default values, similar parameters but with different default values
- [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
- [ ] Feature removed
- [ ] Miscellaneous: upgrade & downgrade compatibility, etc.
Checklist:
- [ ] I have added test cases for my bug fix or my new feature
- [ ] This pr needs user documentation (for new or modified features or behaviors)
- [ ] I have added documentation for my new feature or new function
- [ ] This is a backport pr
Bugfix cherry-pick branch check:
- [x] I have checked the version labels which the pr will be auto-backported to the target branch
- [ ] 3.5
- [ ] 3.4
- [ ] 3.3
I think maybe you can check collision count, instead of just comparing performance.
@Mergifyio rebase
rebase
✅ Branch has been successfully rebased
@Mergifyio rebase
rebase
✅ Branch has been successfully rebased
[Java-Extensions Incremental Coverage Report]
:white_check_mark: pass : 0 / 0 (0%)
[FE Incremental Coverage Report]
:white_check_mark: pass : 0 / 0 (0%)
[BE Incremental Coverage Report]
:white_check_mark: pass : 4 / 4 (100.00%)
file detail
| path | covered_line | new_line | coverage | not_covered_line_detail | |
|---|---|---|---|---|---|
| :large_blue_circle: | be/src/util/hash.h | 4 | 4 | 100.00% | [] |
@Mergifyio backport branch-3.5
backport branch-3.5
✅ Backports have been created
- #60593 [Enhancement] improve crc64 for hashset performance (backport #60074) has been created for branch
branch-3.5