dmu_objset: replace `dnode_hash` impl with `cityhash4`
Motivation and Context
As mentioned in PR #16131, replacing the CRC-based hash with cityhash4 could slightly improve performance by eliminating memory accesses. Replacing the algorithm is safe since the hash result is not persisted.
See: openzfs/zfs#16131
Description
Just replace the CRC-based hash with cityhash4.
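For reference, the change amounts to roughly the following sketch (paraphrased from memory of `dmu_objset.c`, not the exact diff; the precise shift amounts, and whether the new call stays `cityhash4` or becomes a later 2-argument variant, may differ):

```c
/* Before (paraphrased): byte-wise CRC64 using table lookups. */
static uint64_t
dnode_hash(const objset_t *os, uint64_t obj)
{
	uintptr_t osv = (uintptr_t)os;
	uint64_t crc = -1ULL;

	ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
	/* The low bits of the objset pointer carry little entropy. */
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (osv >> 6)) & 0xFF];
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 0)) & 0xFF];
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 8)) & 0xFF];
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 16)) & 0xFF];
	crc ^= (osv >> 14) ^ (obj >> 24);

	return (crc);
}

/* After (sketch): a single call into the existing cityhash code, no table. */
static uint64_t
dnode_hash(const objset_t *os, uint64_t obj)
{
	return (cityhash4((uintptr_t)os, obj, 0, 0));
}
```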
How Has This Been Tested?
Compiled on my machine. Let's see the ZTS results on CI.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain `Signed-off-by`.
I have no particular objection, I'm just wondering if you have a test case where you saw this showing up a lot in your performance bottlenecks?
@rincebrain I did not run a benchmark, but I can do a micro-benchmark if needed. There is also no obvious evidence (e.g. a profiling result) indicating that this hashing is a bottleneck.
I am doing this just because @amotin mentioned it, and it did make the code cleaner. As for performance, the replacement eliminates all memory accesses to the CRC constant table and emits fewer instructions, so intuitively I expect some (perhaps slight) boost, if it is observable at all.
This seems like a fascinating question to ask @allanjude with #16486 and #16487. (Mostly because I'm suspecting that with that work he's hitting workloads with enough scale to notice a difference.)
It looks much cleaner to me, but I slightly worry about the number of 64-bit multiplications in `cityhash4()` and their cost on older CPUs. I wonder if we could create reduced versions of `cityhash4()` for smaller numbers of arguments, since I see a few other use cases that do not use all 4.
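For context on where those multiplications come from, this is roughly the shape of the existing `cityhash4()` in `module/zcommon/cityhash.c` (paraphrased from memory; constants and exact mixing may differ): three multiplies while folding in the input words plus three more in the final mix, i.e. about six 64-bit multiplications per call.

```c
/* Rough shape of cityhash4() (paraphrase, not a verbatim copy). */
#include <stdint.h>

#define	HASH_K1	0xb492b66fbe98f273ULL
#define	HASH_K2	0x9ae16a3b2f90404fULL

static inline uint64_t
rotate(uint64_t val, int shift)
{
	return (shift == 0 ? val : (val >> shift) | (val << (64 - shift)));
}

static inline uint64_t
cityhash_mix(uint64_t u, uint64_t v, uint64_t mul)
{
	uint64_t a = (u ^ v) * mul;	/* multiply #4 */
	a ^= (a >> 47);
	uint64_t b = (v ^ a) * mul;	/* multiply #5 */
	b ^= (b >> 47);
	return (b * mul);		/* multiply #6 */
}

uint64_t
cityhash4(uint64_t w1, uint64_t w2, uint64_t w3, uint64_t w4)
{
	uint64_t mul = HASH_K2 + 64;
	uint64_t a = w1 * HASH_K1;	/* multiply #1 */
	uint64_t b = w2;
	uint64_t c = w4 * mul;		/* multiply #2 */
	uint64_t d = w3 * HASH_K2;	/* multiply #3 */

	return (cityhash_mix(rotate(a + b, 43) + rotate(c, 30) + d,
	    a + rotate(b + HASH_K2, 18) + c, mul));
}
```

When `w3` and `w4` are compile-time zeros and the body is visible to the compiler, the multiplies computing `c` and `d` (and some of the mixing work on 32-bit targets) can be folded away, which is what the specialization discussed below is about.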
Thanks for your review. Do you mean performing some manual "specialization" on cityhash4 by replacing some variables with 0? I think if we can make cityhash4 a fully inline function, maybe the compiler is smart enough to do this for us?
Do you mean performing some manual "specialization" on `cityhash4` by replacing some variables with 0?
Yes.
I think if we can make `cityhash4` a fully inline function, maybe the compiler is smart enough to do this for us?
Yes, I think modern compilers could be clever enough for that. The only question is whether I want it to be inlined everywhere. Though it might be not that bad.
I just had a quick (and rough) try on Compiler Explorer. On x86_64 with gcc 14, specializing `cityhash2(x, y)` as `cityhash4(x, y, 0, 0)` reduces the imul instruction count from 6 to 4 (total instruction count from 28 to 21). With `-m32` it reduces 12 imuls to 8 (102 -> 67 instructions). Additionally, the specialized version greatly reduces stack memory accesses on 32-bit platforms. The situation is similar on other architectures (e.g. arm, riscv).
So I think we indeed need a specialized cityhash2, or even cityhash1. The question is now whether we should inline the hash function everywhere (code size will slightly grow, but the compiler can always do its best), or provide and export different symbol names.
The question is now whether we should inline the hash function everywhere (code size will slightly grow, but the compiler can always do its best), or provide and export different symbol names.
I don't think the code size should be very important. But the effect on the stack size of the calling function could be verified.
But the effect on the stack size of the calling function could be verified.
Per my understanding, inlining calls to a function will always lead to less (or at least equal) stack usage than caller + callee combined. Inlining only has (potential) negative effects on compilation speed, code size and i-cache.
inlining calls to a function will always lead to less (or at least equal) stack usage than caller + callee combined
It may be true in some cases due to the compiler's better ability to optimize things. But it leaks on-stack variables of the child function, which would otherwise exist only during its execution, into the parent, where they may increase stack usage for other children the parent calls.
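A contrived, hypothetical example (not ZFS code) of the effect: once `child_a()` is inlined, its large local can occupy the parent's frame for the parent's whole lifetime, so `child_b()` ends up stacking on top of it instead of on top of a small parent frame.

```c
#include <string.h>

static int
do_work(char *buf, size_t len)
{
	memset(buf, 0, len);
	return ((int)len);
}

static int
child_b(void)
{
	return (42);
}

static inline int
child_a(void)
{
	char scratch[512];	/* normally exists only while child_a() runs */

	return (do_work(scratch, sizeof (scratch)));
}

static int
parent(void)
{
	int r = child_a();

	/*
	 * Out of line, child_a()'s 512 bytes are gone before child_b() runs,
	 * so the peak is max(child_a, child_b) above parent's small frame.
	 * If child_a() is inlined and the compiler does not reuse the slot,
	 * scratch stays in parent's frame and child_b() stacks on top of it.
	 */
	return (r + child_b());
}
```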
But it leaks on-stack variables of the child function, which would otherwise exist only during its execution, into the parent, where they may increase stack usage for other children the parent calls.
Thanks for explaining, I understand what you mean now. It seems not that easy to diagnose stack usage when compiling kernel modules, especially due to the inability to add -fstack-usage to CFLAGS. Decompiling is also annoying since gcc has already performed a lot of inlining, so tracing down certain symbols is not straightforward.
I don't think the code size should be very important.
If so, I can just create and export specialized cityhash1 and cityhash2.
If so, I can just create and export specialized `cityhash1` and `cityhash2`.
I would not object. And maybe cityhash3 also.
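One possible shape for those specializations (a sketch only, with the hypothetical helper name `cityhash_impl` and a placeholder body, not necessarily what the PR ends up with): keep the 4-word mixing in a private `static inline` helper so the zeroed words constant-fold away, and export thin `cityhash1()`/`cityhash2()`/`cityhash3()` wrappers next to `cityhash4()`.

```c
#include <stdint.h>

/*
 * Sketch of specialized entry points.  cityhash_impl() stands for the
 * current cityhash4() body, renamed and made static inline; the stand-in
 * body below is just a placeholder so the sketch compiles.
 */
static inline uint64_t
cityhash_impl(uint64_t w1, uint64_t w2, uint64_t w3, uint64_t w4)
{
	uint64_t h = w1 * 0x9ae16a3b2f90404fULL;	/* placeholder mixing */
	h ^= (h >> 47) ^ (w2 * 0xb492b66fbe98f273ULL);
	h ^= (h >> 47) ^ (w3 * 0x9ae16a3b2f90404fULL);
	return (h ^ (h >> 47) ^ w4);
}

uint64_t
cityhash4(uint64_t w1, uint64_t w2, uint64_t w3, uint64_t w4)
{
	return (cityhash_impl(w1, w2, w3, w4));
}

uint64_t
cityhash3(uint64_t w1, uint64_t w2, uint64_t w3)
{
	return (cityhash_impl(w1, w2, w3, 0));
}

uint64_t
cityhash2(uint64_t w1, uint64_t w2)
{
	return (cityhash_impl(w1, w2, 0, 0));
}

uint64_t
cityhash1(uint64_t w1)
{
	return (cityhash_impl(w1, 0, 0, 0));
}
```

Since the helper is fully visible to the compiler, each exported wrapper gets its zero arguments constant-folded away while callers elsewhere in the tree keep linking against ordinary symbols.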
@amotin I modified the PR to include specialized versions of cityhash4. It passes the build check and should not affect any ztest. Please review again when you are available.
@amotin After re-evaluating, I decided to add all 1/2/3-arg versions of cityhash, since reducing each argument saves some more 64-bit multiplications and should be worthwhile.
Also @behlendorf would you please review and consider merging?
I like the idea, but without benchmarks I would not say yes here. Maybe you can write a small utility for benchmarking it.
I will then test-compile this on PPC64 and AARCH64 as well and publish the results. If this PR has some time... I can maybe write such a utility myself, but I don't know when I will get to it.
@mcmilk I wrote a really small & simple micro-benchmark here: https://gist.github.com/Harry-Chen/0d1eee1a4d8a7685b50c31316ca75671.
And here are some rough test results (order: crc64, cityhash2, cityhash4; unit: us) on some of my devices:
- x86_64 (i9-14900K): 0.668, 0.623, 0.623
- x86_64 (compiler explorer): 0.934, 0.544, 0.542
- i686 (i9-14900K with `-m32`): 3.768, 0.812, 0.818 (crc64 is weirdly slow)
- i686 (compiler explorer with `-m32`): 1.740, 1.640, 1.639
- aarch64 (Kunpeng 920): 1.528, 0.946, 0.946
- ppc64el (POWER8NVL): 1.489, 1.406, 1.406
To me it seems safe to conclude that cityhash does make hashing faster (where observable). Unfortunately I do not have more platforms (e.g. armhf, mipsel, riscv64) to test on, where specializing cityhash with fewer than 4 arguments might help further.
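For the record, the measurement in the gist boils down to something like the following loop (a simplified, hand-rolled sketch; the actual gist uses a small benchmark framework, and `hash_under_test()` here is a placeholder for the crc64/cityhash2/cityhash4 variants):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define	ITERS	1000000ULL

/* Placeholder: swap in the crc64-, cityhash2- or cityhash4-based variant. */
static uint64_t
hash_under_test(uint64_t os, uint64_t obj)
{
	return ((os * 0x9ae16a3b2f90404fULL) ^ obj);
}

int
main(void)
{
	struct timespec t0, t1;
	volatile uint64_t sink = 0;	/* keep calls from being optimized out */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (uint64_t i = 0; i < ITERS; i++)
		sink += hash_under_test(0xdeadbeefULL, i);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
	    (t1.tv_nsec - t0.tv_nsec);
	printf("%.3f ns/hash (sink=%llu)\n", ns / ITERS,
	    (unsigned long long)sink);
	return (0);
}
```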
Just for fun, I ran it on an x86_64 AMD Ryzen 7840U; best run: crc64 0.658us, city2 0.383us, city4 0.383us.
gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 0.986us, confidence interval +- 1.555265%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 0.406us, confidence interval +- 0.142701%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 0.406us, confidence interval +- 0.141682%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 0.827us, confidence interval +- 2.392147%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 0.407us, confidence interval +- 0.312449%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 0.406us, confidence interval +- 0.316596%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 0.658us, confidence interval +- 0.198083%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 0.383us, confidence interval +- 0.331032%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 0.383us, confidence interval +- 0.315814%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
I wonder why the difference between city2 and city4 is so insignificant. It would be good to verify that the inlining optimization does what is expected, now that all the functions are in the same file and the compiler has the ability to inline more. Maybe some noinline attributes would be fair.
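A minimal illustration of what that could look like in the benchmark (hypothetical names and a stand-in hash body, not the gist's code): keep each variant as its own out-of-line function so the compiler can neither accidentally specialize the 4-word case nor merge the two variants into identical code, which mirrors how the exported kernel symbols behave.

```c
#include <stdint.h>

#define	BENCH_NOINLINE	__attribute__((__noinline__))

/* Stand-in for the shared cityhash mixing body. */
static inline uint64_t
mix(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
	uint64_t h = a * 0x9ae16a3b2f90404fULL + b;
	h ^= (h >> 47);
	h += c * 0xb492b66fbe98f273ULL + d;
	return (h ^ (h >> 47));
}

/* 4-word case: must stay general, no constants folded in. */
BENCH_NOINLINE uint64_t
bench_city4(uint64_t w1, uint64_t w2, uint64_t w3, uint64_t w4)
{
	return (mix(w1, w2, w3, w4));
}

/* 2-word case: the zeros may still fold away inside this function. */
BENCH_NOINLINE uint64_t
bench_city2(uint64_t w1, uint64_t w2)
{
	return (mix(w1, w2, 0, 0));
}
```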
Yea. At least when building with Clang on FreeBSD, dnode_hash_city2() and dnode_hash_city4() appeared identical until adding __noinline to cityhash*() fixed the optimization (AMD EPYC 7402):
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 1.071us, confidence interval +- 1.403153%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 0.610us, confidence interval +- 0.054205%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 0.715us, confidence interval +- 1.291311%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 1.001us, confidence interval +- 0.045569%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 0.611us, confidence interval +- 0.055383%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 0.645us, confidence interval +- 0.936527%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 1.000us, confidence interval +- 0.047267%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 0.610us, confidence interval +- 0.056379%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 0.645us, confidence interval +- 1.383868%)
[==========] 3 benchmarks ran.
The difference may not be huge, but it is reproducible. And on 32 bits it is predictably even bigger:
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 1.134us, confidence interval +- 0.154648%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 1.319us, confidence interval +- 1.332820%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 1.536us, confidence interval +- 0.477121%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 1.129us, confidence interval +- 0.057105%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 1.241us, confidence interval +- 0.643688%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 1.532us, confidence interval +- 0.456539%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 1.136us, confidence interval +- 0.815983%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 1.281us, confidence interval +- 1.198090%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 1.564us, confidence interval +- 0.764337%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
Thank you for the benchmark :+1:
On my local Ryzen 7 PRO 5850U, fixed at 1.4 GHz, it looks like this:
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 2.163us, confidence interval +- 0.028878%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 1.221us, confidence interval +- 0.026899%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 1.221us, confidence interval +- 0.024906%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 2.163us, confidence interval +- 0.029074%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 1.221us, confidence interval +- 0.029488%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 1.221us, confidence interval +- 0.032090%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
On ARM Neoverse-N1 it looks like this:
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 1.409us, confidence interval +- 2.150333%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 1.079us, confidence interval +- 0.106291%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 1.074us, confidence interval +- 0.153203%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
[mcmilk@lea ~/src/hash]$ ./a.out
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 1.397us, confidence interval +- 0.093639%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 1.077us, confidence interval +- 0.118394%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 1.072us, confidence interval +- 0.164758%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
On PPC64LE (POWER10, pvr 0080 0200) there is no big difference:
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 0.979us, confidence interval +- 0.092565%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 0.924us, confidence interval +- 0.100753%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 0.924us, confidence interval +- 0.099303%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 0.979us, confidence interval +- 0.092653%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 0.924us, confidence interval +- 0.100670%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 0.924us, confidence interval +- 0.102227%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
So yes, it looks okay.
Just for fun:
$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN ] dnode_hash.crc64
[ OK ] dnode_hash.crc64 (mean 6.005us, confidence interval +- 2.068187%)
[ RUN ] dnode_hash.city2
[ OK ] dnode_hash.city2 (mean 4.971us, confidence interval +- 0.022014%)
[ RUN ] dnode_hash.city4
[ OK ] dnode_hash.city4 (mean 5.015us, confidence interval +- 2.424526%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
$ grep Model /proc/cpuinfo
Model : Raspberry Pi 4 Model B Rev 1.4
Yea. At least when building with Clang on FreeBSD, dnode_hash_city2() and dnode_hash_city4() appeared identical until adding __noinline to cityhash*() fixed the optimization (AMD EPYC 7402):
Thanks! I had not read the emitted assembly of the different tests very carefully. So yes, specializing cityhash does some good.
Good stuff, thanks for providing the micro benchmark for this. If you can rebase this on master to resolve the conflict it should be good to go.
I've rebased. It seems CI now takes much longer...