zfs icon indicating copy to clipboard operation
zfs copied to clipboard

dmu_objset: replace `dnode_hash` impl with `cityhash4`

Open Harry-Chen opened this issue 1 year ago • 2 comments

Motivation and Context

As mentioned in PR #16131, replacing CRC-based hash with cityhash4 could slightly improve the performance by eliminating memory access. Replacing algorightm is safe since the hash result is not persisted.

See: openzfs/zfs#16131

Description

Just replace CRC-based hash with cityhash4.

How Has This Been Tested?

Compiled on my machine. Let's see the ZTS results on CI.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [ ] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

Harry-Chen avatar Aug 27 '24 08:08 Harry-Chen

I have no particular objection, I'm just wondering if you have a test case where you saw this showing up a lot in your performance bottlenecks?

rincebrain avatar Aug 27 '24 20:08 rincebrain

@rincebrain I did not run a benchmark. But I can do a micro-benchmark if needed. And there is no obvious evidence (e.g. profiling result) indicating this hashing is a bottleneck either.

I am doing this just because @amotin mentioned it, and it did make the code cleaner. As for performance, the replacement would eliminate all memory access to the CRC constant table, as well as emitting fewer instructions. So intuitively I think it would provide some (maybe slight) boost, if observable.

Harry-Chen avatar Aug 28 '24 04:08 Harry-Chen

This seems like a fascinating question to ask @allanjude with #16486 and #16487. (Mostly because I'm suspecting that with that work he's hitting workloads with enough scale to notice a difference.)

rincebrain avatar Aug 30 '24 03:08 rincebrain

It looks much cleaner for me, but I slightly worry about number of 64bit multiplications in cityhash4() and their time on older CPUs. I wonder if we could create reduced versions of cityhash4() for smaller number of arguments, since I see few other use cases not using all 4.

Thanks for your review. Do you mean performing some manual "specialization" on cityhash4 by replacing some variables with 0? I think if we can make cityhash4 a fully inline function, maybe the compiler is smart enough to do this for us?

Harry-Chen avatar Sep 05 '24 16:09 Harry-Chen

Do you mean performing some manual "specialization" on cityhash4 by replacing some variables with 0?

Yes.

I think if we can make cityhash4 a fully inline function, maybe the compiler is smart enough to do this for us?

Yes, I think modern compilers could be clever enough for that. The only question is whether I want it to be inlined everywhere. Though it might be not that bad.

amotin avatar Sep 05 '24 16:09 amotin

Yes, I think modern compilers could be clever enough for that. The only question is whether I want it to be inlined everywhere. Though it might be not that bad.

Just have a quick (and rough) try on compiler explorer. On x86_64 with gcc 14, specializing cityhash2(x, y) with cityhash4(x, y, 0, 0) reduces imul insn count from 6 to 4 (with total #insn from 28 to 21). When using -m32 it reduces 12 imul to 8 (#insn 102 -> 67). Additionally, the specialized version largely reduces stack memory access on 32-bit platforms. On other architectures (e.g. arm, riscv) situations are similar.

So I think we indeed need a specialized cityhash2, or even cityhash1. The question is now whether we should inline the hash function everywhere (code size will slightly grow, but the compiler can always do its best), or provide and export different symbol names.

Harry-Chen avatar Sep 05 '24 17:09 Harry-Chen

The question is now whether we should inline the hash function everywhere (code size will slightly grow, but the compiler can always do its best), or provide and export different symbol names.

I don't think the code size should be very important. But effect on stack size of the calling function could be verified.

amotin avatar Sep 05 '24 17:09 amotin

But effect on stack size of the calling function could be verified.

Per my understanding, inlining calls to a function will always lead to less (or at least equal) stack usage than caller + callee combined. Inlining only has (potential) negative effects on compilation speed, code size and i-cache.

Harry-Chen avatar Sep 05 '24 17:09 Harry-Chen

inlining calls to a function will always lead to less (or at least equal) stack usage than caller + callee combined

It may be true in some cases due to compiler's better ability to optimize things. But it leaks on-stack variables of the child function, that would exist only during its execution, into the parent, where they may increase stack usage for other children the parent calls.

amotin avatar Sep 06 '24 15:09 amotin

But it leaks on-stack variables of the child function, that would exist only during its execution, into the parent, where they may increase stack usage for other children the parent calls.

Thanks for explaining, I understand what you mean now. It seems not that easy to diagnosis stack usage in compiling kernel modules, especially due to the inability to add -fstack-usage into CFAGS. Decomping is also annoying since gcc has already performed a lot of inlining, so tracing down certain symbols is not straightforward.

I don't think the code size should be very important.

If so, I can just create and export specialized cityhash1 and cityhash2.

Harry-Chen avatar Sep 06 '24 16:09 Harry-Chen

If so, I can just create and export specialized cityhash1 and cityhash2.

I would not object. And may be cityhash3 also.

amotin avatar Sep 06 '24 17:09 amotin

@amotin I modified the PR to include specialization versions of cityhash4. It passes build check and should not affect any ztest. Please review again when available.

Harry-Chen avatar Sep 07 '24 15:09 Harry-Chen

@amotin After re-evaluating, I decided to add all 1/2/3-arg versions of cityhash, since reducing each argument could save some more 64b multiplications and should be worthwhile.

Also @behlendorf would you please review and consider merging?

Harry-Chen avatar Sep 09 '24 13:09 Harry-Chen

I like the idea, but without benchmarks I would not say yes here. Maybe you can write some small test utility for testing.

I will test compile this then on PPC64 and AARCH64 also and publish the results. If this PR has some time... I can maybe write some utility myself. But I don't know when I get to do this.

mcmilk avatar Sep 16 '24 16:09 mcmilk

@mcmilk I wrote a really small & simple micro-benchmark here: https://gist.github.com/Harry-Chen/0d1eee1a4d8a7685b50c31316ca75671.

And here's some rough test results (order: crc64, cityhash2, cityhash4, unit: us) on some of my devices:

  • x86_64 (i9-14900K): 0.668, 0.623, 0.623
  • x86_64 (compiler explorer): 0.934, 0.544, 0.542
  • i686 (i9-14900K with -m32): 3.768, 0.812, 0.818 (crc64 is weirdly slow)
  • i686 (compiler explorer with -m32): 1.740, 1.640, 1.639
  • aarch64 (Kunpeng 920): 1.528, 0.946, 0.946
  • ppc64el (POWER8NVL): 1.489, 1.406, 1.406

To me it is safe to conclude cityhash does make hashing faster (if observable). Unfortunately I do not have more platforms (e.g. armhf, mipsel, riscv64) to test, on which specializing cityhash with <4 arguments might help further.

Harry-Chen avatar Sep 16 '24 18:09 Harry-Chen

Just for fun ran it on x86_64 amd ryzen 7840u, best run: crc64 0.658us, city2 0.383us, city4 0.383us

gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.986us, confidence interval +- 1.555265%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.406us, confidence interval +- 0.142701%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.406us, confidence interval +- 0.141682%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.827us, confidence interval +- 2.392147%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.407us, confidence interval +- 0.312449%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.406us, confidence interval +- 0.316596%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.658us, confidence interval +- 0.198083%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.383us, confidence interval +- 0.331032%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.383us, confidence interval +- 0.315814%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

gmelikov avatar Sep 16 '24 19:09 gmelikov

I wonder why is the difference between city2 and city4 is so insignificant. It would be good to verify that inlining optimization does what is expected, now that all the functions are in the same file and compiler has ability to inline more. May be some noinline parameters could be fair.

amotin avatar Sep 16 '24 19:09 amotin

Yea. At least building with Clang on FreeBSD dnode_hash_city2() and dnode_hash_city4() appeared identical until addition of __noinline to cityhash*() fixed the optimization (AMD EPYC 7402):

mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.071us, confidence interval +- 1.403153%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.610us, confidence interval +- 0.054205%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.715us, confidence interval +- 1.291311%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.001us, confidence interval +- 0.045569%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.611us, confidence interval +- 0.055383%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.645us, confidence interval +- 0.936527%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.000us, confidence interval +- 0.047267%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.610us, confidence interval +- 0.056379%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.645us, confidence interval +- 1.383868%)
[==========] 3 benchmarks ran.

The difference may be not huge, but reproducible. And on 32bits is predictably even bigger:

mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.134us, confidence interval +- 0.154648%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.319us, confidence interval +- 1.332820%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.536us, confidence interval +- 0.477121%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.129us, confidence interval +- 0.057105%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.241us, confidence interval +- 0.643688%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.532us, confidence interval +- 0.456539%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.136us, confidence interval +- 0.815983%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.281us, confidence interval +- 1.198090%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.564us, confidence interval +- 0.764337%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

amotin avatar Sep 16 '24 19:09 amotin

Thank you for the benchmark :+1:

On my local Ryzen 7 PRO 5850U, fixed to 1,4GHz it looks like this:

[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 2.163us, confidence interval +- 0.028878%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.221us, confidence interval +- 0.026899%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.221us, confidence interval +- 0.024906%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 2.163us, confidence interval +- 0.029074%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.221us, confidence interval +- 0.029488%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.221us, confidence interval +- 0.032090%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

On ARM Neoverse-N1 it looks like this:

[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.409us, confidence interval +- 2.150333%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.079us, confidence interval +- 0.106291%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.074us, confidence interval +- 0.153203%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
[mcmilk@lea ~/src/hash]$ ./a.out 
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.397us, confidence interval +- 0.093639%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.077us, confidence interval +- 0.118394%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.072us, confidence interval +- 0.164758%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

On PPC64LE (POWER10, pvr 0080 0200) - these is no big difference:

[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.979us, confidence interval +- 0.092565%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.924us, confidence interval +- 0.100753%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.924us, confidence interval +- 0.099303%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.979us, confidence interval +- 0.092653%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.924us, confidence interval +- 0.100670%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.924us, confidence interval +- 0.102227%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

So yes, it looks okay.

mcmilk avatar Sep 16 '24 20:09 mcmilk

Just for fun:

$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 6.005us, confidence interval +- 2.068187%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 4.971us, confidence interval +- 0.022014%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 5.015us, confidence interval +- 2.424526%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
$ grep Model /proc/cpuinfo
Model		: Raspberry Pi 4 Model B Rev 1.4

rincebrain avatar Sep 16 '24 21:09 rincebrain

Yea. At least building with Clang on FreeBSD dnode_hash_city2() and dnode_hash_city4() appeared identical until addition of __noinline to cityhash*() fixed the optimization (AMD EPYC 7402):

Thanks! I did not read the emitted assembly of different tests very carefully. So yes, specializing cityhash does some good stuff.

Harry-Chen avatar Sep 17 '24 03:09 Harry-Chen

Good stuff, thanks for providing the micro benchmark for this. If you can rebase this on master to resolve the conflict it should be good to go.

behlendorf avatar Sep 18 '24 18:09 behlendorf

I've rebased. Seems now CI takes much longer time...

Harry-Chen avatar Sep 19 '24 04:09 Harry-Chen