zfs icon indicating copy to clipboard operation
zfs copied to clipboard

Remove unused Edon-R variants

Open mcmilk opened this issue 2 years ago • 24 comments

This commit removes the edonr_byteorder.h file and all unused variants of Edon-R.

Signed-off-by: Tino Reichardt [email protected]

Motivation and Context

Simplify the code and generate little speedups also ;-)

FreeBSD freebsd 13.1-RELEASE with clang 13.0.0:
implementation               1k      4k     16k     64k    256k      1m      4m     16m
edonr-generic-master       1982    2330    2435    2444    2468    2476    2130    2138
edonr-generic-newone       2202    2567    2679    2657    2618    2430    2373    2461

Debian Linux 5.10.0-16-amd64 with gcc version 10.2.1:
implementation               1k      4k     16k     64k    256k      1m      4m     16m
edonr-generic-master       1898    2495    2650    2748    2778    2700    2737    2739
edonr-generic-newone       2144    2750    2946    2987    2973    2805    2856    2978

The size of the compiled edonr.o with debugging went down from 574,128 to 111,672 bytes.

Description

How Has This Been Tested?

I have run the standard testing procedures on a test system... the CI seems also fine.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [x] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [x] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

mcmilk avatar Jul 02 '22 09:07 mcmilk

I think that ignoring these compiler warnigs may also a reason for memory corruption on i386 and maybe other small stack architectures. I will rework this patch the same way as done with blake3 by pre-allocating a per-cpu space.

mcmilk avatar Jul 02 '22 09:07 mcmilk

Have you seen memory corruption on i386 with this, then?

I've seen plenty of issues with it panicking on i386, but none of them seemed to be from memory corruption so far...

rincebrain avatar Jul 02 '22 13:07 rincebrain

Have you seen memory corruption on i386 with this, then?

I've seen plenty of issues with it panicking on i386, but none of them seemed to be from memory corruption so far...

I want it to fit into default stack size ... and remove the pragma then. Maybe this is some source of memory corruption, but I have not checked this in detail. The result for for overwriting stacks is mostly pure random.

When I get it into the stack space, the result is slower also :/

mcmilk avatar Jul 02 '22 14:07 mcmilk

Just say "I dislike pragmas", if that's why you're doing it.

rincebrain avatar Jul 02 '22 14:07 rincebrain

Checkstyle is fine, the benchmarks show the same values.... and I could minimize the stack size also a bit. The -Wframe-larger-than= pragma is needed for gcc: error: the frame size of 1356 bytes is larger than 1024 bytes Compiling the same file with clang, the stack size is fine - I have no idea why.

mcmilk avatar Jul 02 '22 20:07 mcmilk

I'd look at precisely what flags the kernel is passing to compile with make V=1, but in particular, I don't think many versions of clang know what-Wframe-larger-than is. (I see some discussion from 2021 about adding various stack usage warning limits, but that's newer than a number of distros' compiler versions, so I wouldn't imagine the kernel would be happy to rely on that working yet.)

rincebrain avatar Jul 02 '22 22:07 rincebrain

I'd look at precisely what flags the kernel is passing to compile with make V=1, but in particular, I don't think many versions of clang know what-Wframe-larger-than is. (I see some discussion from 2021 about adding various stack usage warning limits, but that's newer than a number of distros' compiler versions, so I wouldn't imagine the kernel would be happy to rely on that working yet.)

I am using gcc version 10.2.1 and clang version 11.0.1 currently. The -Wframe-larger-than is supported and I used -Wframe-larger-than=1 for getting infos about the stack usage of edonr.c (userspace, without ZFS headers).

Results:

# gcc 32bit seems stupid:
[mcmilk@latitude ~]$ gcc -c -Os -m32 -W -Wframe-larger-than=256 edonr.c
edonr.c: In function ‘Q512’:
edonr.c:217:1: warning: the frame size of 1328 bytes is larger than 256 bytes [-Wframe-larger-than=]
edonr.c: In function ‘EdonRHash’:
edonr.c:325:1: warning: the frame size of 400 bytes is larger than 256 bytes [-Wframe-larger-than=]

[mcmilk@latitude ~]$ gcc -c -m32 -O0 -W -Wframe-larger-than=256 edonr.c
edonr.c: In function ‘Q512’:
edonr.c:217:1: warning: the frame size of 2128 bytes is larger than 256 bytes [-Wframe-larger-than=]
edonr.c: In function ‘EdonRHash’:
edonr.c:325:1: warning: the frame size of 400 bytes is larger than 256 bytes [-Wframe-larger-than=]

[mcmilk@latitude ~]$ gcc -c -m32 -O1 -W -Wframe-larger-than=256 edonr.c
edonr.c: In function ‘Q512’:
edonr.c:217:1: warning: the frame size of 1304 bytes is larger than 256 bytes [-Wframe-larger-than=]
edonr.c: In function ‘EdonRHash’:
edonr.c:325:1: warning: the frame size of 400 bytes is larger than 256 bytes [-Wframe-larger-than=]

[mcmilk@latitude ~]$ gcc -c -m32 -O2 -W -Wframe-larger-than=256 edonr.c
edonr.c: In function ‘Q512’:
edonr.c:217:1: warning: the frame size of 1304 bytes is larger than 256 bytes [-Wframe-larger-than=]
edonr.c: In function ‘EdonRHash’:
edonr.c:325:1: warning: the frame size of 400 bytes is larger than 256 bytes [-Wframe-larger-than=]

[mcmilk@latitude ~]$ gcc -c -m32 -O3 -W -Wframe-larger-than=256 edonr.c
edonr.c: In function ‘Q512’:
edonr.c:217:1: warning: the frame size of 1304 bytes is larger than 256 bytes [-Wframe-larger-than=]
edonr.c: In function ‘EdonRHash’:
edonr.c:325:1: warning: the frame size of 432 bytes is larger than 256 bytes [-Wframe-larger-than=]

# gcc 64bit ok:
[mcmilk@latitude ~]$ gcc -c -Os -m64 -W -Wframe-larger-than=256 edonr.c
edonr.c: In function ‘EdonRHash’:
edonr.c:325:1: warning: the frame size of 400 bytes is larger than 256 bytes [-Wframe-larger-than=]

# clang, all ok:
[mcmilk@latitude ~]$ clang -c -Os -m32 -W -Wframe-larger-than=256 edonr.c
edonr.c:121:1: warning: stack frame size of 432 bytes in function 'Q512' [-Wframe-larger-than=]
edonr.c:318:1: warning: stack frame size of 412 bytes in function 'EdonRHash' [-Wframe-larger-than=]

[mcmilk@latitude ~]$ clang -c -Os -m64 -W -Wframe-larger-than=256 edonr.c
edonr.c:318:1: warning: stack frame size of 424 bytes in function 'EdonRHash' [-Wframe-larger-than=]

mcmilk avatar Jul 03 '22 07:07 mcmilk

Neat, so it believes in it.

It seems like there's no difference in configuration for that flag between compilers, so my only guess would be if Clang is doing a better job at stack optimization...

I'm curious how gcc is behaving so badly at it, in your pasted example for -m32. Maybe clang is playing loose with what's required to be on the stack? Dunno. Might also be interesting to compare with -O2, since I believe that's what the kernel defaults to...

rincebrain avatar Jul 03 '22 07:07 rincebrain

Neat, so it believes in it.

It seems like there's no difference in configuration for that flag between compilers, so my only guess would be if Clang is doing a better job at stack optimization...

I'm curious how gcc is behaving so badly at it, in your pasted example for -m32. Maybe clang is playing loose with what's required to be on the stack? Dunno. Might also be interesting to compare with -O2, since I believe that's what the kernel defaults to...

The comment above got an update...

Each call within Q512() generates some used bytes on the stack at x32 (for gcc) ... when I remove some rounds in the end ... the stack gets under 1024 ... who cares ;-)

mcmilk avatar Jul 03 '22 07:07 mcmilk

@rincebrain - could you make a Code Review? I haven't touched much code... mainly removing the unused variants and some re-using of variables by introducing these ones: uint64_t z1, z2, z3, z4, z5, z6, z7, z8;

mcmilk avatar Jul 11 '22 13:07 mcmilk

I'll try to get to actually looking it over in depth later this week, but when I tested it, it didn't demonstrate any obvious incorrectness on a 100 GB edonr dataset generated with the old code...

I'll also try and give it a spin on my BE systems just to be safe.

rincebrain avatar Jul 21 '22 11:07 rincebrain

I rebased again, cause the unused define EDONR_VALID_HASHBITLEN was left...

mcmilk avatar Jul 22 '22 09:07 mcmilk

Sorry, it turned out to be a busy week (your primary home server dying does that), and I haven't had a chance to look at this.

I am, frankly, more likely to go fix something broken when I get time first, though...

rincebrain avatar Aug 01 '22 11:08 rincebrain

  • Rich Ercolani schrieb:

Sorry, it turned out to be a busy week (your primary home server dying does that), and I haven't had a chance to look at this.

No problem.

I am, frankly, more likely to go fix something broken when I get time first, though...

I am working on this.... but it isn't so easy.

mcmilk avatar Aug 01 '22 12:08 mcmilk

Oh? You almost certainly have more state on this than me, but since you never answered why you invented your own complex system for it, I was going to just go copy the fletcher4 toggle and see what broke.

rincebrain avatar Aug 01 '22 12:08 rincebrain

Oh? You almost certainly have more state on this than me, but since you never answered why you invented your own complex system for it, I was going to just go copy the fletcher4 toggle and see what broke.

I am using a skeleton now, first start is here: https://github.com/openzfs/zfs/compare/master...mcmilk:zfs:sha2-rework Edit: to answer your question why, because I wanted to have a check for validity of all methods outside of the kernel... thats why the switching between the impls was done this way... for the module load I will take the fletcher-4 as skeleton

mcmilk avatar Aug 01 '22 12:08 mcmilk

@rincebrain - Sorry for rebasing this branch a little to often. The patch limits the -Wframe-larger-than= pragma to 32-bit architectures - should this be mentioned in the commit message?

mcmilk avatar Aug 11 '22 03:08 mcmilk

Potentially.

As far as apologizing, I appreciate it, but please don't worry about it - I wasn't trying to suggest you were doing something wrong, plenty of people just push the whole thing over and over again, just explaining why I was waiting for it to settle down...

I'll try to remember to give this a once over in the next 24h, and a runthrough on my small menagerie of unusual systems.

rincebrain avatar Aug 11 '22 09:08 rincebrain

Potentially.

As far as apologizing, I appreciate it, but please don't worry about it - I wasn't trying to suggest you were doing something wrong, plenty of people just push the whole thing over and over again, just explaining why I was waiting for it to settle down...

I'll try to remember to give this a once over in the next 24h, and a runthrough on my small menagerie of unusual systems.

I just rebased against master and updated one comment and some c styling issues. In detail: statements like these: x1+x2 got written like this x1 + x2 now.

mcmilk avatar Aug 27 '22 07:08 mcmilk

PING

mcmilk avatar Sep 12 '22 20:09 mcmilk

Sure, sorry, I kind of burnt out on working on OpenZFS for a bit so I took a break.

I've got this branch compiling on my sparc and x86en and will give it a runthrough tonight, then review it if nothing blows up sometime in the next couple days.

My concern, though, is that this seems like a nontrivial rework of something high-impact, and those benchmark improvements seem smaller than the amounts I've seen those benchmarks vary across module loads. So I'm worried about whether the improvements are worth it? I'm obviously not the final authority on anything, though.

rincebrain avatar Sep 12 '22 21:09 rincebrain

Lest you think I forgot, it made my SPARC panic. Sadly, it somehow failed to log the trace for why, so I'm having to run it again to see if it's unrelated.

rincebrain avatar Sep 13 '22 19:09 rincebrain

Lest you think I forgot, it made my SPARC panic. Sadly, it somehow failed to log the trace for why, so I'm having to run it again to see if it's unrelated.

I created a small testing tool: https://github.com/mcmilk/edonr-testing

The old vs new code got checked on SPARC64 and PPC64 ...

Edit - I forget some benchmarks ;) Edit2 - the benchmarks are here now: https://github.com/mcmilk/edonr-testing

mcmilk avatar Sep 13 '22 21:09 mcmilk

Interesting, something is absolutely smashing the stack on my SPARC when I run the test suite, and panicking in random places all over Linux as a result.

Time to run it again on the vanilla tree and see if it keeps doing it.

rincebrain avatar Sep 13 '22 21:09 rincebrain

Rebased to current master.

mcmilk avatar Oct 07 '22 13:10 mcmilk

@behlendorf - this PR removes in the first place the unused 224, 256 and 384 bit variants of EdonR. It also renames the quasi_exform512() macro to Q512() and adds these shared variables for all three macros: uint64_t z1, z2, z3, z4, z5, z6, z7, z8; Before we had 8 extra variables per macro ... LS1_512() LS2_512() and QEF_512()

These changes do a little speed up on all platforms, but not the sparc64. There the 'more variables' code is a bit faster. What do you think about this code cleanup?

If it's not needed, cause nobody uses EdonR ... I could just close it anyway ;-)

mcmilk avatar Oct 13 '22 20:10 mcmilk

I'm not Brian, but I at least use EdonR when I want nopwrite. :)

I apparently have been bad at replying directly in here, but the smashing on sparc64 had nothing to do with this PR, to be clear, it just...does that now, I guess. :/

My concern, which I don't think I posted already, sorry, is that while I don't think any of the cleanup is particularly controversial and you could probably convince something to check and agree that the changes don't change the behavior for any codepaths that are hit, if you were really motivated in that way, I just get worried about any sort of large-scale changes to anything core like checksums or compression (...I say with my fingers in many branches of both pies), and I'm wondering how much the cleanup and performance delta is worth any risk of changing it, even marginal. (Normally I'd also be worried about ever syncing with upstream, but I think we can all agree that's never changing again.)

I also tend to optimize toward avoiding risk when giving advice outside of what I do to my own systems, though, so perhaps I'm being overcautious.

rincebrain avatar Oct 13 '22 21:10 rincebrain

@mcmilk I'm sure edonr is used and nobody would complain about a speed up! But I do share @rincebrain's concerns above about this being a non-trivial change. Creating a testing tool is great, but can you also do some some manual testing if you haven't already. I'm thinking something along the lines of creating a pool on one platform, filling it with some data, then importing the pool on other platforms and scrub it to verify those implementations. It seems like that should catch anything seriously broken.

behlendorf avatar Oct 13 '22 21:10 behlendorf

@mcmilk I'm sure edonr is used and nobody would complain about a speed up! But I do share @rincebrain's concerns above about this being a non-trivial change. Creating a testing tool is great, but can you also do some some manual testing if you haven't already. I'm thinking something along the lines of creating a pool on one platform, filling it with some data, then importing the pool on other platforms and scrub it to verify those implementations. It seems like that should catch anything seriously broken.

I only have x86_64 arch VMs with full root access. Sth. like Arm, Aarch64, PPC and so on is only userspace testing. But I cold ask for a PPC64 VM @ https://openpower.ic.unicamp.br/minicloud/ ... maybe we could also get a buildbot VM there.

Has someone of you (@rincebrain @behlendorf) already tried to get a machine there?

My Raspberry Pi3 may also an option... I forgot. Will import with old zfs there... this should be a simple testcase... will do this in the weekend.

mcmilk avatar Oct 14 '22 05:10 mcmilk

I've not, but I've heard it's straightforward and have suggested adding a CI bot with one.

I do have sparc64 and aarch64 laying around my home, and usually have a ppc64le VM but the host is off for maintenance atm.

rincebrain avatar Oct 14 '22 05:10 rincebrain