asm-hashes Improved sha256 and sha512 assembler code versions available for x86 / x64

I recently noticed that @nayuki, the author of some of the assembler code used in this crate, published performance-improved versions of their code in 2024 at https://github.com/nayuki/Nayuki-web-published-code/tree/master/fast-sha2-hashes-in-x86-assembly. The individual improvement steps are described in the git commit messages.

I'm aware of the "maintenance mode" status of the asm-hashes repository and the general goal of moving to Rust with inline assembly, but still wanted to flag this potential code improvement to the maintainers and potential other users who are interested in this crate.

For sha2/src/sha512_x64.S, the net performance improvement seen on two different AMD Zen3 CPUs was minor, in the range of ~1%. Performance improvements could be different on other CPUs or x86 architectures, though.

For modern CPUs produced in the last ~10 years, there are significant additional speedups possible, which I'll document in a separate issue.

Jan 25 '25 11:01 invd

The new SHA512 instructions should be leveraged using intrinsics in the sha2 crate. Unfortunately, the relevant intrinsics and target features are currently unstable, so this new backend would have to be experimental (i.e., gated on a crate feature or configuration flag).
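To make the gating idea concrete, here is a minimal sketch of how an experimental backend could be hidden behind a configuration flag and then selected at runtime. The cfg name `sha2_experimental_sha512ni`, the `Backend` enum and `select_backend` are hypothetical names made up for illustration, not the sha2 crate's actual API. Depending on the toolchain, detecting the `sha512` feature inside the gated branch may itself require nightly.

```rust
/// Hypothetical backend selector; all names here are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    /// Portable code path that works on every target.
    Portable,
    /// Experimental path using the SHA512 instructions.
    Sha512Ni,
}

#[allow(unexpected_cfgs)]
pub fn select_backend() -> Backend {
    // This block is only compiled when the (hypothetical) opt-in flag is set,
    // e.g. via RUSTFLAGS="--cfg sha2_experimental_sha512ni".
    #[cfg(all(
        sha2_experimental_sha512ni,
        any(target_arch = "x86", target_arch = "x86_64")
    ))]
    {
        // Runtime check, so one binary still runs on CPUs without the extension.
        if std::is_x86_feature_detected!("sha512") && std::is_x86_feature_detected!("avx2") {
            return Backend::Sha512Ni;
        }
    }
    Backend::Portable
}
```

Without the opt-in flag the gated block is compiled out entirely, so `select_backend()` always returns `Backend::Portable` and a stable toolchain never sees the experimental code.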

The linked assembly could be a good reference point and we may use it for an asm!-based implementation in sha2, but we do not plan to use .S files in future versions of our crates.

Jan 25 '25 13:01 newpavlov

@newpavlov thanks for your quick feedback. To clarify, this particular issue #82 is about "there are slightly improved versions available for the code you already use from Nayuki", which would be a drop-in replacement except for minor details (#if snippets, whitespace). As far as I'm aware, they do not make use of any new instruction types.

#83 is about "you could use other asm code implementations with modern CPU acceleration instructions" in asm-hashes, which would probably lead to a +50% improvement for SHA512, and even more for SHA1/SHA256. I understand that this is a heavier lift and doesn't match your strategic plans, but wanted to document the possibility and related observations on the lower-than-expected performance of asm-hashes. From your strategic perspective of moving users to a native hashes solution, the documentation aspect of "asm-hashes is already slower than the native code on some hardware" is perhaps the most relevant to you.

The third aspect is potentially improving the native hashes implementation to be more competitive with modern asm code in other projects - that's probably best discussed in the other repository 🙂

Jan 25 '25 13:01 invd

The new SHA512 instructions should be leveraged using intrinsics in the sha2 crate. Unfortunately, the relevant intrinsics and target features are currently unstable, so this new backend would have to be experimental (i.e., gated on crate feature or configuration flag).

We can do something similar to what we did for ARM for a while, and "polyfill" the unstable intrinsics by making them wrappers for small bits of inline assembly which would otherwise be emitted by the intrinsics. Though, are ZMM registers still unstable?

Jan 25 '25 17:01 tarcieri

_mm256_sha512* intrinsics work on YMM registers, so I guess it should be possible to polyfill them. Also, surprisingly, std does not have those intrinsics yet (or they were removed).
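As a rough illustration of the polyfill idea, a wrapper around a single inline-assembly instruction might look like the sketch below. The function name `sha512rnds2_polyfill` is hypothetical, and the operand order is an assumption based on the `vsha512rnds2 ymm1, ymm2, xmm3` instruction form. It has not been validated against the real intrinsic, so treat it as a starting point rather than a tested implementation.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, __m256i};

/// Hypothetical stand-in for `_mm256_sha512rnds2_epi64`.
///
/// Safety: the caller must make sure the CPU supports AVX and the SHA512
/// extension; otherwise executing the instruction faults.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
pub unsafe fn sha512rnds2_polyfill(mut a: __m256i, b: __m256i, k: __m128i) -> __m256i {
    unsafe {
        core::arch::asm!(
            // Intel syntax: the first operand is both a source and the destination.
            "vsha512rnds2 {a}, {b}, {k}",
            a = inout(ymm_reg) a,
            b = in(ymm_reg) b,
            k = in(xmm_reg) k,
            options(pure, nomem, nostack, preserves_flags),
        );
    }
    a
}
```

Only YMM and XMM register classes are used here, so the sketch does not depend on ZMM register support. A caller would still need a runtime check for the SHA512 extension before invoking the wrapper.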

Jan 25 '25 21:01 newpavlov

It looks like they're available now: https://doc.rust-lang.org/nightly/core/arch/x86/fn._mm256_sha512rnds2_epi64.html

May 18 '25 01:05 tarcieri