zig icon indicating copy to clipboard operation
zig copied to clipboard

crypto.sha2: Use intrinsics for SHA-256 on x86-64 and AArch64

Open topolarity opened this issue 1 year ago • 7 comments

There's probably plenty of room to optimize these further in the future, but for the moment this gives ~4-5x improvement on x86-64, and ~10x on M1 AArch64 Macs.

These extensions are very new. Most Intel processors prior to 2020 do not support them. AMD has supported them since Ryzen.

P.S. @kubkon I haven't fixed the CPU features for LLVM in stage2 yet, but at least this gives you something to play with until that's sorted 🙂

topolarity avatar Oct 23 '22 07:10 topolarity

In future it would be nice if we use inline assembly instead of LLVM intrinsics to enable the self-hosted backends to use these features too.

joachimschmidt557 avatar Oct 23 '22 08:10 joachimschmidt557

In future it would be nice if we use inline assembly instead of LLVM intrinsics to enable the self-hosted backends to use these features too.

Good point! Switched to inline asm for this PR.

In general, I think inline assembly can interact very differently with the optimizer versus intrinsics, but in this case it's actually a 20% win for Intel x86-64: Intel is over 1 GB/s now (4x improvement vs. master)

topolarity avatar Oct 23 '22 16:10 topolarity

Eugh... Intel x86-64 loves the new inline assembly, but AMD hates it:

intrinsics: 66870 us (1568.08 MB/s)
inline asm: 251760 us (416.50 MB/s)

I'll go digging through the assembly and see if I can figure out why.

topolarity avatar Oct 23 '22 16:10 topolarity

Eugh... Intel x86-64 loves the new inline assembly, but AMD hates it:

intrinsics: 66870 us (1568.08 MB/s)
inline asm: 251760 us (416.50 MB/s)

I'll go digging through the assembly and see if I can figure out why.

What about arm64? Comparable, or ahead, or behind? If the latter, that's bad timing as I wanted to start testing it out in zld repo.

kubkon avatar Oct 23 '22 17:10 kubkon

What about arm64? Comparable, or ahead, or behind? If the latter, that's bad timing as I wanted to start testing it out in zld repo.

arm64 on my M1 mac is about the same (<5% delta)

topolarity avatar Oct 23 '22 17:10 topolarity

What about arm64? Comparable, or ahead, or behind? If the latter, that's bad timing as I wanted to start testing it out in zld repo.

arm64 on my M1 mac is about the same (<5% delta)

Awesome! Thanks for confirming!

kubkon avatar Oct 23 '22 17:10 kubkon

Is there any trick to building any of this locally? On my M1Pro I am getting:

error: <inline asm>:2:1: instruction requires: sha2
sha256h.4s v2, v1, v3

kubkon avatar Oct 23 '22 17:10 kubkon

not sure if helpful but I have a script that benches SHA hashing comparing BoringSSL's implementation to Zig's implementation

This is for an old build of Zig without this PR's changes 0.10.0-dev.2822+b79884eaf on a Ryzen machine

[SHA256]

     zig: 3.292s
  boring: 518.223ms
     evp: 518.25ms
  evp in: 518.228ms

https://github.com/Jarred-Sumner/bun/blob/dea7cb14bdf0446fcb8a0750fe86b2056a7c3be0/src/sha.zig#L173-L230

Jarred-Sumner avatar Oct 24 '22 05:10 Jarred-Sumner

@kubkon what does zig build-exe --show-builtin output for you?

andrewrk avatar Oct 24 '22 06:10 andrewrk

@kubkon what does zig build-exe --show-builtin output for you?

pub const cpu: std.Target.Cpu = .{                                                                                                                                                                                                                                                                                                                   
    .arch = .aarch64,                                                                                                                                                                                                                                                                                                                                
    .model = &std.Target.aarch64.cpu.apple_a14,                                                                                                                                                                                                                                                                                                      
    .features = std.Target.aarch64.featureSet(&[_]std.Target.aarch64.Feature{                                                                                                                                                                                                                                                                        
        .aes,                                                                                                                                                                                                                                                                                                                                        
        .aggressive_fma,                                                                                                                                                                                                                                                                                                                             
        .alternate_sextload_cvt_f32_pattern,
        .altnzcv,
        .am,
        .arith_bcc_fusion,
        .arith_cbz_fusion,
        .ccdp,
        .ccidx,
        .ccpp,
        .complxnum,
        .contextidr_el2,
        .crc,
        .crypto,
        .disable_latency_sched_heuristic,
        .dit,
        .dotprod,
        .el2vmsa,
        .el3,
        .flagm,
        .fp16fml,
        .fp_armv8,
        .fptoint,
        .fullfp16,
        .fuse_address,
        .fuse_adrp_add,
        .fuse_aes,
        .fuse_arith_logic,
        .fuse_crypto_eor,
        .fuse_csel,
        .fuse_literals,
        .jsconv,
        .lor,
        .lse,
        .lse2,
        .mpam,
        .neon,
        .nv,
        .pan,
        .pan_rwv,
        .pauth,
        .perfmon,
        .predres,
        .ras,
        .rcpc,
        .rcpc_immo,
        .rdm,
        .sb,
        .sel2,
        .sha2,
        .sha3,
        .specrestrict,
        .ssbs,
        .tlb_rmi,
        .tracev8_4,
        .uaops,
        .v8_1a,
        .v8_2a,
        .v8_3a,
        .v8_4a,
        .v8a,
        .vh,
        .zcm,
        .zcz,
        .zcz_gp,
    }),
};

kubkon avatar Oct 24 '22 06:10 kubkon

Doesn't this make it impossible to do sha256 at comptime?

daurnimator avatar Oct 25 '22 04:10 daurnimator

Doesn't this make it impossible to do sha256 at comptime?

Good point, I hadn't thought of that. Yeah, it breaks comptime evaluation for SHA-256.

Without https://github.com/ziglang/zig/issues/868 (or a related hack), the only way to fix that is to create a separate comptime method that users call at comptime. That means it's viral, too: Any function that calls SHA-256 must split into a comptime and runtime version, as well as their callers and so on.

topolarity avatar Oct 25 '22 15:10 topolarity

For hash functions, comptime-support can indeed be an important thing to have.

That leaves us with either dynamic dispatch (which doesn't match the rest, and the plan to eventually do it), or, as you pointed out, a way to check if the function is called at comptime or not.

Doing this, even with a hack as a starter, sounds like a much better route that duplicating every function to have a comptine variant.

jedisct1 avatar Oct 25 '22 18:10 jedisct1

pub fn isComptime() bool {
    var t = true;
    var a: u8 = 0;
    var b: u16 = 0;
    const x = if (t) a else b;
    return @TypeOf(x) == u8;
}

andrewrk avatar Oct 28 '22 01:10 andrewrk

That's a nice trick!

jedisct1 avatar Oct 28 '22 13:10 jedisct1

CMake Error: The following variables are used in this project, but they are set to NOTFOUND. Please set them or make sure they are set and tested correctly in the CMake files: ZSTD

Hmm not sure what's up with the CI failures - I don't understand why this would be happening in this branch but not master branch.

Are you sure this is rebased against master branch? In master branch I see this:

-- The C compiler identification is Clang 15.0.3
-- The CXX compiler identification is Clang 15.0.3

However in this CI run I see this:

-- The C compiler identification is Clang 15.0.0
-- The CXX compiler identification is Clang 15.0.0

Possibly indicating an old tarball.

andrewrk avatar Oct 28 '22 22:10 andrewrk

Are you sure this is rebased against master branch?

Oops, I thought CI merged the PR branch into master automatically, but I see now that this doesn't apply changes to the CI script itself.

Just pushed a proper rebase - should be ready for review/merge assuming it passes.

topolarity avatar Oct 28 '22 22:10 topolarity

Re: the CI failures, I think

!isComptime() and comptime std.Target.x86.featureSetHas(builtin.cpu.features, .sha)

has to be changed to

comptime std.Target.x86.featureSetHas(builtin.cpu.features, .sha) and !isComptime()

See #6768 for more details.

Once we get rid of stage1, we can make featureSetHas inline and then that comptime keyword can go away.

andrewrk avatar Oct 29 '22 00:10 andrewrk

Thanks for the review! Yep, #6768 took me by surprise.

Should be all fixed up now :+1:

topolarity avatar Oct 29 '22 00:10 topolarity

Doesn't this make it impossible to do sha256 at comptime?

Good point, I hadn't thought of that. Yeah, it breaks comptime evaluation for SHA-256.

Without #868 (or a related hack), the only way to fix that is to create a separate comptime method that users call at comptime. That means it's viral, too: Any function that calls SHA-256 must split into a comptime and runtime version, as well as their callers and so on.

I don't think it's necessary to add another comptime method. I made some changes like below and run

zig test lib/std/std.zig --zig-lib-dir lib

image

The tests built and passed. The hash functions should already support running at comptile time, no need to add another one, am I right?

xcaptain avatar Nov 10 '23 16:11 xcaptain