
[WIP] ICP: AES_GCM: add support for SSE4.1 acceleration

Open AttilaFueloep opened this issue 2 years ago • 50 comments

Motivation and Context

Small NAS solutions are often based on hardware with no AVX support, where encryption performs subpar. Most of them do support SSE4.1, on the other hand. Supporting SSE4.1 will give them a massive boost in AES-GCM crypto performance.

Description

This is still a draft, a detailed description will be added once this is ready for merging.

Closes #10846

How Has This Been Tested?

Inside a VM, two fio invocations on an encrypted dataset and mprime -t were run in parallel. Various other tests were done as well. Finally, this PR contains debugging code which runs the AVX and SSE implementations on the same context and compares the results (see the DEBUG_GCM_ASM #define).
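
For illustration, a stress setup along those lines might look like the following; the fio job parameters and paths are examples, not the exact invocations used.

# two fio jobs on an encrypted dataset plus a CPU/memory stress run
fio --name=readers --directory=/test/enc --rw=randread --bs=128k --size=2G \
    --numjobs=2 --time_based --runtime=600 &
fio --name=writers --directory=/test/enc --rw=randwrite --bs=128k --size=2G \
    --numjobs=2 --time_based --runtime=600 &
mprime -t &
wait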

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [ ] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

AttilaFueloep avatar Feb 25 '23 01:02 AttilaFueloep

While it's still a draft, I took care to squash the commits in a way which should make reviewing this straightforward.

I did quite a bit of testing, which didn't reveal any problems, but at this point I'd appreciate any help with testing on a wider range of systems. If you need any help on how to test this, please drop a note here and I'll provide instructions. If you want to test this, in any case please do it on a test VM with a test pool, since this code could permanently corrupt any pools accessed from it.

There's still some cleanup and some minor refinements left, so this is a draft for now. I'm also wondering if I should keep the debugging code in the #ifdef DEBUG_GCM_ASM conditionals in the final PR. I'm leaning towards it, since it would ease development of the planned follow-ups for AVX2 and maybe VAES. It could be removed once the dust settles and the code gets good usage coverage.
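
To sketch the idea only (the entry points gcm_encrypt_sse()/gcm_encrypt_avx() below are hypothetical names, not the PR's actual debug code): with DEBUG_GCM_ASM enabled, an encryption call runs both implementations on identical copies of the state and asserts that the outputs agree.

#ifdef DEBUG_GCM_ASM
/*
 * Illustrative sketch only; the real debug code in this PR may differ.
 */
static void
gcm_debug_compare_impls(gcm_ctx_t *ctx, const uint8_t *in, uint8_t *out,
    size_t len)
{
	gcm_ctx_t avx_ctx;
	uint8_t *avx_out = kmem_alloc(len, KM_SLEEP);

	memcpy(&avx_ctx, ctx, sizeof (avx_ctx));	/* same starting context */
	gcm_encrypt_sse(ctx, in, out, len);		/* implementation under test */
	gcm_encrypt_avx(&avx_ctx, in, avx_out, len);	/* reference implementation */
	ASSERT0(memcmp(out, avx_out, len));		/* results must be identical */
	kmem_free(avx_out, len);
}
#endif /* DEBUG_GCM_ASM */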

AttilaFueloep avatar Feb 25 '23 01:02 AttilaFueloep

The people at ixsystems may be interested in testing this too. Surely they'll have a far better test infrastructure for this than I do.

AttilaFueloep avatar Feb 25 '23 01:02 AttilaFueloep

@lundman I put major effort into making this clang-compatible on Linux. If you have a spare hour, you may want to test whether it works on macOS and Windows as well.

AttilaFueloep avatar Feb 25 '23 02:02 AttilaFueloep

One important point regarding testing I missed above:

If you are testing this on hardware which has AVX support, you need to select the SSE implementation explicitly by doing

# echo  sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl

or by setting it at module load time.
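
For example (the modprobe.d file name below is just an example):

# modprobe zfs icp_gcm_impl=sse4_1

or, persistently across module loads:

# echo "options zfs icp_gcm_impl=sse4_1" >/etc/modprobe.d/zfs.conf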

AttilaFueloep avatar Feb 25 '23 02:02 AttilaFueloep

Excellent, I will give it a go on macOS and Windows.

lundman avatar Feb 25 '23 03:02 lundman

Thank you a lot for your work!

I have a ReadyNAS 426 with an Intel(R) Atom(TM) CPU C3538 @ 2.10GHz, which has no AVX but does support SSE3, SSE4.1 & SSE4.2. I'd be willing to test your implementation.

I have 6 × 8TB drives in ZFS for storage (ZFS directly uses the disks and autoformatted them to GPT: 7.3T ZFS + 8M reserved) and 2 × 64GB High Endurance microSD cards in RAID with Debian bookworm.

Could you please provide instructions on how I can best test this PR?

oberien avatar Feb 25 '23 09:02 oberien

I did a light fio run on a small Celeron J3455 [Goldmont] board with SSE4.2 but no AVX. Using a single-vdev tmpfs-backed pool, no compression:

  • generic: ~3.7MB/s
  • pclmulqdq: ~6.1MB/s
  • sse4_1: ~13.2MB/s
  • no encryption: ~17MB/s

fio was: fio --name=xxx --size=4K --bs=4K --rw=randrw --ioengine=psync --sync=1 --iodepth=32 --numjobs=1 --direct=1 --end_fsync=1 --gtod_reduce=1 --time_based --runtime=60

Hardly science, but those initial numbers seem pretty good. Strong work!

robn avatar Feb 25 '23 10:02 robn

Thank you @AttilaFueloep, I too have very promising results on a C3558 system running Debian Bullseye under a KVM hypervisor with a WD Black SN750 NVMe disk.

In my tests below I just used dd to read a big file with the ARC size limited to 500MB.

Unencrypted read speed was 909MB/s, the old pclmulqdq implementation did 86MB/s, and the new sse4_1 implementation increased the speed to 670MB/s! With compression enabled, the speed stayed roughly the same. For reference, FreeBSD achieved roughly similar speed (660MB/s) in my previous tests on a similar setup.

# uname -a
Linux labdebian 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

# echo 524288000 >/sys/module/zfs/parameters/zfs_arc_max
# zfs create -o compression=off test/noenc
# zfs create -o encryption=on -o keyformat=passphrase -o compression=on test/enc2
# zfs create -o encryption=on -o keyformat=passphrase -o compression=off test/enc3
# cp /mnt/media/test.tar /test/noenc/
# cp /mnt/media/test.tar /test/enc3/

# dd if=/test/noenc/test.tar bs=128k of=/dev/null status=progress
6319505408 bytes (6.3 GB, 5.9 GiB) copied, 7 s, 903 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 7.54886 s, 909 MB/s

# echo generic > /sys/module/zfs/parameters/icp_gcm_impl
# dd if=/test/enc3/test.tar bs=128k of=/dev/null status=progress
6840385536 bytes (6.8 GB, 6.4 GiB) copied, 189 s, 36.2 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 189.506 s, 36.2 MB/s

# echo pclmulqdq > /sys/module/zfs/parameters/icp_gcm_impl
# dd if=/test/enc3/test.tar bs=128k of=/dev/null status=progress
6820200448 bytes (6.8 GB, 6.4 GiB) copied, 79 s, 86.3 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 79.4401 s, 86.3 MB/s

# echo sse4_1 > /sys/module/zfs/parameters/icp_gcm_impl
# dd if=/test/enc3/test.tar bs=128k of=/dev/null status=progress
6715080704 bytes (6.7 GB, 6.3 GiB) copied, 10 s, 672 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 10.1724 s, 674 MB/s

# dd if=/test/enc2/test.tar bs=128k of=/dev/null status=progress
6629883904 bytes (6.6 GB, 6.2 GiB) copied, 10 s, 663 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 10.2971 s, 666 MB/s

AdamLantos avatar Feb 26 '23 02:02 AdamLantos

@oberien Well, I'd suggest the following:

Spin up a VM which has all the dependencies for building ZFS installed, but not the ZFS packages themselves.

Inside the VM do:

git clone http://github.com/openzfs/zfs/
cd zfs
git pull origin pull/14531/head
./autogen.sh && ./configure --prefix=/ --enable-debuginfo --enable-debug && make 

Open a new root terminal inside the VM, since for me the installation step failed when run with sudo.

cd <dir where you build zfs>
make install
modprobe zfs
echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl
zpool create ...
zfs create -o encryption= ...

Run your favorite workloads and benchmark tools.
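
For a quick comparison between implementations you could, for example, rerun the same read workload while switching icp_gcm_impl; the paths below are examples.

# limit the ARC so a large test file isn't served from cache
echo 524288000 >/sys/module/zfs/parameters/zfs_arc_max
for impl in generic pclmulqdq sse4_1; do
    echo $impl >/sys/module/zfs/parameters/icp_gcm_impl
    dd if=/testpool/enc/bigfile bs=128k of=/dev/null
done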

AttilaFueloep avatar Feb 26 '23 11:02 AttilaFueloep

I tested using an 8GB Ramdisk both on a Debian host (shipped ZFS) and in a Debian VM (zfs compiled from this PR).

Interestingly, ./configure states:

checking whether host toolchain supports AVX... yes
checking whether host toolchain supports AVX2... yes
checking whether host toolchain supports AVX512F... yes

Setup VM (on Debian):

sudo apt install --no-install-recommends qemu-system libvirt-clients libvirt-daemon-system virtinst dnsmasq
virsh net-start default
virt-install --virt-type kvm --name bullseye-amd64 \
--location http://deb.debian.org/debian/dists/bullseye/main/installer-amd64/ \
--os-variant debian11 \
--disk size=10 --memory 10000 \
--graphics none \
--console pty,target_type=serial \
--extra-args "console=ttyS0"

Setup ZFS:

sudo apt install build-essential git automake libtool gettext zlib1g-dev uuid-dev libblkid-dev libssl-dev linux-headers-$(uname -r) libaio-dev libudev-dev
git config --global pull.rebase true
git config --global user.email [email protected]
git clone https://github.com/openzfs/zfs/
cd zfs
git pull origin pull/14531/head
./autogen.sh && ./configure --prefix=/ --enable-debuginfo --enable-debug && make 
make install
depmod
modprobe zfs
echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl

Test code:

zfs --version
mkdir /tmp/tmpfs
mount -t tmpfs none /tmp/tmpfs
truncate -s 8G /tmp/tmpfs/test
zpool create -o ashift=12 -O mountpoint=none test /tmp/tmpfs/test

# without encryption
zfs create -o atime=off -o compression=off -o mountpoint=/test test/test
dd if=/dev/zero of=/test/test bs=1M count=4k
zpool export test && zpool import -d /tmp/tmpfs/test test -l
dd if=/test/test of=/dev/null bs=1M count=4096
zfs destroy test/test

# with encryption
dd if=/dev/urandom of=/tmp/test_key bs=1 count=32
zfs create -o encryption=aes-256-gcm -o keyformat=raw -o keylocation=file:///tmp/test_key -o atime=off -o compression=off -o mountpoint=/test test/test
dd if=/dev/zero of=/test/test bs=1M count=4k
zpool export test && zpool import -d /tmp/tmpfs/test test -l
dd if=/test/test of=/dev/null bs=1M count=4096
zfs destroy test/test

# cleanup
zpool destroy test
rmdir /test
rm /tmp/test_key
rm /tmp/tmpfs/test
umount /tmp/tmpfs
rmdir /tmp/tmpfs

Results:

For comparison `perf bench mem memcpy`:
Host: 2.64GB/s
Guest: 1.93GB/s

Output host unencrypted:
zfs-2.1.9-1
zfs-kmod-2.1.7-1
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 7.29287 s, 589 MB/s
read:  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.66974 s, 920 MB/s

Output host encrypted:
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 27.5948 s, 156 MB/s
read:  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 22.4259 s, 192 MB/s

Output guest unencrypted:
zfs-2.1.99-1750_g57314b5ec
zfs-kmod-2.1.99-1750_g57314b5ec
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 9.31044 s, 461 MB/s
read:  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 7.15297 s, 600 MB/s

Output guest encrypted:
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 14.0712 s, 305 MB/s
read:  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 15.1526 s, 283 MB/s

tldr: Host without encryption is about 150% as fast as the VM. Host with encryption is about 50% as fast as the VM. Assuming the unencrypted difference is due to VM overhead, I'd expect the host with encryption and this patch to be in the ballpark of ~3× as fast as before.

oberien avatar Feb 26 '23 22:02 oberien

@oberien

Interestingly, ./configure states:

checking whether host toolchain supports AVX... yes
checking whether host toolchain supports AVX2... yes
checking whether host toolchain supports AVX512F... yes

configure just checks whether your toolchain can build the SIMD instructions. Whether the CPU actually provides them is checked at runtime (e.g. zfs_avx_available()).
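
For illustration, the runtime gate presumably looks roughly like the existing AVX one; gcm_sse_will_work() below is a placeholder name for this sketch, while the zfs_*_available() helpers are the existing runtime CPU-feature checks.

/*
 * Sketch only -- configure's toolchain test merely determines whether this
 * code gets built; whether the CPU can run it is decided here at runtime.
 */
static inline boolean_t
gcm_sse_will_work(void)
{
	return (zfs_sse4_1_available() && zfs_aes_available() &&
	    zfs_pclmulqdq_available());
}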

AttilaFueloep avatar Feb 27 '23 08:02 AttilaFueloep

macOS changes to compile:

https://github.com/openzfsonosx/openzfs-fork/commit/df3931c27302acbe2f59a0b9e72d5e474566e968

The FN_NAME fix is a quick hack, we should discuss that sometime.

# sysctl kstat.zfs.darwin.tunable.icp_gcm_impl=sse4_1
kstat.zfs.darwin.tunable.icp_gcm_impl: cycle [fastest] sse4_1 generic pclmulqdq  -> cycle fastest [sse4_1] generic pclmulqdq

# ./zfs create -o encryption=aes-256-gcm -o keyformat=passphrase BOOM/gcm
Enter new passphrase:
Re-enter new passphrase:

panic

0xffffff801e72b0f0 : 0xffffff801678ba40 mach_kernel : trap_from_kernel + 0x26
0xffffff801e72b110 : 0xffffff7f99e92947 org.openzfsonosx.zfs : _initial_num_blocks_is_2_34 + 0x15c
0xffffff801e72b370 : 0xffffff7f99d51d56 org.openzfsonosx.zfs : _aes_encrypt_contiguous_blocks + 0x36
0xffffff801e72b390 : 0xffffff7f99d3a1c1 org.openzfsonosx.zfs : _crypto_update_uio + 0x251
0xffffff801e72b3f0 : 0xffffff7f99d3ac21 org.openzfsonosx.zfs : _aes_encrypt_atomic + 0x181
0xffffff801e72b510 : 0xffffff7f99d387cd org.openzfsonosx.zfs : _crypto_encrypt + 0xdd
0xffffff801e72b590 : 0xffffff7f99d1b880 org.openzfsonosx.zfs : _zio_do_crypt_uio + 0x1c0
0xffffff801e72b6b0 : 0xffffff7f99d1b693 org.openzfsonosx.zfs : _zio_crypt_key_wrap + 0x153
frame #0: 0xffffff7f99e92947 zfs`initial_num_blocks_is_2_34 at isalc_gcm_sse.S:2342
(lldb) disass
zfs`initial_num_blocks_is_2_34:
    0xffffff7f99e927eb <+0>:    movdqu %xmm8, %xmm6
    0xffffff7f99e927f0 <+5>:    movdqu 0x10(%rdi), %xmm9
    0xffffff7f99e927f6 <+11>:   paddd  0xbd231(%rip), %xmm9      ; ZERO + 14
    0xffffff7f99e927ff <+20>:   movdqa %xmm9, %xmm7
    0xffffff7f99e92804 <+25>:   pshufb 0xbd1b3(%rip), %xmm7      ; TWOONE + 62
    0xffffff7f99e9280d <+34>:   paddd  0xbd21a(%rip), %xmm9      ; ZERO + 14
    0xffffff7f99e92816 <+43>:   movdqa %xmm9, %xmm8
    0xffffff7f99e9281b <+48>:   pshufb 0xbd19b(%rip), %xmm8      ; TWOONE + 62
    0xffffff7f99e92825 <+58>:   movdqu (%r8), %xmm0
    0xffffff7f99e9282a <+63>:   pxor   %xmm0, %xmm7
...
    0xffffff7f99e92936 <+331>:  movdqu (%rdx,%r11), %xmm12
    0xffffff7f99e9293c <+337>:  pxor   %xmm12, %xmm7
    0xffffff7f99e92941 <+342>:  movdqu %xmm7, (%rsi,%r11)
->  0xffffff7f99e92947 <+348>:  addq   0x6, %r11
    0xffffff7f99e9294f <+356>:  pshufb 0xbd068(%rip), %xmm7      ; TWOONE + 62
    0xffffff7f99e92958 <+365>:  movdqu (%rdx,%r11), %xmm12
    0xffffff7f99e9295e <+371>:  pxor   %xmm12, %xmm8
show registers

(lldb) register read -all
General Purpose Registers:
       rax = 0xffffff7f99f5adf0  zfs`isalc_ops
       rbx = 0xffffff90c8094ae8
       rcx = 0x0000000000000020
       rdx = 0xffffff801e72b950
       rdi = 0xffffff801e72b408
       rsi = 0xffffff90c8094ae8
       rbp = 0xffffff801e72b370
       rsp = 0xffffff801e72b200
        r8 = 0xffffff90c9c77a00
        r9 = 0xffffff90c96f7840
       r10 = 0x0000000000000020
       r11 = 0x0000000000000000
       r12 = 0x0000000000000002
       r13 = 0x0000000000000020
       r14 = 0xffffff801e72b280
       r15 = 0x0000000000000000
       rip = 0xffffff7f99e92947  zfs`initial_num_blocks_is_2_34 + 348
    rflags = 0x0000000000010246
        cs = 0x0000000000000008
        fs = 0x00000000ffff0000
        gs = 0x000000001e720000

Floating Point Registers:
       fcw = 0x0000
       fsw = 0x0000
       ftw = 0x00
       fop = 0x0000
        ip = 0x00000000
        cs = 0x0000
        dp = 0x00000000
        ds = 0x0000
     mxcsr = 0x00000000
  mxcsrmask = 0x00000000
     stmm0 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm1 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm2 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm3 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm4 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm5 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm6 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm7 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm0 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm1 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm2 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm3 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm4 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm5 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm6 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm7 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm8 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm9 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm10 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm11 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm12 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm13 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm14 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm15 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}

Exception State Registers:
3 registers were unavailable.

lundman avatar Feb 28 '23 02:02 lundman

@lundman Thanks for testing on macOS.

The odd thing I spotted is this:

0xffffff7f99e92947 <+348>:  addq   0x6, %r11
                                   ^^^

That should read

0xffffff7f99e92947 <+348>:  addq   $0x10, %r11

EDIT:

Removed a lot of nonsense I wrote while sleep-deprived and added what I should've written in the first place.

AttilaFueloep avatar Feb 28 '23 16:02 AttilaFueloep

Oh, I assumed it was because r11 is NULL, and addq 0x6, %r11 would dereference it, but I admit I didn't stop and think about whether it even does a dereference.

lundman avatar Feb 28 '23 23:02 lundman

That should read 0x10, not 0x06 ($16 in the code, so maybe the 1 got lost?).

No it really is gone:

add $16, \DATA_OFFSET

# llvm-objdump --disassemble ./module/icp/asm-x86_64/modes/macos_zfs-isalc_gcm256_sse.o
    4827: 4c 03 1c 25 06 00 00 00       addq    6, %r11

add 0x10, \DATA_OFFSET

    4827: 4c 03 1c 25 10 00 00 00       addq    16, %r11

Could it be $1 is eaten by the macro definition? Just add 16, %r11 works, so possibly .rept expects $i style notation.

lundman avatar Mar 01 '23 00:03 lundman

More on panic

 panic
panic(cpu 1 caller 0xffffff800dd69aef): Kernel trap at 0xffffff7f91292911, type 14=page fault, registers:
CR0: 0x0000000080010033, CR2: 0x0000000000000010, CR3: 0x0000000011a69000, CR4: 0x00000000003606e0
RAX: 0xffffff7f9135adf0, RBX: 0xffffff90b0094ae8, RCX: 0x0000000000000020, RDX: 0xffffff80a72a3950
RSP: 0xffffff80a72a3200, RBP: 0xffffff80a72a3370, RSI: 0xffffff90b0094ae8, RDI: 0xffffff80a72a3408
R8:  0xffffff90b1f26800, R9:  0xffffff90b1957d40, R10: 0x0000000000000020, R11: 0x0000000000000000
R12: 0x0000000000000002, R13: 0x0000000000000020, R14: 0xffffff80a72a3280, R15: 0x0000000000000000
RFL: 0x0000000000010246, RIP: 0xffffff7f91292911, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x0000000000000010, Error code: 0x0000000000000000, Fault CPU: 0x1 VMM, PL: 0, VF: 1

-> 2156                 (*(isalc_ops.igo_enc_update[impl][keylen]))(
   2157                     ctx, ct_buf, datap, bleft);

(lldb) p/x ct_buf
(uint8_t *) $5 = 0xffffff90b0094ae8 "\xe1q\xbd\xe2\x99\xc4\xcc\U0000001a<f\U0000

       rsi = 0xffffff90b0094ae8

(lldb) x/40 0xffffff90b0094ae8
0xffffff90b0094ae8: 0xe2bd71e1 0x1accc499 0x7c18663c 0x1f2226c6
0xffffff90b0094af8: 0xb0094d49 0xffffff90 0xf310f1e5 0x00a6389e

looks like rsi is correct, and I can read the memory.

lundman avatar Mar 01 '23 01:03 lundman

D'oh, I was a zombie yesterday, no sleep no brain.

No it really is gone:

Looks like a compiler bug.

add 0x10, \DATA_OFFSET

That adds the contents at memory location 0x10 to the %r11 register. That will of course panic with a page fault. Could you try add $0x10, \DATA_OFFSET?

(I'm assuming that on macOS $ also indicates an immediate value.)
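
For illustration, in AT&T syntax:

	add	0x10, %r11	// memory operand: adds the 8 bytes at address 0x10 to %r11
	add	$0x10, %r11	// immediate operand: adds the constant 16 to %r11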

AttilaFueloep avatar Mar 01 '23 05:03 AttilaFueloep

OK, add $0x10, \DATA_OFFSET doesn't work, but

                add     $(0x10), \DATA_OFFSET

    47cd: 49 83 c3 10                   addq    $16, %r11

Now it crashes further down - lemme walk it until it doesn't crash.

On the $1 thing, closest I found was

AsmParser.cpp

bool AsmParser::expandMacro()

      if (IsDarwin && !NParameters) {
        // This macro has no parameters, look for $0, $1, etc.

lundman avatar Mar 01 '23 06:03 lundman

Ah no, that was the only change needed:

diff --git a/module/icp/asm-x86_64/modes/isalc_gcm_sse.S b/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
index d8bc585fc..c19443687 100644
--- a/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
+++ b/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
@@ -641,7 +641,8 @@ movdqu      16*j(\KEYSCHED), \T_key         // encrypt with last (14th) key round (12 for GC
                XLDR    (\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
                pxor    \T1, xmmi
                XSTR    xmmi, (\CYPH_PLAIN_OUT, \DATA_OFFSET)   // write back ciphertext for \num_initial_blocks blocks
-               add     $16, \DATA_OFFSET
+               add     $(0x10), \DATA_OFFSET
                .ifc \ENC_DEC, DEC
                movdqa  \T1, xmmi
                .endif
(random file copied and verified)
md5 sierra-fix.patch /Volumes/BOOM/gcm/sierra-fix.patch
MD5 (sierra-fix.patch) = cffd2656b57182f0d6d2031e2f545d3f
MD5 (/Volumes/BOOM/gcm/sierra-fix.patch) = cffd2656b57182f0d6d2031e2f545d3f

lundman avatar Mar 01 '23 06:03 lundman

Great! I'll add the workaround.

AttilaFueloep avatar Mar 01 '23 06:03 AttilaFueloep

Windows compile:

C:\src\openzfs\out\build\x64-Debug\clang -cc1as : fatal error : error in backend: unable to evaluate offset for variable 'breg'

lundman avatar Mar 01 '23 06:03 lundman

So now that's a real problem, not easily fixed. Are both the macOS and Windows clangs at the same version?

AttilaFueloep avatar Mar 01 '23 07:03 AttilaFueloep

Not at all; macOS is Apple's clang:

Apple clang version 12.0.0 (clang-1200.0.32.29) Target: x86_64-apple-darwin19.6.0

and Windows is clang version 12.0.0 Target: i686-pc-windows-msvc

Note that clang-cl.exe makes it run in Windows-compatibility mode, so a bunch of things behave more like Windows. I'll have to check if that includes macros.

lundman avatar Mar 01 '23 07:03 lundman

OK replacing clang-cl.exe with just clang.exe did not make any difference.

lundman avatar Mar 01 '23 07:03 lundman

I'd bet that if you comment out the breg macro use, it will trip over the xmmi macro. And the xmmi one is crucial; without it one would need to expand all macros, yielding an unreadable ~10 kloc assembly file. As a last resort one could use the original nasm source on Windows, but that would be a bit of a nightmare to support.

AttilaFueloep avatar Mar 01 '23 07:03 AttilaFueloep

The llvm guys said:

The issue is that the code uses the breg symbol like a macro, but on Windows the assembler also tries to emit the symbol itself, which fails because its value is a register.

I don’t know why this is different from non-Windows.

In any case, you can sneak past this error by using the private prefix “.L” on the symbol, so .Lbreg instead of just breg.

I'll see if I can figure out what they mean

lundman avatar Mar 01 '23 21:03 lundman

Well, as far as I understand it, just replace breg with .Lbreg in the macro .macro bytereg reg in isalc_reg_sizes.S (line 88 ...) and do the same in isalc_gcm_sse.S (lines 357 and 370). Sorry, but I currently don't have the time to test it here.

AttilaFueloep avatar Mar 01 '23 21:03 AttilaFueloep

Yep, breg no longer complains in my test.S; I'll try it in ZFS as well (and the other one).

I wonder if we can #ifdef _WIN32 around it too, so other platforms don't need to see it.

lundman avatar Mar 01 '23 22:03 lundman

Not sure I see how the xmmreg macros are generated; I'll play with it.

lundman avatar Mar 01 '23 22:03 lundman

I don't think that tacking a .L in front of the symbol would impair readability more than #ifdef'ing it. Rather, it would make it clearer what we are doing here. The only need I see for #ifdef'ing would be macOS since, if I recall correctly, macOS chokes on .L symbols; correct me if I'm wrong.

The xmmreg macro is in isalc_gcm_defines.S. What it basically does is generate an xmm register name from the given symbol and its value. .set i, 1; xmmreg i, %i would set the symbol xmmi to %xmm1; .set foo, 5; xmmreg foo, %foo would set xmmfoo to %xmm5. (In altmacro mode, %sym expands to the value of sym.)
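
A tiny self-contained illustration of that altmacro evaluation (not the PR's actual xmmreg macro, just the %-expansion it relies on):

	.altmacro
	.macro	clearxmm num		// expands to: pxor %xmm<num>, %xmm<num>
	pxor	%xmm\num, %xmm\num
	.endm

	.set	i, 3
	clearxmm %i			// %i evaluates to 3, so this emits pxor %xmm3, %xmm3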

AttilaFueloep avatar Mar 02 '23 12:03 AttilaFueloep