[WIP] ICP: AES_GCM: add support for SSE4.1 acceleration
Motivation and Context
Small NAS solutions are often based on hardware without AVX support, where encryption performs subpar. Most of them do support SSE4.1, though. Supporting SSE4.1 will give them a massive boost in aes-gcm crypto performance.
Description
This is still a draft, a detailed description will be added once this is ready for merging.
Closes #10846
How Has This Been Tested?
Inside a VM, two fio invocations on encrypted datasets and mprime -t were run in parallel. Various other tests were done as well. Finally, this PR contains debugging code which runs the AVX and SSE implementations on the same context and compares the results (see the DEBUG_GCM_ASM #define).
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [x] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain Signed-off-by.
While it's still a draft, I took care to squash the commits in a way which should make reviewing this straightforward.
I did quite some testing without revealing any problems, but at this point I'd appreciate any help with testing on a wider range of systems. If you need any help on how to test this, please drop a note here and I'll instruct. If you want to test this, in any case please do it in a test VM with a test pool, since this code could permanently corrupt any pools accessed from it.
There's still some cleanup and some minor refinements left, so this is a draft for now. I'm also wondering if I should keep the debugging code in the #ifdef DEBUG_GCM_ASM conditionals in the final PR. I'm leaning towards it, since it would ease development of the planned follow-ups for AVX2 and maybe VAES. They could be removed once the dust settles and the code gets good usage coverage.
The people at iXsystems may be interested in testing this too. Surely they'll have far better test infrastructure for this than I do.
@lundman I made a major effort to keep this clang compatible on Linux. If you have a spare hour, you may want to test whether it works on macOS and Windows as well.
One important point regarding testing I missed above: if you are testing this on hardware which has AVX support, you need to activate the SSE implementation with
# echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl
or by setting it at module load time.
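Before switching, it may help to confirm the box advertises the flag at all. A Linux-only sketch (the flag name is the one the kernel exports in /proc/cpuinfo; reading the module parameter requires the zfs module to be loaded):

```shell
# Does the CPU advertise SSE4.1?
grep -qw sse4_1 /proc/cpuinfo && echo "cpu: sse4_1 present" || echo "cpu: sse4_1 missing"
# Which GCM implementations the loaded zfs module offers, and which is selected:
cat /sys/module/zfs/parameters/icp_gcm_impl 2>/dev/null || true
```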
Excellent, I will give it a go on macOS and Windows.
Thank you a lot for your work!
I have a ReadyNAS 426 with an Intel(R) Atom(TM) CPU C3538 @ 2.10GHz, which has no AVX but does have SSE3, SSE4.1 & SSE4.2. I'd be willing to test your implementation.
I have 6 × 8TB drives in ZFS for storage (ZFS directly uses the disks and autoformatted them to GPT: 7.3T ZFS + 8M reserved) and 2 × 64GB High Endurance microSD cards in RAID running Debian bookworm.
Could you please provide instructions on how I can best test this PR?
I did a light fio run on a small Celeron J3455 [Goldmont] board with SSE4.2 but no AVX. Using a single-vdev tmpfs-backed pool, no compression:
- generic: ~3.7MB/s
- pclmulqdq: ~6.1MB/s
- sse4_1: ~13.2MB/s
- no encryption: ~17MB/s
fio was: fio --name=xxx --size=4K --bs=4K --rw=randrw --ioengine=psync --sync=1 --iodepth=32 --numjobs=1 --direct=1 --end_fsync=1 --gtod_reduce=1 --time_based --runtime=60
Hardly science, but those initial numbers seem pretty good. Strong work!
Thank you @AttilaFueloep, I too have very promising results on a C3558 system running Debian Bullseye under a KVM hypervisor with a WD Black SN750 NVMe disk.
In my tests below I just used dd to read a big file, with the ARC size limited to 500MB.
Unencrypted read speed was 909MB/s, the old pclmulqdq implementation did 86MB/s, and the new sse4_1 implementation increased that to 670MB/s! With compression enabled, the speed stayed roughly the same. For reference, FreeBSD achieved roughly similar speed (660MB/s) in my previous tests on a similar setup.
# uname -a
Linux labdebian 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
# echo 524288000 >/sys/module/zfs/parameters/zfs_arc_max
# zfs create -o compression=off test/noenc
# zfs create -o encryption=on -o keyformat=passphrase -o compression=on test/enc2
# zfs create -o encryption=on -o keyformat=passphrase -o compression=off test/enc3
# cp /mnt/media/test.tar /test/noenc/
# cp /mnt/media/test.tar /test/enc3/
# dd if=/test/noenc/test.tar bs=128k of=/dev/null status=progress
6319505408 bytes (6.3 GB, 5.9 GiB) copied, 7 s, 903 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 7.54886 s, 909 MB/s
# echo generic > /sys/module/zfs/parameters/icp_gcm_impl
# dd if=/test/enc3/test.tar bs=128k of=/dev/null status=progress
6840385536 bytes (6.8 GB, 6.4 GiB) copied, 189 s, 36.2 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 189.506 s, 36.2 MB/s
# echo pclmulqdq > /sys/module/zfs/parameters/icp_gcm_impl
# dd if=/test/enc3/test.tar bs=128k of=/dev/null status=progress
6820200448 bytes (6.8 GB, 6.4 GiB) copied, 79 s, 86.3 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 79.4401 s, 86.3 MB/s
# echo sse4_1 > /sys/module/zfs/parameters/icp_gcm_impl
# dd if=/test/enc3/test.tar bs=128k of=/dev/null status=progress
6715080704 bytes (6.7 GB, 6.3 GiB) copied, 10 s, 672 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 10.1724 s, 674 MB/s
# dd if=/test/enc2/test.tar bs=128k of=/dev/null status=progress
6629883904 bytes (6.6 GB, 6.2 GiB) copied, 10 s, 663 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 10.2971 s, 666 MB/s
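As a rough cross-check of the ratios in these runs, a small awk sketch (the MB/s figures are copied from the final dd lines above):

```shell
awk 'BEGIN {
    # final dd rates from the transcript above, in MB/s
    unenc = 909; generic = 36.2; pclmul = 86.3; sse = 674
    printf "sse4_1 vs generic:     %.1fx\n", sse / generic
    printf "sse4_1 vs pclmulqdq:   %.1fx\n", sse / pclmul
    printf "sse4_1 vs unencrypted: %.0f%%\n", 100 * sse / unenc
}'
```

So the SSE4.1 path recovers roughly three quarters of the unencrypted read speed on this box.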
@oberien Well, I'd suggest the following:
Spin up a VM which has all the dependencies for building ZFS installed, but not the ZFS packages themselves.
Inside the VM do:
git clone https://github.com/openzfs/zfs/
cd zfs
git pull origin pull/14531/head
./autogen.sh && ./configure --prefix=/ --enable-debuginfo --enable-debug && make
Open a new root terminal inside the VM since for me the installation step failed with sudo.
cd <dir where you built zfs>
make install
modprobe zfs
echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl
zpool create ...
zfs create -o encryption= ...
Run your favorite workloads and benchmark tools.
I tested using an 8GB Ramdisk both on a Debian host (shipped ZFS) and in a Debian VM (zfs compiled from this PR).
Interestingly, ./configure states:
checking whether host toolchain supports AVX... yes
checking whether host toolchain supports AVX2... yes
checking whether host toolchain supports AVX512F... yes
Setup VM (on Debian):
sudo apt install --no-install-recommends qemu-system libvirt-clients libvirt-daemon-system virtinst dnsmasq
virsh net-start default
virt-install --virt-type kvm --name bullseye-amd64 \
--location http://deb.debian.org/debian/dists/bullseye/main/installer-amd64/ \
--os-variant debian11 \
--disk size=10 --memory 10000 \
--graphics none \
--console pty,target_type=serial \
--extra-args "console=ttyS0"
Setup ZFS:
sudo apt install build-essential git automake libtool gettext zlib1g-dev uuid-dev libblkid-dev libssl-dev linux-headers-$(uname -r) libaio-dev libudev-dev
git config --global pull.rebase true
git config --global user.email [email protected]
git clone https://github.com/openzfs/zfs/
cd zfs
git pull origin pull/14531/head
./autogen.sh && ./configure --prefix=/ --enable-debuginfo --enable-debug && make
make install
depmod
modprobe zfs
echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl
Test code:
zfs --version
mkdir /tmp/tmpfs
mount -t tmpfs none /tmp/tmpfs
truncate -s 8G /tmp/tmpfs/test
zpool create -o ashift=12 -O mountpoint=none test /tmp/tmpfs/test
# without encryption
zfs create -o atime=off -o compression=off -o mountpoint=/test test/test
dd if=/dev/zero of=/test/test bs=1M count=4k
zpool export test && zpool import -d /tmp/tmpfs/test test -l
dd if=/test/test of=/dev/null bs=1M count=4096
zfs destroy test/test
# with encryption
dd if=/dev/urandom of=/tmp/test_key bs=1 count=32
zfs create -o encryption=aes-256-gcm -o keyformat=raw -o keylocation=file:///tmp/test_key -o atime=off -o compression=off -o mountpoint=/test test/test
dd if=/dev/zero of=/test/test bs=1M count=4k
zpool export test && zpool import -d /tmp/tmpfs/test test -l
dd if=/test/test of=/dev/null bs=1M count=4096
zfs destroy test/test
# cleanup
zpool destroy test
rmdir /test
rm /tmp/test_key
rm /tmp/tmpfs/test
umount /tmp/tmpfs
rmdir /tmp/tmpfs
Results:
For comparison `perf bench mem memcpy`:
Host: 2.64GB/s
Guest: 1.93GB/s
Output host unencrypted:
zfs-2.1.9-1
zfs-kmod-2.1.7-1
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 7.29287 s, 589 MB/s
read: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.66974 s, 920 MB/s
Output host encrypted:
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 27.5948 s, 156 MB/s
read: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 22.4259 s, 192 MB/s
Output guest unencrypted:
zfs-2.1.99-1750_g57314b5ec
zfs-kmod-2.1.99-1750_g57314b5ec
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 9.31044 s, 461 MB/s
read: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 7.15297 s, 600 MB/s
Output guest encrypted:
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 14.0712 s, 305 MB/s
read: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 15.1526 s, 283 MB/s
tldr: Host without encryption is about 150% as fast as the VM. Host with encryption is only about 50% as fast as the VM. Assuming the unencrypted difference is due to VM overhead, I'd estimate host encryption with this patch to be in the ballpark of ~3× as fast as before.
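One way to make that ballpark concrete: scale the guest's encrypted write rate by the unencrypted host/guest ratio. A back-of-envelope awk sketch, assuming VM overhead is the same factor for both workloads (numbers taken from the write figures above):

```shell
awk 'BEGIN {
    host_unenc = 589; guest_unenc = 461     # unencrypted write MB/s (host, guest)
    host_enc = 156; guest_enc = 305         # encrypted write MB/s (host, guest)
    est = guest_enc * host_unenc / guest_unenc
    printf "estimated host encrypted write with this PR: %.0f MB/s (%.1fx over %d MB/s)\n",
           est, est / host_enc, host_enc
}'
```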
@oberien
Interestingly, ./configure states: checking whether host toolchain supports AVX... yes checking whether host toolchain supports AVX2... yes checking whether host toolchain supports AVX512F... yes

configure just checks whether your toolchain has support for the SIMD instructions. Whether the CPU actually has them is checked at runtime (e.g. zfs_avx_available()).
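The distinction can be reproduced by hand. A sketch (the /tmp file names are made up, and the compile probe only mirrors what configure does, not its exact test):

```shell
# What ./configure asks: can the toolchain emit AVX code?
echo 'int main(void) { return 0; }' > /tmp/avx_probe.c
cc -mavx -c /tmp/avx_probe.c -o /tmp/avx_probe.o 2>/dev/null \
    && echo "toolchain: AVX supported" || echo "toolchain: no AVX"
# What the runtime check asks: does this CPU actually have AVX?
grep -qw avx /proc/cpuinfo 2>/dev/null \
    && echo "cpu: AVX supported" || echo "cpu: no AVX"
```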
macOS changes to compile:
https://github.com/openzfsonosx/openzfs-fork/commit/df3931c27302acbe2f59a0b9e72d5e474566e968
The FN_NAME fix is a quick hack, we should discuss that sometime.
# sysctl kstat.zfs.darwin.tunable.icp_gcm_impl=sse4_1
kstat.zfs.darwin.tunable.icp_gcm_impl: cycle [fastest] sse4_1 generic pclmulqdq -> cycle fastest [sse4_1] generic pclmulqdq
# ./zfs create -o encryption=aes-256-gcm -o keyformat=passphrase BOOM/gcm
Enter new passphrase:
Re-enter new passphrase:
panic
0xffffff801e72b0f0 : 0xffffff801678ba40 mach_kernel : trap_from_kernel + 0x26
0xffffff801e72b110 : 0xffffff7f99e92947 org.openzfsonosx.zfs : _initial_num_blocks_is_2_34 + 0x15c
0xffffff801e72b370 : 0xffffff7f99d51d56 org.openzfsonosx.zfs : _aes_encrypt_contiguous_blocks + 0x36
0xffffff801e72b390 : 0xffffff7f99d3a1c1 org.openzfsonosx.zfs : _crypto_update_uio + 0x251
0xffffff801e72b3f0 : 0xffffff7f99d3ac21 org.openzfsonosx.zfs : _aes_encrypt_atomic + 0x181
0xffffff801e72b510 : 0xffffff7f99d387cd org.openzfsonosx.zfs : _crypto_encrypt + 0xdd
0xffffff801e72b590 : 0xffffff7f99d1b880 org.openzfsonosx.zfs : _zio_do_crypt_uio + 0x1c0
0xffffff801e72b6b0 : 0xffffff7f99d1b693 org.openzfsonosx.zfs : _zio_crypt_key_wrap + 0x153
frame #0: 0xffffff7f99e92947 zfs`initial_num_blocks_is_2_34 at isalc_gcm_sse.S:2342
(lldb) disass
zfs`initial_num_blocks_is_2_34:
0xffffff7f99e927eb <+0>: movdqu %xmm8, %xmm6
0xffffff7f99e927f0 <+5>: movdqu 0x10(%rdi), %xmm9
0xffffff7f99e927f6 <+11>: paddd 0xbd231(%rip), %xmm9 ; ZERO + 14
0xffffff7f99e927ff <+20>: movdqa %xmm9, %xmm7
0xffffff7f99e92804 <+25>: pshufb 0xbd1b3(%rip), %xmm7 ; TWOONE + 62
0xffffff7f99e9280d <+34>: paddd 0xbd21a(%rip), %xmm9 ; ZERO + 14
0xffffff7f99e92816 <+43>: movdqa %xmm9, %xmm8
0xffffff7f99e9281b <+48>: pshufb 0xbd19b(%rip), %xmm8 ; TWOONE + 62
0xffffff7f99e92825 <+58>: movdqu (%r8), %xmm0
0xffffff7f99e9282a <+63>: pxor %xmm0, %xmm7
...
0xffffff7f99e92936 <+331>: movdqu (%rdx,%r11), %xmm12
0xffffff7f99e9293c <+337>: pxor %xmm12, %xmm7
0xffffff7f99e92941 <+342>: movdqu %xmm7, (%rsi,%r11)
-> 0xffffff7f99e92947 <+348>: addq 0x6, %r11
0xffffff7f99e9294f <+356>: pshufb 0xbd068(%rip), %xmm7 ; TWOONE + 62
0xffffff7f99e92958 <+365>: movdqu (%rdx,%r11), %xmm12
0xffffff7f99e9295e <+371>: pxor %xmm12, %xmm8
show registers
(lldb) register read -all
General Purpose Registers:
rax = 0xffffff7f99f5adf0 zfs`isalc_ops
rbx = 0xffffff90c8094ae8
rcx = 0x0000000000000020
rdx = 0xffffff801e72b950
rdi = 0xffffff801e72b408
rsi = 0xffffff90c8094ae8
rbp = 0xffffff801e72b370
rsp = 0xffffff801e72b200
r8 = 0xffffff90c9c77a00
r9 = 0xffffff90c96f7840
r10 = 0x0000000000000020
r11 = 0x0000000000000000
r12 = 0x0000000000000002
r13 = 0x0000000000000020
r14 = 0xffffff801e72b280
r15 = 0x0000000000000000
rip = 0xffffff7f99e92947 zfs`initial_num_blocks_is_2_34 + 348
rflags = 0x0000000000010246
cs = 0x0000000000000008
fs = 0x00000000ffff0000
gs = 0x000000001e720000
Floating Point Registers:
fcw = 0x0000
fsw = 0x0000
ftw = 0x00
fop = 0x0000
ip = 0x00000000
cs = 0x0000
dp = 0x00000000
ds = 0x0000
mxcsr = 0x00000000
mxcsrmask = 0x00000000
stmm0 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
stmm1 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
stmm2 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
stmm3 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
stmm4 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
stmm5 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
stmm6 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
stmm7 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm0 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm1 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm2 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm3 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm4 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm5 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm6 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm7 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm8 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm9 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm10 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm11 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm12 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm13 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm14 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
xmm15 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
Exception State Registers:
3 registers were unavailable.
@lundman Thanks for testing on macOS.
The odd thing I spotted is this:
0xffffff7f99e92947 <+348>: addq 0x6, %r11
^^^
That should read
0xffffff7f99e92947 <+348>: addq $0x10, %r11
EDIT:
Removed a lot of nonsense I wrote sleep deprived and added what I should've written in the first place.
Oh, I assumed it was because r11 is NULL, and addq 0x6, %r11 would dereference it, but I admit I didn't stop and think whether it even did dereference.
That should read 0x10, not 0x06 ($16 in the code, so maybe the 1 got lost?).
No it really is gone:
add $16, \DATA_OFFSET
# llvm-objdump --disassemble ./module/icp/asm-x86_64/modes/macos_zfs-isalc_gcm256_sse.o
4827: 4c 03 1c 25 06 00 00 00 addq 6, %r11
add 0x10, \DATA_OFFSET
4827: 4c 03 1c 25 10 00 00 00 addq 16, %r11
Could it be $1 is eaten by the macro definition? Just add 16, %r11 works, so possibly .rept expects the $i style notation.
More on panic
panic(cpu 1 caller 0xffffff800dd69aef): Kernel trap at 0xffffff7f91292911, type 14=page fault, registers:
CR0: 0x0000000080010033, CR2: 0x0000000000000010, CR3: 0x0000000011a69000, CR4: 0x00000000003606e0
RAX: 0xffffff7f9135adf0, RBX: 0xffffff90b0094ae8, RCX: 0x0000000000000020, RDX: 0xffffff80a72a3950
RSP: 0xffffff80a72a3200, RBP: 0xffffff80a72a3370, RSI: 0xffffff90b0094ae8, RDI: 0xffffff80a72a3408
R8: 0xffffff90b1f26800, R9: 0xffffff90b1957d40, R10: 0x0000000000000020, R11: 0x0000000000000000
R12: 0x0000000000000002, R13: 0x0000000000000020, R14: 0xffffff80a72a3280, R15: 0x0000000000000000
RFL: 0x0000000000010246, RIP: 0xffffff7f91292911, CS: 0x0000000000000008, SS: 0x0000000000000010
Fault CR2: 0x0000000000000010, Error code: 0x0000000000000000, Fault CPU: 0x1 VMM, PL: 0, VF: 1
-> 2156 (*(isalc_ops.igo_enc_update[impl][keylen]))(
2157 ctx, ct_buf, datap, bleft);
(lldb) p/x ct_buf
(uint8_t *) $5 = 0xffffff90b0094ae8 "\xe1q\xbd\xe2\x99\xc4\xcc\U0000001a<f\U0000
rsi = 0xffffff90b0094ae8
(lldb) x/40 0xffffff90b0094ae8
0xffffff90b0094ae8: 0xe2bd71e1 0x1accc499 0x7c18663c 0x1f2226c6
0xffffff90b0094af8: 0xb0094d49 0xffffff90 0xf310f1e5 0x00a6389e
Looks like rsi is correct, and I can read the memory.
D'oh, I was a zombie yesterday, no sleep no brain.
No it really is gone:
Looks like a compiler bug.
add 0x10, \DATA_OFFSET
That adds the contents of memory location 0x10 to the %r11 register, which will of course panic with a page fault. Could you try add $0x10, \DATA_OFFSET instead? (I'm assuming that on macOS, $ also indicates an immediate value.)
OK, add $0x10, \DATA_OFFSET doesn't work, but add $(0x10), \DATA_OFFSET does:
47cd: 49 83 c3 10 addq $16, %r11
Now it crashes further down - lemme walk it until it doesn't crash.
On the $1 thing, the closest I found was in AsmParser.cpp, in bool AsmParser::expandMacro():
if (IsDarwin && !NParameters) {
// This macro has no parameters, look for $0, $1, etc.
Ah no, that was the only change needed:
diff --git a/module/icp/asm-x86_64/modes/isalc_gcm_sse.S b/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
index d8bc585fc..c19443687 100644
--- a/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
+++ b/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
@@ -641,7 +641,8 @@ movdqu 16*j(\KEYSCHED), \T_key // encrypt with last (14th) key round (12 for GC
XLDR (\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
pxor \T1, xmmi
XSTR xmmi, (\CYPH_PLAIN_OUT, \DATA_OFFSET) // write back ciphertext for \num_initial_blocks blocks
- add $16, \DATA_OFFSET
+ add $(0x10), \DATA_OFFSET
.ifc \ENC_DEC, DEC
movdqa \T1, xmmi
.endif
(random file copied and verified)
md5 sierra-fix.patch /Volumes/BOOM/gcm/sierra-fix.patch
MD5 (sierra-fix.patch) = cffd2656b57182f0d6d2031e2f545d3f
MD5 (/Volumes/BOOM/gcm/sierra-fix.patch) = cffd2656b57182f0d6d2031e2f545d3f
Great! I'll add the workaround.
Windows compile:
C:\src\openzfs\out\build\x64-Debug\clang -cc1as : fatal error : error in backend: unable to evaluate offset for variable 'breg'
So now that's a real problem, not easily fixed. Are both the macOS and Windows clangs at the same version?
Not at all; macOS is Apple's clang:
Apple clang version 12.0.0 (clang-1200.0.32.29) Target: x86_64-apple-darwin19.6.0
and Windows is
clang version 12.0.0 Target: i686-pc-windows-msvc
Note that clang-cl.exe makes it run in Windows-compatibility mode, so a bunch of things behave more like Windows. I'll have to check if that includes macros.
OK, replacing clang-cl.exe with just clang.exe did not make any difference.
I'd bet that if you comment out the breg macro use, it will trip over the xmmi macro. And the xmmi one is crucial; without it, one would need to expand all macros, yielding an unreadable ~10kloc assembly file. As a last resort one could use the original nasm source on Windows, but that would be a bit of a nightmare to support.
The llvm guys said:
The issue is that the code uses the breg symbol like a macro, but on Windows the assembler also tries to emit the symbol itself, which fails because its value is a register.
I don’t know why this is different from non-Windows.
In any case, you can sneak past this error by using the private prefix “.L” on the symbol, so .Lbreg instead of just breg.
I'll see if I can figure out what they mean
Well, as far as I understand it, just replace breg with .Lbreg in the macro .macro bytereg reg in isalc_reg_sizes.S (line 88 ...) and do the same in isalc_gcm_sse.S (lines 357 and 370). Sorry, but I currently don't have the time to test it here.
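That the .L prefix keeps a symbol out of the object file can be checked with a two-line experiment (a sketch with made-up names; needs GNU as and nm):

```shell
cat > /tmp/lsym.s <<'EOF'
.text
visible: nop        /* ordinary label: emitted into the symbol table */
.Lhidden: nop       /* .L prefix: assembler-local, never emitted */
EOF
as /tmp/lsym.s -o /tmp/lsym.o 2>/dev/null \
    && nm /tmp/lsym.o || echo "needs binutils"
```

nm should list visible but show no trace of .Lhidden.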
Yep, breg no longer complains in my test.S; I'll try it in ZFS as well (and the other one).
Wonder if we can #ifdef _WIN32 around it too, so other platforms don't need to see it.
Not sure I see how the xmmreg macros are generated; I'll play with it.
I don't think that tacking a .L in front of the symbol would impair readability more than #ifdef'ing it. Rather, it would make clearer what we are doing here. The only need I see for #ifdef'ing would be macOS since, if I recall correctly, macOS chokes on .L symbols; correct me if I'm wrong.
The xmmreg macro is in isalc_gcm_defines.S. What it basically does is generate an xmm register name from a symbol and its value: .set i, 1; xmmreg i, %i would set the symbol xmmi to %xmm1, and .set foo, 5; xmmreg foo, %foo would set xmmfoo to %xmm5. (In altmacro mode, %sym expands to the value of sym.)
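That % expansion can be seen in isolation with a tiny sketch (made-up macro and symbol names; needs GNU as on the path):

```shell
cat > /tmp/alt.s <<'EOF'
.altmacro
.macro mkequ name, num
        .equ sym_\name, \num    /* defines e.g. sym_i */
.endm
.set i, 1
/* %i evaluates the symbol i, so the call below expands as: mkequ i, 1 */
mkequ i, %i
EOF
as /tmp/alt.s -o /tmp/alt.o 2>/dev/null && echo assembled || echo "needs GNU as"
```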