Detect SHA extensions

Open cybojanek opened this issue 2 years ago • 26 comments

Detect SHA CPU extensions

Motivation and Context

Detect and use SHA / vector CPU extensions in order to optimize checksum calculations.

Description

  • Add SHA extension detection
  • Add icp algorithm selector
  • Use selector with existing sha2 code
  • Add fast implementations
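
To illustrate the detection step in the list above, here is a minimal userspace sketch and not the PR's actual code. It assumes GCC or Clang on x86 (for `<cpuid.h>`); the `sha_cpu_features` struct and `detect_sha_features()` are hypothetical names. It reads only the CPUID feature bits and skips the XGETBV check a real AVX user would also need.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/Clang helper for the CPUID instruction */
#endif

/* Hypothetical result type for this sketch; not taken from the PR. */
struct sha_cpu_features {
	bool ssse3;
	bool avx;
	bool avx2;
	bool sha_ni;
};

/*
 * Query CPUID for the extensions the SHA code paths care about.
 * On non-x86 builds every flag stays false.
 * Simplification: AVX/AVX2 also require the OS to enable the extended
 * register state (OSXSAVE + XGETBV), which is not checked here.
 */
struct sha_cpu_features
detect_sha_features(void)
{
	struct sha_cpu_features f = { false, false, false, false };
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* Leaf 1: SSSE3 is ECX bit 9, AVX is ECX bit 28. */
	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
		f.ssse3 = (ecx >> 9) & 1;
		f.avx = (ecx >> 28) & 1;
	}
	/* Leaf 7, subleaf 0: AVX2 is EBX bit 5, SHA is EBX bit 29. */
	if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
		f.avx2 = (ebx >> 5) & 1;
		f.sha_ni = (ebx >> 29) & 1;
	}
#endif
	return (f);
}
```

A selector would call `detect_sha_features()` once at module load and only offer the implementations whose flag is set.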

How Has This Been Tested?

  • Compiled on ArchLinux

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [ ] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

cybojanek avatar Sep 09 '21 00:09 cybojanek

Hi - I see some of CI is failing.

I don't think these failures are related to my changes, since this code is not even used at runtime.

Please tell me if I should look into the failures.

cybojanek avatar Sep 13 '21 13:09 cybojanek

@cybojanek Awesome, really looking forward to the referenced improvements.

I just want to link some old but never finished PRs related to SHA (not sure if they still apply to current OpenZFS though):

  • Multi-buffer sha256 support in SPL to ZFS (https://github.com/openzfs/spl/pull/646)
  • sha256 x86_64 optimization v2 (https://github.com/openzfs/zfs/pull/2351)

jumbi77 avatar Sep 18 '21 19:09 jumbi77

Thanks for the links to the previous issues.

Just posting here that I'm still working on this issue.

(No ETA - busy with family/work)

cybojanek avatar Sep 22 '21 00:09 cybojanek

Hi - I see some of CI is failing.

I don't think these failures are related to my changes, since this code is not even used at runtime.

Please tell me if I should look into the failures.

Thanks for working on this,

The test failures are known failures, i.e. unrelated to your changes. I'll look into the build failure some more.

tonynguien avatar Sep 30 '21 15:09 tonynguien

Thanks for the links to the previous issues.

Just posting here that I'm still working on this issue.

(No ETA - busy with family/work)

I labeled the PR as "Work in Progress" and will update the status once you give the go-ahead.

tonynguien avatar Sep 30 '21 15:09 tonynguien

Some performance numbers using an EC2 m6i.xlarge instance

echo x86_64 > /sys/module/icp/parameters/icp_sha256_impl

modprobe brd rd_nr=1 rd_size=$((12288 * 1024))

zpool create -f -o ashift=12 \
    -O acltype=posixacl \
    -O relatime=on \
    -O xattr=sa \
    -O dnodesize=legacy \
    -O normalization=formD \
    -O devices=off \
    -O compression=off \
    -O checksum=sha256 \
    zscratch /dev/ram0

dd if=/dev/urandom of=/zscratch/data.bin bs=1M count=12000 status=progress conv=fdatasync
zpool export zscratch

for X in generic x86_64 sha-avx sha-ssse3 sha-ni; do
        echo $X > /sys/module/icp/parameters/icp_sha256_impl
        sleep 1
        cat /sys/module/icp/parameters/icp_sha256_impl

        zpool import zscratch
        echo ""
        dd if=/zscratch/data.bin of=/dev/null bs=1M status=progress
        zpool export zscratch
done
cycle fastest [generic] x86_64 sha-avx sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 27.6342 s, 433 MB/s

cycle fastest generic [x86_64] sha-avx sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 21.6019 s, 553 MB/s

cycle fastest generic x86_64 [sha-avx] sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 18.1928 s, 657 MB/s

cycle fastest generic x86_64 sha-avx [sha-ssse3] sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 18.7719 s, 637 MB/s

cycle fastest generic x86_64 sha-avx sha-ssse3 [sha-ni]
11952848896 bytes (12 GB, 11 GiB) copied, 6.21747 s, 1.9 GB/s

I also did a similar thing with scrub: zpool export, change algorithm, zpool import, zpool scrub:

x86_64
  scan: scrub repaired 0B in 00:00:21 with 0 errors on Mon Nov 15 01:52:34 2021

sha-avx
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Mon Nov 15 01:53:35 2021

sha-ssse3
  scan: scrub repaired 0B in 00:00:19 with 0 errors on Mon Nov 15 01:54:30 2021

sha-ni
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Mon Nov 15 01:55:03 2021
root@ip-172-31-40-243:/home/ubuntu/zfs# cat /proc/spl/kstat/zfs/sha256_bench
4 0 0x01 -1 0 1431694018884 3006832037321
implementation   bytes/second   
fastest          1336724300     
generic          239174080      
x86_64           330923150      
sha-avx          326477949      
sha-ssse3        323231027      
sha-ni           1336724300     
root@ip-172-31-40-243:/home/ubuntu/zfs# cat /proc/spl/kstat/zfs/sha512_bench
5 0 0x01 -1 0 1431706970891 3009816103445
implementation   bytes/second   
fastest          570998846      
generic          365103572      
x86_64           493050536      
sha-avx          516670214      
sha-avx2         570998846      
sha-ssse3        474291080      
root@ip-172-31-40-243:/home/ubuntu/zfs# 
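
In the sha256 table above, the "fastest" row carries the same figure as the best-scoring implementation (sha-ni). A rough sketch of how such a selector could work, using the numbers from that table; `bench_result`, `pick_fastest()` and `select_impl()` are hypothetical names and not the icp module's API:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row of the benchmark table (hypothetical type for this sketch). */
struct bench_result {
	const char *name;
	uint64_t bytes_per_second;
};

/* Figures copied from the sha256_bench output above. */
const struct bench_result sha256_bench[] = {
	{ "generic",	239174080ULL },
	{ "x86_64",	330923150ULL },
	{ "sha-avx",	326477949ULL },
	{ "sha-ssse3",	323231027ULL },
	{ "sha-ni",	1336724300ULL },
};
const size_t sha256_bench_count =
    sizeof (sha256_bench) / sizeof (sha256_bench[0]);

/* Return the entry with the highest throughput, or NULL if count is 0. */
const struct bench_result *
pick_fastest(const struct bench_result *results, size_t count)
{
	const struct bench_result *best = NULL;

	for (size_t i = 0; i < count; i++) {
		if (best == NULL ||
		    results[i].bytes_per_second > best->bytes_per_second)
			best = &results[i];
	}
	return (best);
}

/*
 * Resolve a name as it would be written to the module parameter:
 * "fastest" maps to the benchmark winner, anything else must match an
 * entry exactly. Unknown names return NULL.
 */
const struct bench_result *
select_impl(const struct bench_result *results, size_t count,
    const char *name)
{
	if (strcmp(name, "fastest") == 0)
		return (pick_fastest(results, count));
	for (size_t i = 0; i < count; i++) {
		if (strcmp(results[i].name, name) == 0)
			return (&results[i]);
	}
	return (NULL);
}
```

With the table above, `select_impl(sha256_bench, sha256_bench_count, "fastest")` resolves to the sha-ni entry.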

cybojanek avatar Nov 15 '21 02:11 cybojanek

@tonynguien Hi! I think this is ready for review.

@rincebrain Helped me fix a few things in the initial review here https://github.com/cybojanek/zfs/pull/1

cybojanek avatar Nov 15 '21 02:11 cybojanek

Hey,

Nov 25 12:41:52 wip kernel: [    8.713282] CFI failure (target: sha256_avx_transform+0x0/0x8 [icp]):
Nov 25 12:41:52 wip kernel: [   13.354139] CFI failure (target: sha256_ssse3_transform+0x0/0x8 [icp]):

Please view:

ICP: Add missing stack frame info to SHA asm files

CFI directives

AndyLavr avatar Nov 27 '21 08:11 AndyLavr

Hey,

Nov 25 12:41:52 wip kernel: [    8.713282] CFI failure (target: sha256_avx_transform+0x0/0x8 [icp]):
Nov 25 12:41:52 wip kernel: [   13.354139] CFI failure (target: sha256_ssse3_transform+0x0/0x8 [icp]):

Please view:

ICP: Add missing stack frame info to SHA asm files

CFI directives

How did you see those warning messages - do they just show up when you load the module?

cybojanek avatar Nov 27 '21 15:11 cybojanek

How did you see those warning messages - do they just show up when you load the module?

They show up during boot, in dmesg. I build the Linux kernel with Clang + LTO + CFI. This is debug output from Control-Flow Integrity (CFI).

[8.084305] ------------[ cut here ]------------
[8.084335] CFI failure (target: sha256_avx_transform+0x0/0x8 [icp]):
[8.084385] WARNING: CPU: 4 PID: 357 at kernel/cfi.c:29 __ubsan_handle_cfi_check_fail+0x31/0x40
[8.084417] Modules linked in: icp(+) zzstd zcommon znvpair spl zlib_deflate amdgpu iommu_v2 gpu_sched drm_ttm_helper ttm i2c_algo_bit drm_kms_helper cec sysimgblt syscopyarea sysfillrect aesni_intel fb_sys_fops crypto_simd psmouse input_leds cryptd serio_raw drm wmi video mac_hid
[8.085710] CPU: 4 PID: 357 Comm: modprobe Tainted: G        W         5.16.0-generic #20211125 6122db310810d441500a2ad55360e9f50df200be
[8.086967] Hardware name: Dell Inc. Precision M6600/04YY4M, BIOS A18 09/14/2018
[8.088225] RIP: 0010:__ubsan_handle_cfi_check_fail+0x31/0x40
[8.089485] Code: 89 f3 48 c7 c7 00 00 05 93 48 c7 c6 ac 05 ee 8c e8 34 2d 56 00 85 c0 75 02 5b c3 48 c7 c7 f3 23 e6 8c 48 89 de e8 5f c2 e1 ff <0f> 0b 5b c3 00 00 cc cc 00 00 cc cc 00 00 cc 0f 1f 44 00 00 c3 00
[8.090823] RSP: 0018:ffffb72d8143ba20 EFLAGS: 00010246
[8.092206] RAX: 9f33b67afe331200 RBX: ffffffffc114f750 RCX: 0000000000000002
[8.093598] RDX: ffffb72d8143b8d0 RSI: 0000000000000004 RDI: 00000000ffffffff
[8.094969] RBP: ffffb72d8143bca8 R08: ffffffff90f90000 R09: 0000000000000000
[8.096318] R10: 00000000ffffdfff R11: 00000000ffffffff R12: ffffffffc114f510
[8.097680] R13: ffffffffc11a8c78 R14: ffff8a3eb4ff0000 R15: ffffffffc114f750
[8.099041] FS:  00007f6d523a4b80(0000) GS:ffff8a415db00000(0000) knlGS:0000000000000000
[8.100411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[8.101786] CR2: 00007ffca8385508 CR3: 0000000169b44004 CR4: 00000000000606e0
[8.103174] Call Trace:
[8.104556]  <TASK>
[8.105934]  sha256_alg_impl_benchmark+0xa4/0xb0 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.107364]  alg_impl_init+0x2b0/0x4b0 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.108777]  ? _raw_spin_unlock+0x12/0x30
[8.110166]  ? kcf_do_notify+0xe1/0x110 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.111579]  ? crypto_register_provider+0x69c/0x6f0 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.112995]  ? crypto_register_provider+0x69c/0x6f0 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.114394]  ? _raw_write_lock+0x13/0x30
[8.115767]  ? _raw_write_unlock+0x12/0x30
[8.117127]  ? proc_register+0x19a/0x1b0
[8.118487]  ? 0xffffffffc1133000
[8.119830]  ? rcu_nmi_exit+0x1f/0x80
[8.121166]  ? rcu_irq_exit_irqson+0x2d/0x60
[8.122496]  ? sha512_ssse3_will_work+0x8/0x8 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.123853]  ? crypto_digest_init+0x10/0x10 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.125174]  sha2_mod_init+0x12/0x60 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.126457]  init_module+0x2f/0x1000 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.127715]  do_one_initcall+0xa7/0x260
[8.128952]  do_init_module+0x5a/0x230
[8.130150]  load_module+0x196d/0x1ad0
[8.131306]  ? __x64_sys_rmdir+0x8/0x8
[8.132446]  __x64_sys_finit_module+0xad/0xe0
[8.133588]  do_syscall_64+0x93/0x130
[8.134729]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[8.135871] RIP: 0033:0x7f6d524c794d
[8.137012] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 64 0f 00 f7 d8 64 89 01 48
[8.138226] RSP: 002b:00007ffca8389598 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[8.139439] RAX: ffffffffffffffda RBX: 00005622d7f9bab0 RCX: 00007f6d524c794d
[8.140652] RDX: 0000000000000000 RSI: 00005622d5fd9c12 RDI: 0000000000000008
[8.141853] RBP: 0000000000060000 R08: 0000000000000000 R09: 0000000000000002
[8.143063] R10: 0000000000000008 R11: 0000000000000246 R12: 00005622d5fd9c12
[8.144277] R13: 00005622d7fa34a0 R14: 00005622d7fa2c20 R15: 00005622d7f9ca70
[8.145496]  </TASK>
[8.146708] ---[ end trace 302011d136109a85 ]---
[8.148064] ------------[ cut here ]------------
[13.354137] ------------[ cut here ]------------
[13.354139] CFI failure (target: sha256_ssse3_transform+0x0/0x8 [icp]):
[13.354160] WARNING: CPU: 7 PID: 1545 at kernel/cfi.c:29 __ubsan_handle_cfi_check_fail+0x31/0x40
[13.354165] Modules linked in: intel_rapl_msr hid_generic dell_rbtn at24 intel_rapl_common dell_laptop x86_pkg_temp_thermal intel_powerclamp dell_smm_hwmon coretemp dell_wmi sparse_keymap crct10dif_pclmul snd_hda_codec_idt crc32_pclmul snd_hda_codec_generic ghash_clmulni_intel iwldvm ledtrig_audio snd_hda_codec_hdmi rapl dell_smbios mac80211 intel_cstate dcdbas snd_hda_intel libarc4 joydev wmi_bmof snd_intel_dspcfg dell_wmi_descriptor usbhid snd_intel_sdw_acpi iwlwifi i2c_i801 hid snd_hda_codec i2c_smbus sdhci_pci snd_hda_core mei_me cqhci cfg80211 sdhci mei snd_hwdep dell_smo8800 sch_fq tcp_htcp msr parport_pc ppdev parport ip_tables x_tables zfs zlua zunicode zavl icp zzstd zcommon znvpair spl zlib_deflate amdgpu iommu_v2 gpu_sched drm_ttm_helper ttm i2c_algo_bit drm_kms_helper cec sysimgblt syscopyarea sysfillrect aesni_intel fb_sys_fops crypto_simd psmouse input_leds cryptd serio_raw drm wmi video mac_hid
[13.354217] CPU: 7 PID: 1545 Comm: z_null_int Tainted: G        W         5.16.0-generic #20211125 6122db310810d441500a2ad55360e9f50df200be
[13.354220] Hardware name: Dell Inc. Precision M6600/04YY4M, BIOS A18 09/14/2018
[13.354221] RIP: 0010:__ubsan_handle_cfi_check_fail+0x31/0x40
[13.354224] Code: 89 f3 48 c7 c7 00 00 05 93 48 c7 c6 ac 05 ee 8c e8 34 2d 56 00 85 c0 75 02 5b c3 48 c7 c7 f3 23 e6 8c 48 89 de e8 5f c2 e1 ff <0f> 0b 5b c3 00 00 cc cc 00 00 cc cc 00 00 cc 0f 1f 44 00 00 c3 00
[13.354225] RSP: 0018:ffffb72d8f11f930 EFLAGS: 00010246
[13.354227] RAX: 4b2119f55d67eb00 RBX: ffffffffc114f748 RCX: 0000000000000001
[13.354229] RDX: ffffffff8c20a9f6 RSI: ffffffff8cede298 RDI: 00000000ffffffff
[13.354230] RBP: ffffffffc114f748 R08: ffffffff90f90000 R09: 0000000000000000
[13.354231] R10: 00000000ffffdfff R11: 00000000ffffffff R12: 0000000000000700
[13.354232] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000001c000
[13.354234] FS:  0000000000000000(0000) GS:ffff8a415dbc0000(0000) knlGS:0000000000000000
[13.354236] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13.354237] CR2: 000055de9c41a2ac CR3: 0000000262e0c006 CR4: 00000000000606e0
[13.354238] Call Trace:
[13.354240]  <TASK>
[13.354242]  SHA2Update+0x2c6/0x320 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[13.354256]  sha_incremental+0x16/0x20 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354378]  abd_iterate_func+0x18d/0x280 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354476]  ? raidz_mul_abd_cb+0x8/0x8 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354574]  ? abd_checksum_off+0x8/0x8 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354671]  abd_checksum_SHA256+0x85/0xf0 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354771]  zio_checksum_error_impl+0x4a4/0x6c0 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354869]  ? resched_curr+0x24/0xf0
[13.354874]  ? ttwu_do_wakeup+0x32/0x1d0
[13.354877]  ? ttwu_queue+0xb6/0x130
[13.354880]  ? vdev_queue_io_to_issue+0x27e/0xbf0 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354978]  ? dmu_object_set_blocksize+0x10/0x10 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355075]  zio_checksum_error+0x88/0xd0 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355173]  zio_checksum_verify+0x9a/0x190 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355272]  ? zio_vdev_io_done+0x8/0x8 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355382]  zio_execute+0xc2/0x330 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355544]  ? raidz_syn_pq_abd+0x8/0x8 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355701]  taskq_thread+0x402/0x6b0 [spl 4890292e51f285e1e5ab30321e8e467563d4ed5f]
[13.355726]  ? zone_get_hostid+0x10/0x10 [spl 4890292e51f285e1e5ab30321e8e467563d4ed5f]
[13.355744]  ? raidz_syn_pq_abd+0x8/0x8 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355922]  ? crgetgroups+0x10/0x10 [spl 4890292e51f285e1e5ab30321e8e467563d4ed5f]
[13.355941]  kthread+0x1a4/0x1e0
[13.355946]  ? io_wqe_worker+0x8/0x8
[13.355953]  ret_from_fork+0x22/0x30
[13.355959]  </TASK>
[13.355960] ---[ end trace 302011d136109a8f ]---
[13.356592] ------------[ cut here ]------------

AndyLavr avatar Nov 28 '21 07:11 AndyLavr

@AndyLavr @cybojanek, that's because the SHA-{256,512} implementations are being cast:

sha256_impl = (sha256_block_f)(ops->ctx);
sha512_impl = (sha512_block_f)(ops->ctx);

Function/callback casts are not allowed with Clang CFI, which treats them as attacks. I'd say they should be avoided in general; there's always a way to get strict matches.

(just BTW, I have a commit here in ZFS repo where I fixed all function casts found with ZTS: 23c13c7e807e)
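
One way to get such a strict match, sketched under assumptions: `sha256_block_f` is the type name from the snippet above, but its signature here is a guess, and `toy_transform_asm`, `toy_transform`, `sha256_ops_t` and `sha256_run` are hypothetical. The toy routine is not SHA-256; it only counts blocks so the call path can be checked.

```c
#include <stddef.h>
#include <stdint.h>

/* The single prototype every block function is called through (assumed). */
typedef void (*sha256_block_f)(uint32_t state[8], const void *data,
    size_t num_blocks);

/*
 * Toy stand-in for an assembly routine with a different argument order.
 * NOT a real hash: it just adds the block count to state[0].
 */
void
toy_transform_asm(const void *data, uint32_t *state, uint64_t num_blocks)
{
	(void) data;
	state[0] += (uint32_t)num_blocks;
}

/* Thin wrapper whose signature matches sha256_block_f exactly. */
void
toy_transform(uint32_t state[8], const void *data, size_t num_blocks)
{
	toy_transform_asm(data, state, (uint64_t)num_blocks);
}

/* The ops table stores a typed pointer, so no cast from void * is needed. */
typedef struct {
	const char *name;
	sha256_block_f transform;
} sha256_ops_t;

const sha256_ops_t toy_ops = { "toy", toy_transform };

/* Indirect call: the pointer type and the callee's type agree. */
void
sha256_run(const sha256_ops_t *ops, uint32_t state[8], const void *data,
    size_t num_blocks)
{
	ops->transform(state, data, num_blocks);
}
```

The design point is that the adaptation happens in a normal C function (the wrapper) rather than in a cast, so the indirect call target always has the type the call site expects.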

solbjorn avatar Dec 02 '21 12:12 solbjorn

@solbjorn

(just BTW, I have a commit here in ZFS repo where I fixed all function casts found with ZTS: 23c13c7)

You're right. Thanks! :)

AndyLavr avatar Dec 02 '21 19:12 AndyLavr

I gave this branch a spin on my i7-3770 ... I guess it coulda gone better. :smiling_face_with_tear:

Benchmark numbers look fine, this CPU doesn't have NI but sse/avx are a pretty good win:

/proc/spl/kstat/zfs/sha256_bench:4 0 0x01 -1 0 118,315,404,492 536,989,801,750
/proc/spl/kstat/zfs/sha256_bench:implementation   bytes/second   
/proc/spl/kstat/zfs/sha256_bench:fastest          320,043,365      
/proc/spl/kstat/zfs/sha256_bench:generic          203,498,602      
/proc/spl/kstat/zfs/sha256_bench:x86_64           286,613,247      
/proc/spl/kstat/zfs/sha256_bench:sha-avx          243,893,112      
/proc/spl/kstat/zfs/sha256_bench:sha-ssse3        320,043,365      
/proc/spl/kstat/zfs/sha512_bench:5 0 0x01 -1 0 118,325,357,366 536,989,817,440
/proc/spl/kstat/zfs/sha512_bench:implementation   bytes/second   
/proc/spl/kstat/zfs/sha512_bench:fastest          481,871,280      
/proc/spl/kstat/zfs/sha512_bench:generic          339,002,689      
/proc/spl/kstat/zfs/sha512_bench:x86_64           438,785,950      
/proc/spl/kstat/zfs/sha512_bench:sha-avx          481,871,280      
/proc/spl/kstat/zfs/sha512_bench:sha-ssse3        468,769,935

However, stability is a real problem; while heavily accessing my datasets which use SHA512, I get:

  • intermittent I/O errors on the dataset
  • intermittent crashes in other userspace processes which don't touch these datasets, i.e.:
[  384.071162] traps: python3[62580] trap stack segment ip:517df2 sp:7ffe09d50220 error:0 in python3.8[423000+295000]
[  410.817533] traps: depmod[68016] general protection fault ip:7f674e271870 sp:7fffccefcc10 error:0 in libc-2.31.so[7f674e1f9000+178000]
[  425.750946] traps: sed[71836] general protection fault ip:557320257384 sp:7fff34cf7f40 error:0 in sed[55732024d000+13000]

adamdmoss avatar Dec 29 '21 19:12 adamdmoss

FWIW, I'm using a PREEMPT kernel: Linux version 5.8.0-59-lowlatency (buildd@lcy01-amd64-022) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #66~20.04.1-Ubuntu SMP PREEMPT Thu Jun 17 13:03:02 UTC 2021

vendor_id	: GenuineIntel
cpu family	: 6
model		: 58
model name	: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping	: 9
microcode	: 0x21
cpu MHz		: 1706.813
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips	: 6784.41
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual

My amateur guess as to the cause of the problem is that it's the same issue that affected the zfs x64 crypto code; I vaguely recall it's something like, Linux doesn't let non-GPL modules save and restore FPU/spicy regs across context switches, and ZFS' workaround for the crypto code was to explicitly forbid pre-emption for the duration of the crypto code...

adamdmoss avatar Dec 29 '21 19:12 adamdmoss

My amateur guess as to the cause of the problem is that it's the same issue that affected the zfs x64 crypto code;

Your guess seems right. Since gcc uses SIMD instructions to optimize plain C code, failing to preserve the FPU state affects all kinds of software, not just programs doing floating-point calculations. So your depmod and python failures line up with your guess.

ZFS' workaround for the crypto code was to explicitly forbid pre-emption for the duration of the crypto code

Close but not quite. GPL modules can tell the kernel to save and restore FPU state on context switches, meaning the overhead is only incurred there. Being CDDL, we can't. On any FPU use we have to disable preemption and save the FPU state, then do the reverse when we're done. This incurs quite a big overhead, making the use of SIMD instructions considerably less effective, especially if there are only a few.
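
A userspace sketch of that pattern, with stubs in place of the kernel primitives: `fpu_section_begin()`, `fpu_section_end()`, `simd_block()` and `process_blocks()` are all hypothetical names, and the stubs only count calls. It shows why the save/restore pair should wrap a whole run of blocks rather than each block.

```c
#include <stddef.h>
#include <stdint.h>

/* Bookkeeping for the stubs below (illustration only). */
unsigned int fpu_saves = 0;	/* how many times state was "saved" */
int fpu_section_depth = 0;	/* nonzero while inside a section */

/* Stub: a kernel would disable preemption here, then save FPU state. */
void
fpu_section_begin(void)
{
	fpu_section_depth++;
	fpu_saves++;
}

/* Stub: a kernel would restore FPU state, then re-enable preemption. */
void
fpu_section_end(void)
{
	fpu_section_depth--;
}

/* Toy per-block work standing in for a SIMD transform (not a hash). */
void
simd_block(uint32_t *state, const uint8_t *block)
{
	state[0] += block[0];
}

/*
 * Pay the save/restore cost once for the whole buffer. Calling
 * begin/end inside the loop would cost num_blocks saves instead of one.
 */
void
process_blocks(uint32_t *state, const uint8_t *data, size_t num_blocks)
{
	fpu_section_begin();
	for (size_t i = 0; i < num_blocks; i++)
		simd_block(state, data + i * 64);	/* 64-byte blocks */
	fpu_section_end();
}
```

The trade-off in a real kernel is that preemption stays disabled for the whole section, so the batch size cannot grow without bound.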

AttilaFueloep avatar Dec 29 '21 21:12 AttilaFueloep

Stability appears good now, thanks. I'll keep an eye on it.

adamdmoss avatar Jan 15 '22 19:01 adamdmoss

Close but not quite. GPL modules can tell the kernel to save and restore FPU state on context switches, meaning the overhead only takes place there. Being CDDL we can't.

On a 5+ kernel that doesn't have the required functions patched back in (like the Liquorix kernel builds), right?

RJVB avatar Jan 17 '22 13:01 RJVB

Is this ready to go or is it still a WIP?

adamdmoss avatar Apr 24 '22 02:04 adamdmoss

When BLAKE3 is in, I would like to add some additional sha256 and sha512 SIMD code. I am currently working on public domain code for choosing the implementation for Intel x86-64, PPC64 and aarch64 architectures....

mcmilk avatar Apr 24 '22 17:04 mcmilk

GitHub seems to have blackholed my email reply, but: If you'd like to go poke around some more sources of implementations (since more is always better, right), Intel has their implementations of various crypto primitives for all sorts of subsets of x86 and I think aarch64 in a few places under BSD-3 and Apache-2 licenses over at https://github.com/intel/intel-ipsec-mb and https://github.com/intel/ipp-crypto

rincebrain avatar Apr 24 '22 20:04 rincebrain

GitHub seems to have blackholed my email reply, but: If you'd like to go poke around some more sources of implementations (since more is always better, right), Intel has their implementations of various crypto primitives for all sorts of subsets of x86 and I think aarch64 in a few places under BSD-3 and Apache-2 licenses over at https://github.com/intel/intel-ipsec-mb and https://github.com/intel/ipp-crypto

Yes, I know these sources, but I searched for public domain code in the first place ... and will implement SSE2 CC0 code ... which can then be reused on PPC, AARCH64 and so on via SIMDE ... But of course, the Intel AVX2+AVX512 MIT code will be used also.

mcmilk avatar Apr 25 '22 05:04 mcmilk

Glad to see some activity here!

I pushed out a one line change I forgot to push out a while ago.

cybojanek avatar Apr 26 '22 00:04 cybojanek

@cybojanek Could you rebase this and give a status update? Or has this PR been put on hold because of the work mcmilk mentioned here? Many thanks anyway.

jumbi77 avatar Aug 05 '22 18:08 jumbi77

@jumbi77 I started an RFC pull request - you could give it a try: https://github.com/openzfs/zfs/pull/13741 But please don't put any important data onto it... it's just a beginning... hence the RFC.

mcmilk avatar Aug 05 '22 19:08 mcmilk

@cybojanek Could you rebase this and give a status update? Or has this PR been put on hold because of the work mcmilk mentioned here? Many thanks anyway.

@mcmilk How does your branch compare to this one?

At a quick glance, it looks like you forked off of master, but also have some new code for generalizing the impl stuff?

Is it only for freebsd? Or also for Linux? Does it do benchmarking?

I don't want to duplicate work nor effort.

cybojanek avatar Aug 08 '22 15:08 cybojanek

My branch does the same for bsd and linux... it's not ready, but I will also try to include the hardware specific impls. of freebsd as well... The intel code you have used is always a bit slower, so I used openssl. But my branch isn't finished currently. Generic x86-64 and armv4 code needs work... and also the testing of all the changes....

Edit: the first commit of my branch removes ALL old SHA2 stuff... and re-implements the generic function with public domain code. The old Sun/Solaris impl. is history then. The generic code is faster and smaller. But I don't know what the OpenZFS team thinks about this, so I started with this RFC ... if they say it is generally okay, then I will fix the remaining issues.

mcmilk avatar Aug 08 '22 15:08 mcmilk

If you'd like to go poke around some more sources of implementations, Intel has their implementations for all sorts of subsets of x86 and I think aarch64 in a few places under BSD-3 and Apache-2 over at https://github.com/intel/intel-ipsec-mb and https://github.com/intel/ipp-crypto

On Sun, Apr 24, 2022 at 1:03 PM Tino Reichardt @.***> wrote:

When BLAKE3 https://github.com/openzfs/zfs/pull/12918 is in, I would like to add some additional sha256 and sha512 SIMD code. I am currently working https://github.com/mcmilk/sha2-testing on public domain code for choosing the implementation for Intel x86-64, PPC64 and aarch64 architectures....


rincebrain avatar Oct 11 '22 09:10 rincebrain

If you'd like to go poke around some more sources of implementations, Intel has their implementations for all sorts of subsets of x86 and I think aarch64 in a few places under BSD-3 and Apache-2 over at https://github.com/intel/intel-ipsec-mb and https://github.com/intel/ipp-crypto

On Sun, Apr 24, 2022 at 1:03 PM Tino Reichardt @.> wrote:

When BLAKE3 <#12918> is in, I would like to add some additional sha256 and sha512 SIMD code. I am currently working https://github.com/mcmilk/sha2-testing on public domain code for choosing the implementation for Intel x86-64, PPC64 and aarch64 architectures....

Are these not okay?

  • sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64)
  • sha512-x86_64.S: x64, AVX, AVX2 (x86_64)
  • sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm)
  • sha512-armv7.S: ARMv7, NEON (arm)
  • sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64)
  • sha512-armv8.S: ARMv7, ARMv8-CE (aarch64)
  • sha256-ppc.S: Generic PPC64 LE/BE (ppc64)
  • sha512-ppc.S: Generic PPC64 LE/BE (ppc64)
  • sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
  • sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)

They are all ready and seem to work nicely - https://github.com/openzfs/zfs/pull/13741.

mcmilk avatar Oct 11 '22 10:10 mcmilk

@cybojanek - we can maybe close this pull request?

You can see here: https://github.com/mcmilk/sha2-testing - why I have preferred the openssl variants over the Intel ones.

When you try out the current master branch, you can re-check the benchmarks via cat /proc/spl/kstat/zfs/chksum_bench.

mcmilk avatar Mar 05 '23 06:03 mcmilk

I'm glad something got merged in.

Looking forward to seeing it propagate to my distro :D

cybojanek avatar Mar 05 '23 17:03 cybojanek