pool with ashift 12 on luks2 devices with sector size 4k causes repeated io errors
System information
Type | Version/Name
---|---
Distribution Name | arch
Distribution Version | n/a
Kernel Version | 5.19.13-arch1-1
Architecture | x86_64
OpenZFS Version | 2.1.6-1
Describe the problem you're observing
With an existing pool with ashift 12:
y@arnold % zpool get ashift tank
NAME PROPERTY VALUE SOURCE
tank ashift 12 local
I create a new vdev with `cryptsetup luksFormat /dev/sdw`, using cryptsetup version 2.6.1:
y@arnold % cryptsetup -V
cryptsetup 2.6.1 flags: UDEV BLKID KEYRING KERNEL_CAPI
This results in a luks2 device (used as a vdev) with a sector size of 4096. Note that `/dev/sdw` is the underlying device (a 14 TB hard drive), and `/dev/mapper/z14.3` is the cryptsetup device on top of `sdw`.
y@arnold % sudo blockdev --report /dev/sdw
RO RA SSZ BSZ StartSec Size Device
rw 256 512 4096 0 14000519643136 /dev/sdw
y@arnold % sudo blockdev --report /dev/mapper/z14.3
RO RA SSZ BSZ StartSec Size Device
rw 256 4096 4096 0 14000502865920 /dev/mapper/z14.3
y@arnold % sudo dmsetup table | grep z14.3
z14.3: 0 27344732160 crypt aes-xts-plain64 :64:logon:cryptsetup:e5b69a87-e02c-4b14-be58-6109488e172a-d0 0 65:96 32768 2 allow_discards sector_size:4096
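(As a cross-check, not in the original report: the same logical/physical sector sizes can also be read with `lsblk`.)

```sh
# Cross-check of logical (LOG-SEC) and physical (PHY-SEC) sector sizes
lsblk --nodeps -o NAME,LOG-SEC,PHY-SEC /dev/sdw /dev/mapper/z14.3
```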
y@arnold console % sudo cryptsetup luksDump /dev/sdw
LUKS header information
Version: 2
Epoch: 3
Metadata area: 16384 [bytes]
Keyslots area: 16744448 [bytes]
UUID: e5b69a87-e02c-4b14-be58-6109488e172a
Label: (no label)
Subsystem: (no subsystem)
Flags: (no flags)
Data segments:
0: crypt
offset: 16777216 [bytes]
length: (whole device)
cipher: aes-xts-plain64
sector: 4096 [bytes]
Keyslots:
0: luks2
Key: 512 bits
Priority: normal
Cipher: aes-xts-plain64
Cipher key: 512 bits
PBKDF: argon2id
Time cost: 4
Memory: 1048576
Threads: 4
Salt: xxxx
AF stripes: 4000
AF hash: sha256
Area offset:32768 [bytes]
Area length:258048 [bytes]
Digest ID: 0
Tokens:
Digests:
0: pbkdf2
Hash: sha256
Iterations: 67632
Salt: xxxx
Digest: xxxxx
y@arnold console %
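(If I read cryptsetup's defaults correctly, 2.x auto-selects the LUKS2 data sector size from the drive's physical block size, so the plain `luksFormat` above behaves roughly like the first command in this sketch; the second shows how one could pin 512-byte sectors while keeping a luks2 header. A sketch, not from the original report:)

```sh
# cryptsetup 2.x picks a 4096-byte LUKS2 data sector on a 4K-physical drive,
# so the default invocation here is approximately:
sudo cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/sdw

# Pinning 512-byte sectors while keeping the luks2 header:
sudo cryptsetup luksFormat --type luks2 --sector-size 512 /dev/sdw
```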
I then add it to my existing pool with `zpool replace tank z4.2 z14.3`. Eventually (before the replace/resilver completes), zio reports errors and the vdev is considered failed:
y@arnold ~ % zpool status tank | grep -E '(z14.3|NAME)'
NAME STATE READ WRITE CKSUM
z14.3 FAULTED 0 20 0 too many errors
y@arnold ~ % journalctl -f | grep zio
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3772528418816 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3706396516352 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3790784475136 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3980643770368 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3338247639040 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3602675863552 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3504724443136 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3495903948800 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3628836184064 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447128170496 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447125979136 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447126523904 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447126519808 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447126515712 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447130460160 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447139495936 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447147290624 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3011713654784 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3011713769472 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3011714031616 size=4096 flags=1808aa
Next, after detaching the vdev from the pool and running `cryptsetup close`, I use `cryptsetup luksFormat --type luks1` instead (to force a luks1 header instead of luks2). This results in a vdev with 512-byte sectors:
y@arnold ~ % sudo cryptsetup luksDump /dev/sdw
LUKS header information for /dev/sdw
Version: 1
Cipher name: aes
Cipher mode: xts-plain64
Hash spec: sha256
Payload offset: 4096
MK bits: 512
MK digest: xxxx
MK salt: xxxx
MK iterations: 70468
UUID: e5b69a87-e02c-4b14-be58-6109488e172a
Key Slot 0: ENABLED
Iterations: 1034098
Salt: xxxx
Key material offset: 8
AF stripes: 4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
y@arnold ~ % sudo dmsetup table | grep z14.3
z14.3: 0 27344760832 crypt aes-xts-plain64 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0 65:96 4096 1 allow_discards
y@arnold ~ % sudo blockdev --report /dev/mapper/z14.3
RO RA SSZ BSZ StartSec Size Device
rw 256 512 4096 0 14000517545984 /dev/mapper/z14.3
With this vdev (luks1, 512-byte sectors), no zio errors are observed and the `zpool replace` completes successfully.
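(For completeness, a sketch of that workaround sequence using the device/pool names from this report; the exact `cryptsetup open` options and key handling aren't shown above and are assumed:)

```sh
sudo zpool detach tank z14.3                      # drop the faulted vdev
sudo cryptsetup close z14.3                       # tear down the dm mapping
sudo cryptsetup luksFormat --type luks1 /dev/sdw  # luks1 => 512-byte sectors
sudo cryptsetup open /dev/sdw z14.3               # recreate the mapping
sudo zpool replace tank z4.2 z14.3                # retry the replace
```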
https://github.com/openzfs/zfs/issues/13431 describes an issue that might be the same or related (it likely involves newer versions of cryptsetup, which default to luks2 headers). It shows similar zio errors.
https://github.com/openzfs/zfs/issues/13362 also involves luks and has similar zio errors, but observes other symptoms (io hangs, though perhaps those come from different operations issuing the io).
Can you share what `zdb -C` thinks of the situation, and what `z4.2`'s settings are?
> Can you share what `zdb -C` thinks of the situation, and what `z4.2`'s settings are?
Here's the `zdb -C tank` output (`tank` is the pool where the replace is occurring). Is there some other zdb info needed?
`zdb -C tank`
MOS Configuration:
version: 5000
name: 'tank'
state: 0
txg: 402676914
pool_guid: 17643633366051522574
errata: 0
hostname: 'arnold'
com.delphix:has_per_vdev_zaps
vdev_children: 11
vdev_tree:
type: 'root'
id: 0
guid: 17643633366051522574
children[0]:
type: 'mirror'
id: 0
guid: 2801673922231546480
metaslab_array: 34
metaslab_shift: 34
ashift: 12
asize: 8001556316160
is_log: 0
create_txg: 4
com.delphix:vdev_zap_top: 177
children[0]:
type: 'disk'
id: 0
guid: 9621628441388751899
path: '/dev/mapper/z8.3'
devid: 'dm-uuid-CRYPT-LUKS1-cb092159d4cb402fa46d68ecab77120a-d8.0'
whole_disk: 0
DTL: 44228
create_txg: 4
com.delphix:vdev_zap_leaf: 454
children[1]:
type: 'disk'
id: 1
guid: 16125960469162989774
path: '/dev/mapper/z8.1'
devid: 'dm-uuid-CRYPT-LUKS1-6d64f9c2c7c144b5be9958fdc2251c37-d8.1'
whole_disk: 0
DTL: 44226
create_txg: 4
com.delphix:vdev_zap_leaf: 388
children[1]:
type: 'mirror'
id: 1
guid: 15581715015551336503
metaslab_array: 383
metaslab_shift: 34
ashift: 12
asize: 8001556316160
is_log: 0
create_txg: 5671340
com.delphix:vdev_zap_top: 180
children[0]:
type: 'disk'
id: 0
guid: 7035198730005854437
path: '/dev/mapper/z8.5'
devid: 'dm-uuid-CRYPT-LUKS1-b0d4731323ee47cba452ba4a65899afe-d8.2'
whole_disk: 0
DTL: 44234
create_txg: 5671340
com.delphix:vdev_zap_leaf: 459
children[1]:
type: 'disk'
id: 1
guid: 516353021259712754
path: '/dev/mapper/z8.4'
devid: 'dm-uuid-CRYPT-LUKS1-b1981692b6d248658979dc934a772f50-d8.3'
whole_disk: 0
DTL: 44233
create_txg: 5671340
com.delphix:vdev_zap_leaf: 178
children[2]:
type: 'mirror'
id: 2
guid: 4796807142241909275
metaslab_array: 260516
metaslab_shift: 35
ashift: 12
asize: 4000746569728
is_log: 0
create_txg: 342952041
com.delphix:vdev_zap_top: 260789
children[0]:
type: 'replacing'
id: 0
guid: 11828178693859806967
whole_disk: 0
create_txg: 342952041
children[0]:
type: 'disk'
id: 0
guid: 11005066825353160123
path: '/dev/mapper/z4.2'
devid: 'dm-uuid-CRYPT-LUKS1-3b6a0f4bfd9947e6916c4d3419ff8757-d4.0'
whole_disk: 0
DTL: 44232
create_txg: 342952041
com.delphix:vdev_zap_leaf: 260790
children[1]:
type: 'disk'
id: 1
guid: 5900828425635517331
path: '/dev/mapper/z14.3'
whole_disk: 0
DTL: 6443865
create_txg: 342952041
com.delphix:vdev_zap_leaf: 6444336
com.datto:resilver_defer
resilver_txg: 402676914
children[1]:
type: 'disk'
id: 1
guid: 4100891465398702400
path: '/dev/mapper/z14.0'
whole_disk: 0
DTL: 6100055
create_txg: 342952041
com.delphix:vdev_zap_leaf: 6100053
children[3]:
type: 'mirror'
id: 3
guid: 5898261406907484203
metaslab_array: 260385
metaslab_shift: 34
ashift: 12
asize: 3000552128512
is_log: 0
create_txg: 342952048
com.delphix:vdev_zap_top: 260792
children[0]:
type: 'disk'
id: 0
guid: 17322213120823058356
path: '/dev/mapper/z12.8'
whole_disk: 0
DTL: 5124768
create_txg: 342952048
com.delphix:vdev_zap_leaf: 5124424
children[1]:
type: 'disk'
id: 1
guid: 7505559382419031062
path: '/dev/mapper/z8.2'
devid: 'dm-uuid-CRYPT-LUKS1-d353aef34afa4f8a9f604c233dd3feae-z8.2'
whole_disk: 0
DTL: 7174672
create_txg: 342952048
com.delphix:vdev_zap_leaf: 7174796
children[4]:
type: 'mirror'
id: 4
guid: 16229305079822685651
metaslab_array: 85023
metaslab_shift: 36
ashift: 12
asize: 10000824532992
is_log: 0
create_txg: 356648774
com.delphix:vdev_zap_top: 85300
children[0]:
type: 'disk'
id: 0
guid: 16663140545990328976
path: '/dev/mapper/z10.2'
devid: 'dm-uuid-CRYPT-LUKS1-f5cc909895b94592870402de317ff4a8-d10.0'
whole_disk: 0
DTL: 94579
create_txg: 356648774
com.delphix:vdev_zap_leaf: 85304
children[1]:
type: 'disk'
id: 1
guid: 16893451224641453851
path: '/dev/mapper/z10.3'
devid: 'dm-uuid-CRYPT-LUKS1-999355cf19c443beaff84ba5b30f82c7-d10.1'
whole_disk: 0
DTL: 94578
create_txg: 356648774
com.delphix:vdev_zap_leaf: 85308
children[5]:
type: 'mirror'
id: 5
guid: 4319439574465811368
metaslab_array: 117706
metaslab_shift: 34
ashift: 12
asize: 10000824532992
is_log: 0
create_txg: 358806971
com.delphix:vdev_zap_top: 117703
children[0]:
type: 'disk'
id: 0
guid: 4473711824735588132
path: '/dev/mapper/z10.4'
devid: 'dm-uuid-CRYPT-LUKS1-a2c194b0e0cb45da88e2b363ade7f274-d10.2'
whole_disk: 0
DTL: 4017
create_txg: 358806971
com.delphix:vdev_zap_leaf: 117704
children[1]:
type: 'disk'
id: 1
guid: 11559068356103030222
path: '/dev/mapper/z10.6'
devid: 'dm-uuid-CRYPT-LUKS1-daba9c630d4a41a79dd416eb50357bea-d10.3'
whole_disk: 0
DTL: 4016
create_txg: 358806971
com.delphix:vdev_zap_leaf: 117705
children[6]:
type: 'mirror'
id: 6
guid: 2479621637901868764
metaslab_array: 331711
metaslab_shift: 34
ashift: 12
asize: 10000824532992
is_log: 0
create_txg: 359781354
com.delphix:vdev_zap_top: 332160
children[0]:
type: 'disk'
id: 0
guid: 6722608118199084570
path: '/dev/mapper/z10.5'
devid: 'dm-uuid-CRYPT-LUKS1-531546f7f1104507bf8756c7323a4be8-d10.4'
whole_disk: 0
DTL: 654007
create_txg: 359781354
com.delphix:vdev_zap_leaf: 331593
children[1]:
type: 'disk'
id: 1
guid: 12670541344625969211
path: '/dev/mapper/z10.1'
devid: 'dm-uuid-CRYPT-LUKS1-0ba5ea88895d416cae69a011852c1afd-d10.5'
whole_disk: 0
DTL: 654004
create_txg: 359781354
com.delphix:vdev_zap_leaf: 331594
children[7]:
type: 'mirror'
id: 7
guid: 2296358563542467893
metaslab_array: 3388903
metaslab_shift: 34
ashift: 12
asize: 12000117129216
is_log: 0
create_txg: 363157204
com.delphix:vdev_zap_top: 3389322
children[0]:
type: 'disk'
id: 0
guid: 6283371758920790449
path: '/dev/mapper/z12.0'
devid: 'dm-uuid-CRYPT-LUKS2-bb711a67822b4a508e522664de603c12-z12.0'
whole_disk: 0
DTL: 3411984
create_txg: 363157204
com.delphix:vdev_zap_leaf: 3389323
children[1]:
type: 'disk'
id: 1
guid: 4263933087880235999
path: '/dev/mapper/z12.2'
devid: 'dm-uuid-CRYPT-LUKS2-9aadb9076c4a469bb795f15ae210a421-z12.2'
whole_disk: 0
DTL: 3411983
create_txg: 363157204
com.delphix:vdev_zap_leaf: 3389324
children[8]:
type: 'mirror'
id: 8
guid: 5328519653678850623
metaslab_array: 3388157
metaslab_shift: 34
ashift: 12
asize: 12000117129216
is_log: 0
create_txg: 363157208
com.delphix:vdev_zap_top: 3388279
children[0]:
type: 'disk'
id: 0
guid: 7274434324538727781
path: '/dev/mapper/z12.1'
devid: 'dm-uuid-CRYPT-LUKS2-5f80c3664ef84441b924042630b8b5e2-z12.1'
whole_disk: 0
DTL: 3411986
create_txg: 363157208
com.delphix:vdev_zap_leaf: 3388280
children[1]:
type: 'disk'
id: 1
guid: 13649946414862804989
path: '/dev/mapper/z12.3'
devid: 'dm-uuid-CRYPT-LUKS2-43faab20d5ee4a0d9bc806515ccd67ad-z12.3'
whole_disk: 0
DTL: 3411985
create_txg: 363157208
com.delphix:vdev_zap_leaf: 3388281
children[9]:
type: 'mirror'
id: 9
guid: 11144534952233533298
metaslab_array: 7474010
metaslab_shift: 34
ashift: 12
asize: 12000117129216
is_log: 0
create_txg: 371518314
com.delphix:vdev_zap_top: 7474463
children[0]:
type: 'disk'
id: 0
guid: 13495647563734244283
path: '/dev/mapper/z12.5'
devid: 'dm-uuid-CRYPT-LUKS2-7328ac33263b4b35addda5a1934f1cea-z12.5'
whole_disk: 0
DTL: 7412238
create_txg: 371518314
com.delphix:vdev_zap_leaf: 7474464
children[1]:
type: 'disk'
id: 1
guid: 8601802118682992312
path: '/dev/mapper/z12.6'
devid: 'dm-uuid-CRYPT-LUKS2-329d228eec274842b0aba3e029044aba-z12.6'
whole_disk: 0
DTL: 7412237
create_txg: 371518314
com.delphix:vdev_zap_leaf: 7474342
children[10]:
type: 'mirror'
id: 10
guid: 5574456015292304252
metaslab_array: 7613777
metaslab_shift: 34
ashift: 12
asize: 12000117129216
is_log: 0
create_txg: 371857357
com.delphix:vdev_zap_top: 7613774
children[0]:
type: 'disk'
id: 0
guid: 11633448460096827575
path: '/dev/mapper/z12.4'
whole_disk: 0
DTL: 7414491
create_txg: 371857357
com.delphix:vdev_zap_leaf: 7613775
children[1]:
type: 'disk'
id: 1
guid: 16104453434150649249
path: '/dev/mapper/z12.7'
whole_disk: 0
DTL: 7414490
create_txg: 371857357
com.delphix:vdev_zap_leaf: 7613776
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
space map refcount mismatch: expected 5484 != actual 5436
In the details, one can see the replacement of `z4.2` with `z14.3`. I previously replaced another disk (now `z14.0`) in the same way, which is why I know that using the `--type luks1` option resolves the IO errors.
Here are the details for `z4.2` (`sdy` is the backing device for `z4.2`):
y@arnold ~ % sudo blockdev --report /dev/mapper/z4.2
RO RA SSZ BSZ StartSec Size Device
rw 256 512 512 0 4000751378944 /dev/mapper/z4.2
y@arnold ~ % sudo blockdev --report /dev/sdy
RO RA SSZ BSZ StartSec Size Device
rw 256 512 512 0 4000753476096 /dev/sdy
y@arnold ~ % sudo cryptsetup luksDump /dev/sdy
LUKS header information for /dev/sdy
Version: 1
Cipher name: aes
Cipher mode: xts-plain64
Hash spec: sha512
Payload offset: 4096
MK bits: 256
MK digest: xxxx
MK salt: xxxx
MK iterations: 24750
UUID: 3b6a0f4b-fd99-47e6-916c-4d3419ff8757
Key Slot 0: ENABLED
Iterations: 99533
Salt: xxxx
Key material offset: 8
AF stripes: 4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
y@arnold ~ % sudo dmsetup table | grep z4.2
z4.2: 0 7813967537 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 65:128 4096 1 allow_discards
From the `zdb -C` output in the previous comment, one can see that some luks2 devices are in the pool (and they have not been observed to emit io errors).
Taking `z12.0` as an example device, here's some extra info about it:
y@arnold ~ % sudo blockdev --report /dev/mapper/z12.0
RO RA SSZ BSZ StartSec Size Device
rw 256 512 4096 0 12000121847808 /dev/mapper/z12.0
y@arnold ~ % sudo blockdev --report /dev/sdr
RO RA SSZ BSZ StartSec Size Device
rw 256 512 4096 0 12000138625024 /dev/sdr
y@arnold ~ % sudo cryptsetup luksDump /dev/sdr
LUKS header information
Version: 2
Epoch: 3
Metadata area: 16384 [bytes]
Keyslots area: 16744448 [bytes]
UUID: bb711a67-822b-4a50-8e52-2664de603c12
Label: (no label)
Subsystem: (no subsystem)
Flags: (no flags)
Data segments:
0: crypt
offset: 16777216 [bytes]
length: (whole device)
cipher: aes-xts-plain64
sector: 512 [bytes]
Keyslots:
0: luks2
Key: 512 bits
Priority: normal
Cipher: aes-xts-plain64
Cipher key: 512 bits
PBKDF: argon2i
Time cost: 6
Memory: 1048576
Threads: 4
Salt: xxxx
AF stripes: 4000
AF hash: sha256
Area offset:32768 [bytes]
Area length:258048 [bytes]
Digest ID: 0
Tokens:
Digests:
0: pbkdf2
Hash: sha256
Iterations: 141241
Salt: xxxx
Digest: xxxx
y@arnold ~ % sudo dmsetup table | grep z12.0
z12.0: 0 23437737984 crypt aes-xts-plain64 :64:logon:cryptsetup:bb711a67-822b-4a50-8e52-2664de603c12-d0 0 65:16 32768 1 allow_discards
Note that while `z12.0`/`sdr` (the device that works fine) uses luks2, it has a sector size of 512 bytes (not 4096 bytes like the devices that emit errors).
I did notice that, yes.
While trying to reproduce this, I discovered I can convince ZFS without much work to try replacing a 512n device with a 4kn device on an ashift 9 vdev, which, uh, Does Not End Well At All.
But that doesn't seem to be what happened to you here.
I'm now wondering if somehow it stashed something on the vdev not 4k-aligned, and because it was a 512n device that went fine, but now trying to 1:1 mirror it is going bonkers.
...o-oh. I had a bad idea, actually. I wonder if the partition isn't 4k aligned, and in trying to replicate the partition table it's resulting in non-4k aligned accesses...let me go read those IO errors you pasted again.
e: well, LUKS, so not exactly a partition, but like, the leading offset...anyway.
edit 2: are you seeing any errors not from ZFS in your syslog from the disk itself?
> edit 2: are you seeing any errors not from ZFS in your syslog from the disk itself?
Watching `dmesg -w | grep -v audit:` (to filter out all the noisy audit messages on my system) shows only the zio errors occurring.
Here's another set (I detached `z14.3` and re-ran `luksFormat` with luks2 to get this output):
y@arnold ~ % sudo dmesg -w | grep -v 'audit:'
[12188887.373045] sctp: Hash tables configured (bind 1024/1024)
[12188914.429696] kauditd_printk_skb: 10 callbacks suppressed
[12188919.981251] kauditd_printk_skb: 35 callbacks suppressed
[12191505.790358] kauditd_printk_skb: 126 callbacks suppressed
[12195028.411209] kauditd_printk_skb: 22 callbacks suppressed
[12198268.934797] kauditd_printk_skb: 27 callbacks suppressed
[12198402.341190] kauditd_printk_skb: 27 callbacks suppressed
[12198494.544891] kauditd_printk_skb: 27 callbacks suppressed
[12212089.961177] kauditd_printk_skb: 30 callbacks suppressed
[12212632.706882] kauditd_printk_skb: 6 callbacks suppressed
[12213959.402093] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1782776037376 size=20480 flags=1808aa
[12213965.097787] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1784842436608 size=4096 flags=1808aa
[12213968.581657] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1778357694464 size=12288 flags=1808aa
[12213979.887437] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1779879501824 size=4096 flags=1808aa
[12214226.435283] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3849360289792 size=20480 flags=1808aa
[12214227.015763] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3850519625728 size=32768 flags=1808aa
[12214756.511994] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3808536850432 size=4096 flags=1808aa
So: no errors from the device, the sd driver, etc., about unaligned writes (or any other error of any kind). The offsets listed in the zio messages are aligned to 4096, and all the logged sizes are also multiples of 4096.
As for offsets within the luks data, the `luksDump` output above gives the data segment offset as 16777216 bytes, which is divisible by 4096.

The `dmsetup table` output (again, included in previous comments) shows a start of 32768, which is correct if its units are 512-byte sectors (which they might be; the documentation here is unclear, and I haven't been able to work out what the linux kernel intends).

All of the luks formatting is done on the entire disk (iow, running `cryptsetup luksFormat /dev/sdv`), without any partitions. So the offset luks provides is the only offset on the physical disk before it is handed off to zfs.
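(A quick machine check of the alignment claim above; a sketch that assumes the zio lines sit in the kernel log in the format shown:)

```sh
# Print any zio offset= or size= value that is not a multiple of 4096.
dmesg | grep 'zio pool=tank' \
  | grep -oE '(offset|size)=[0-9]+' \
  | awk -F= '$2 % 4096 != 0 { print "misaligned:", $0 }'
```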
An interesting data point would be to see if the first 100 or so errors you get when doing the replace with a LUKS2 header are the same every time, as that might tell us more about whether it's deterministic or something very strange...
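(A sketch of how one might capture that; file names are illustrative:)

```sh
# Capture the first 100 zio errors from one replace attempt:
journalctl -kf | grep --line-buffered 'zio pool=tank' | head -n 100 > /tmp/zio-run1
# ...detach, reformat, replace again, capture /tmp/zio-run2 the same way, then:
diff /tmp/zio-run1 /tmp/zio-run2   # identical offsets would suggest a deterministic trigger
```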
@jmesmon thank you for this thread; I was really confused about whether I had a problem with XHCI, the disk, zfs, LUKS, or all of them together.

It seems, though, that the LUKS device sector size doesn't matter: I use disks with a 4K physical/logical sector size, so my LUKS device has a 4K sector size even with `--type luks1`:
blockdev --report /dev/sdb
RO RA SSZ BSZ StartSec Size Device
rw 1048544 4096 16384 0 16000900661248 /dev/sdb
blockdev --report /dev/mapper/crypt_disk-a
RO RA SSZ BSZ StartSec Size Device
rw 1048544 4096 4096 0 16000632229888 /dev/mapper/crypt_disk-a
I use `-o ashift=12`, but the math is simple: LUKS1 header = no problem; LUKS2 header = constant stream of zio errors like:
[ 5816.334320] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591015936 size=32768 flags=40080c80
[ 5816.334337] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=183667879936 size=8192 flags=40080c80
[ 5816.334353] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=1506734080 size=8192 flags=40080c80
[ 5816.335251] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=1506734080 size=4096 flags=188881
[ 5816.335556] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591028224 size=4096 flags=188881
[ 5816.335731] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=183667879936 size=4096 flags=188881
[ 5816.342210] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591106048 size=49152 flags=40080c80
With a LUKS2 header, the problem occurs when writing to the device constantly at maximum speed (i.e. copying a large set of data). Funnily, it is less likely to occur when writing at lower speed, but eventually it happens there too.

I really have no idea whether this is a problem with zfs or with LUKS. Any ideas?
> With a LUKS2 header, the problem occurs when writing to the device constantly at maximum speed (i.e. copying a large set of data). Funnily, it is less likely to occur when writing at lower speed, but eventually it happens there too.
How full is your pool?
I hit this issue a good while back and thought it might be my SSD (WD SN850) randomly disconnecting, since I could find some reports about that on the WD forums. It did seem related to how full the pool/drive was, though: with enough free space I couldn't trigger it. Switching to another (and larger) drive solved the issue, until now that I've filled it too.

The best way for me to trigger it hasn't been writing at full (sequential) speed, but decompressing and compiling chromium, which is ~906000 files over 19GB.

Currently, on a 3200 GiB partition, the breakpoint where it starts occurring seems to be somewhere around 90% allocated, and it came to mind that maybe this is where zfs changes allocation method? Slightly over 200 GB is reserved for zvols though, so the free space for filesystems is ~100 GB. Unclear if this affects it.
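(For reference, allocation and fragmentation can be checked with `zpool list`; a sketch with an illustrative pool name:)

```sh
# CAP = percent of pool space allocated, FRAG = free-space fragmentation
zpool list -o name,cap,frag,alloc,free rpool
```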
# blockdev --report /dev/mapper/root
RO RA SSZ BSZ StartSec Size Device
rw 256 4096 4096 0 3435957059584 /dev/mapper/root
# blockdev --report /dev/nvme0n1p3
RO RA SSZ BSZ StartSec Size Device
rw 256 4096 4096 270534656 3435973836800 /dev/nvme0n1p3
The drive itself is formatted to 4K sectors, so misalignment shouldn't be possible (?):
# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 3,73 TiB, 4096805658624 bytes, 1000196694 sectors
Disk model: KINGSTON SKC3000D4096G
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 081BA126-DA19-42BD-AEAB-CF1A6583CA0B
Device Start End Sectors Size Type
/dev/nvme0n1p1 256 262399 262144 1G EFI System
/dev/nvme0n1p2 262400 33816831 33554432 128G Linux swap
/dev/nvme0n1p3 33816832 872677631 838860800 3,1T Linux filesystem
# cryptsetup luksDump /dev/nvme0n1p3
LUKS header information
Version: 2
Epoch: 3
Metadata area: 16384 [bytes]
Keyslots area: 16744448 [bytes]
> How full is your pool?
Almost empty when triggering the bug. It's a new system (brand-new disks). I admit it's a bit unusual: Gentoo arm64 with the Asahi kernel on a Mac Mini M1 ;) But otherwise it is rock stable, zfs-2.1.13.

Also, I don't see any problems with the drives in dmesg.

Currently, with a LUKS1 header, I'm seeing 4 TB of usage (a couple of zvols and 10 697 081 files in the fs) with no zio errors whatsoever.
Hitting the recent bug #15533, I saw the same zio error=5 type=2 as in this bug, so I started searching for this issue to check similarities. While searching I happened upon this pull request: #15414
Interestingly, the triggers identified there are:
- a write-heavy load, such that aggregation past 512K happened quite frequently
- a pool approaching 90% fragmentation, such that gang blocks were being produced (this is significant only insofar as gang blocks are backed by small memory allocations, which exacerbate the problem)
which seem to match the triggers identified in this issue too. The PR was closed in favour of an upcoming new take, though.
@robn I can't find any new PR, but I'd be interested in testing whether what you've got fixes this issue too.
I should be posting a significant rework of `vdev_disk` in a few days (it was written for a client, and is just finishing testing). That at least will fix the problem described in #15414, and I suspect this one too. But I don't really recommend waiting when a revert will sort it out for now; it's not a given that my patch will be right, or be accepted.
It's not totally clear to me that this is a result of misaligned aggregation, but you might try drastically lowering `zfs_vdev_aggregation_limit` (make it, say, `131072`; if that "fixes" it, try raising it; if not, try lowering it further). This will reduce throughput considerably, but things might still work.
(more in https://github.com/openzfs/zfs/issues/15533#issuecomment-1825326626).
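(A sketch of applying that at runtime; the sysfs path is the usual location for zfs module parameters:)

```sh
# Lower the aggregation limit to 128 KiB; takes effect immediately,
# but does not persist across reboots.
echo 131072 | sudo tee /sys/module/zfs/parameters/zfs_vdev_aggregation_limit
```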
PR with possible fix from robn: https://github.com/openzfs/zfs/pull/15588 (linking for my reference)
FYI, 2.2.4 just shipped, with #15588 and followup patches included. If you are still having this problem, you might try setting `zfs_vdev_disk_classic=0` in your `zfs` module parameters and seeing if that helps. If you do try this, please report back with your results, as our hope is to make this the default in the future.