Input/output error in recent snapshot; three times on same host now
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Ubuntu/Linux |
| Distribution Version | Jammy |
| Kernel Version | 5.15.0-86-generic |
| Architecture | x86_64 |
| OpenZFS Version | 2.1.5-1ubuntu6~22.04.1 + ossomacfix (see lp#1987190#14) |
Problem observation
Input/output error
(Times below without a Z are in CET, i.e. 06:00 is 05:00Z.)
I have snapshots:
# ls -l /var/lib/mysql/redacted/.zfs/snapshot/
total 8
drwxr-xr-x 7 mysql mysql 21 okt 17 10:26 planb-20231101T0200Z
drwxr-xr-x 7 mysql mysql 21 okt 17 10:26 planb-20231101T0300Z
drwxr-xr-x 7 mysql mysql 21 okt 17 10:26 planb-20231101T0400Z
drwxr-xr-x 7 mysql mysql 21 okt 17 10:26 planb-20231101T0500Z
drwxr-xr-x 7 mysql mysql 21 okt 17 10:26 planb-20231101T0700Z
But I cannot access the (now) second-to-last one:
# ls -l /var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z
ls: cannot access '/var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z/aria_log.00000001': Input/output error
ls: cannot access '/var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z/aria_log_control': Input/output error
total 10129
-????????? ? ? ? ? ? aria_log.00000001
-????????? ? ? ? ? ? aria_log_control
-rw-rw---- 1 mysql mysql 0 nov 1 02:55 ddl.log
-rw-rw---- 1 mysql mysql 16384 okt 17 10:26 ddl_recovery.log
-rw-rw---- 1 mysql mysql 10737419544 nov 1 02:55 galera.cache
-rw-rw---- 1 mysql mysql 8 okt 17 10:26 galera.pid
...
Some files have stat issues (see the ? above). None of the files can be read except for the empty file:
# cat /var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z/mysql_upgrade_info
cat: /var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z/mysql_upgrade_info: Input/output error
# cat /var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z/galera.pid
cat: /var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z/galera.pid: Input/output error
# cat /var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z/ddl.log
(Reading the 0-byte file went fine.)
Checking zpool status:
# zpool status
...
errors: 941 data errors, use '-v' for a list
Most of those 941 data errors were left over from the battle I had with the previous snapshot that went bad. I deleted that bad snapshot yesterday, in the hope that it was just a fluke. That snapshot concerned a different dataset, by the way: tank/mysql/redacted2/data
# zpool status -v
...
<0x11ba1>:<0x3ff> <-- the other (redacted2) snapshot
<0x11ba1>:<0x6ff>
tank/mysql/redacted/data@planb-20231101T0500Z:<0x0>
tank/mysql/redacted/data@planb-20231101T0500Z:<0x201>
tank/mysql/redacted/data@planb-20231101T0500Z:<0x6>
tank/mysql/redacted/data@planb-20231101T0500Z:<0x20>
tank/mysql/redacted/data@planb-20231101T0500Z:<0x21>
tank/mysql/redacted/data@planb-20231101T0500Z:<0x181>
How to reproduce the problem
I don't know yet. This is the first such issue I have encountered:
- there is encryption enabled:
# zfs get all tank/mysql/redacted/data | grep encr
tank/mysql/redacted/data encryption aes-256-gcm -
tank/mysql/redacted/data encryptionroot tank/mysql -
- we do hourly-ish snapshots and sync those to another location that does not hold the encryption keys;
- creating a new snapshot succeeds, but the incremental sync fails; that's why I have planb-20231101T0500Z and planb-20231101T0700Z; the 0700Z one appears to be readable;
- (not sure yet why the attempt at 0600Z was not done, but that's irrelevant for the discussion);
- right now, auto-backups have stopped. Otherwise I would have a bunch more snapshots after @planb-20231101T0500Z, but none that could be synced, because every sync starting from 0400Z tries to include 0500Z as well, and that fails;
- the other broken snapshot was planb-20231030T0500Z, so also at 05:00, but an even earlier failure was in tank/mysql/redacted3/data@planb-20231027T1800Z;
- these failures started appearing after this machine was upgraded from Focal (zfs 0.8.3) to Jammy (zfs 2.1.5) on Oct 11.
Include any warning/errors/backtraces from the system logs
dmesg
Empty. Nothing after boot (Oct 17).
/proc/spl/kstat/zfs/dbgmsg
I got this from the debug log, which isn't much:
# cat /proc/spl/kstat/zfs/dbgmsg | wtimedecode
timestamp message
2023-11-01T09:29:13+0100 (1698827353) zfs_ctldir.c:1106:zfsctl_snapshot_mount(): Unable to automount /var/lib/mysql/redacted2/.zfs/snapshot/planb-20231030T0500Z error=256
This is strange, because by 2023-11-01 that (redacted2) snapshot should've been gone already. I am now debugging the redacted one.
After clearing the log (echo 0 > /proc/spl/kstat/zfs/dbgmsg) and enabling (random) extra stuff (echo 1 > /sys/module/zfs/parameters/spa_load_print_vdev_tree) I now have this:
# cat /proc/spl/kstat/zfs/dbgmsg | wtimedecode
timestamp message
2023-11-01T09:40:28+0100 (1698828028) metaslab.c:3650:metaslab_condense(): condensing: txg 14928129, msp[128] ffff9bbc683ba000, vdev id 0, spa tank, smp size 1619424, segments 100180, forcing condense=FALSE
2023-11-01T09:49:05+0100 (1698828545) metaslab.c:3650:metaslab_condense(): condensing: txg 14928230, msp[195] ffff9bbc685e4000, vdev id 0, spa tank, smp size 727776, segments 45449, forcing condense=FALSE
2023-11-01T09:49:10+0100 (1698828550) metaslab.c:3650:metaslab_condense(): condensing: txg 14928231, msp[172] ffff9bbc684f8000, vdev id 0, spa tank, smp size 993776, segments 61730, forcing condense=FALSE
2023-11-01T10:04:27+0100 (1698829467) metaslab.c:3650:metaslab_condense(): condensing: txg 14928410, msp[63] ffff9bbc6819c000, vdev id 0, spa tank, smp size 1430544, segments 87916, forcing condense=FALSE
2023-11-01T10:11:21+0100 (1698829881) metaslab.c:3650:metaslab_condense(): condensing: txg 14928491, msp[172] ffff9bbc684f8000, vdev id 0, spa tank, smp size 951408, segments 59411, forcing condense=FALSE
This means nothing to me, and could just be debug info from the extra spa_load_print_vdev_tree I enabled.
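For the record, getting more than the default chatter out of the debug log next time would look roughly like this (a sketch; the dprintf messages only exist on a module built with debugging, and zfs_flags is a bitmask whose lowest bit enables them):

```sh
# Clear the ring buffer, enable the compiled-in dprintf messages, reproduce
# the problem, then read back whatever accumulated.
echo 0 > /proc/spl/kstat/zfs/dbgmsg
echo 1 > /sys/module/zfs/parameters/zfs_flags
cat /proc/spl/kstat/zfs/dbgmsg
```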
zed
# journalctl -S '00:00' -t zed | sed -e 's/^nov 01 //;s/ [^ ]* / /'
06:01:00 zed[2572813]: eid=31087 class=authentication pool='tank' bookmark=107766:0:0:42
08:00:53 zed[3154622]: eid=31131 class=data pool='tank' priority=2 err=5 flags=0x180 bookmark=107766:0:0:42
08:00:53 zed[3154626]: eid=31133 class=data pool='tank' priority=2 err=5 flags=0x80 bookmark=107766:0:0:42
08:00:53 zed[3154625]: eid=31132 class=authentication pool='tank' priority=2 err=5 flags=0x80 bookmark=107766:0:0:42
09:24:13 zed[3514527]: eid=31137 class=authentication pool='tank' priority=0 err=5 flags=0x80 bookmark=107766:32:0:0
09:24:13 zed[3514529]: eid=31138 class=data pool='tank' priority=0 err=5 flags=0x80 bookmark=107766:32:0:0
09:29:06 zed[3534332]: eid=31139 class=authentication pool='tank' priority=0 err=5 flags=0x80 bookmark=107766:32:0:0
09:29:06 zed[3534334]: eid=31140 class=data pool='tank' priority=0 err=5 flags=0x80 bookmark=107766:32:0:0
09:29:06 zed[3534337]: eid=31141 class=authentication pool='tank' priority=0 err=5 flags=0x80 bookmark=107766:32:0:0
09:29:06 zed[3534338]: eid=31142 class=data pool='tank' priority=0 err=5 flags=0x80 bookmark=107766:32:0:0
09:30:52 zed[3541004]: eid=31143 class=authentication pool='tank' bookmark=107766:0:0:42
09:30:52 zed[3541005]: eid=31144 class=authentication pool='tank' bookmark=107766:0:0:42
09:32:06 zed[3546489]: eid=31145 class=authentication pool='tank' bookmark=107766:0:0:42
09:32:06 zed[3546490]: eid=31146 class=authentication pool='tank' bookmark=107766:0:0:42
09:32:08 zed[3546592]: eid=31147 class=authentication pool='tank' bookmark=107766:0:0:42
09:32:08 zed[3546593]: eid=31148 class=authentication pool='tank' bookmark=107766:0:0:42
09:42:15 zed[3587704]: eid=31149 class=authentication pool='tank' bookmark=107766:0:0:42
09:42:15 zed[3587705]: eid=31150 class=authentication pool='tank' bookmark=107766:0:0:42
09:42:28 zed[3588614]: eid=31151 class=authentication pool='tank' bookmark=107766:0:0:42
09:42:28 zed[3588615]: eid=31152 class=authentication pool='tank' bookmark=107766:0:0:42
09:42:33 zed[3589470]: eid=31153 class=authentication pool='tank' bookmark=107766:0:0:42
09:42:33 zed[3589471]: eid=31154 class=authentication pool='tank' bookmark=107766:0:0:42
09:42:53 zed[3590398]: eid=31155 class=authentication pool='tank' priority=0 err=5 flags=0x880 bookmark=107766:513:0:0
09:42:53 zed[3590400]: eid=31156 class=data pool='tank' priority=0 err=5 flags=0x880 bookmark=107766:513:0:0
09:49:38 zed[3615957]: eid=31157 class=authentication pool='tank' priority=0 err=5 flags=0x880 bookmark=107766:513:0:0
09:49:38 zed[3615959]: eid=31158 class=data pool='tank' priority=0 err=5 flags=0x880 bookmark=107766:513:0:0
09:50:44 zed[3620982]: eid=31159 class=authentication pool='tank' bookmark=107766:0:0:42
09:50:44 zed[3620983]: eid=31160 class=authentication pool='tank' bookmark=107766:0:0:42
09:50:54 zed[3621521]: eid=31161 class=authentication pool='tank' bookmark=107766:0:0:42
09:50:54 zed[3621522]: eid=31162 class=authentication pool='tank' bookmark=107766:0:0:42
09:50:55 zed[3621526]: eid=31164 class=authentication pool='tank' bookmark=107766:0:0:42
09:50:55 zed[3621527]: eid=31163 class=authentication pool='tank' bookmark=107766:0:0:42
09:51:09 zed[3622291]: eid=31165 class=authentication pool='tank' priority=0 err=5 flags=0x800880 bookmark=107766:6:0:0
"When"
# zfs list -oname,creation -t snapshot -r tank/mysql/redacted/data
NAME CREATION
tank/mysql/redacted/data@planb-20231101T0200Z wo nov 1 3:01 2023
tank/mysql/redacted/data@planb-20231101T0300Z wo nov 1 4:00 2023
tank/mysql/redacted/data@planb-20231101T0400Z wo nov 1 5:00 2023
tank/mysql/redacted/data@planb-20231101T0500Z wo nov 1 6:01 2023 <-- 05:01Z
tank/mysql/redacted/data@planb-20231101T0700Z wo nov 1 8:00 2023 <-- 07:00Z
First failure (at 05:01:04Z), seen on the receiving end:
warning: cannot send 'tank/mysql/redacted/data@planb-20231101T0500Z': Input/output error
cannot receive incremental stream: checksum mismatch
Second failure (at 07:00:56Z), seen on the receiving end:
warning: cannot send 'tank/mysql/redacted/data@planb-20231101T0500Z': Input/output error
cannot receive incremental stream: invalid backup stream
The other side?
The other side has no snapshots after tank/mysql/redacted/data@planb-20231101T0400Z. The snapshots that do exist are likely fine (including 0400Z).
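For completeness, confirming the last common snapshot on both ends can be done roughly like this (a sketch; backuphost and the target dataset name are illustrative):

```sh
# List the newest snapshots on each side; the newest name present in both
# lists is the snapshot an incremental send can resume from.
zfs list -H -o name -t snapshot -r tank/mysql/redacted/data | tail -n 3
ssh backuphost zfs list -H -o name -t snapshot -r tank/backup/redacted/data | tail -n 3
```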
How to fix?
As I mentioned, yesterday I had the same issue with a bigger snapshot (redacted2). There I destroyed the snapshot and resumed from a snapshot that I still had on both sides (the 0400Z one).
I started a scrub then too, which finished before the new issues started:
scan: scrub repaired 0B in 01:51:52 with 0 errors on Tue Oct 31 19:17:19 2023
Interestingly, the problems reported in zpool status -v were still there.
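For reference, the procedure I used for the redacted2 incident boils down to this (a sketch):

```sh
# Destroy the unreadable snapshot, then scrub; note that stale entries can
# linger in the error list until a later scrub completes.
zfs destroy tank/mysql/redacted2/data@planb-20231030T0500Z
zpool scrub tank
zpool status -v tank
```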
Questions
What other debugging can I do/enable/run/check?
I can probably leave this in this state for a short while, hopefully not too long, while I break out the debugging tools.
Can you help with some zfs/zdb debugging foo so we can get to the bottom of this?
Thanks! Walter Doekes, OSSO B.V.
Per your suggestion in https://github.com/openzfs/zfs/issues/14911#issuecomment-1570013030 I compiled zdb from git and ran it here.
If I look at the diff between the zdb -K <key> -dbdbdbdbdbdb of the 0400Z and 0500Z (broken) snapshot, nothing obvious stands out:
Same with the diff between 0500Z (broken) and 0700Z:
Nor with a diff between two good snapshots 0400Z and 0700Z:
And the difference between the dataset and the snapshot (sans inherit=lz4 -> inherit=inherit):
# diff -pu <(sed -e 's/inherit=lz4/inherit=inherit/' tank-mysql-redacted-data.zdb-db6-K.log) \
tank-mysql-redacted-data-AT-planb-20231101T0500Z.zdb-db6-K.log
zdb-K-db6-dataset-vs-0500.diff.gz
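For reference, the dumps being diffed above can be produced roughly like this (a sketch; the $KEY handling and output names are illustrative, and -K requires the zdb built from git):

```sh
# Dump both snapshots at the same verbosity and compare.
zdb -K "$KEY" -dbdbdbdbdbdb tank/mysql/redacted/data@planb-20231101T0400Z > 0400Z.log
zdb -K "$KEY" -dbdbdbdbdbdb tank/mysql/redacted/data@planb-20231101T0500Z > 0500Z.log
diff -u 0400Z.log 0500Z.log
```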
I'd provide the full files, if you think there is useful info in there. I just have to know I'm not leaking anything, so I'd have to do some more redacting.
> I'd try running a version that's current, like 2.1.13 or 2.2.0, and seeing if it still produces EIO, rather than hoping whatever Ubuntu is shipping isn't on fire.
Okay. I've spent some time getting proper deb packages which I can manage in an apt repository. (This meant getting reproducible builds right, because repository managers mandate that two identically named files must be identical.)
https://github.com/ossobv/zfs-kmod-build-deb
This appears to be sufficient to get things up and running.
- I loaded the new module and rebooted.
- I ran a scrub: only the errors of the still-existing snapshot remained:
  errors: 943 data errors, use '-v' for a list -> errors: 6 data errors, use '-v' for a list
- I read the files: I could now read all the files in the (previously corrupted) snapshot. Error refs in zpool status -v changed from:
  errors: Permanent errors have been detected in the following files:
          tank/mysql/redacted/data@planb-20231101T0500Z:<0x0>
          tank/mysql/redacted/data@planb-20231101T0500Z:/mysql_upgrade_info
          tank/mysql/redacted/data@planb-20231101T0500Z:/ddl_recovery.log
          tank/mysql/redacted/data@planb-20231101T0500Z:<0x20>
          tank/mysql/redacted/data@planb-20231101T0500Z:<0x21>
          tank/mysql/redacted/data@planb-20231101T0500Z:/galera.pid
  to:
  errors: Permanent errors have been detected in the following files:
          tank/mysql/redacted/data@planb-20231101T0500Z:<0x0>
          /var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z/mysql_upgrade_info
          /var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z/ddl_recovery.log
          tank/mysql/redacted/data@planb-20231101T0500Z:<0x20>
          tank/mysql/redacted/data@planb-20231101T0500Z:<0x21>
          /var/lib/mysql/redacted/.zfs/snapshot/planb-20231101T0500Z/galera.pid
- I ran the zfs send: this now worked. (I did not check the received data, but got no complaints from the remote zfs recv.)
- I ran a second scrub: now the last of the data errors disappeared:
  errors: No known data errors
That makes me a happy camper.
With regard to self-compiled builds, there are a few items that would be nice to get fixed here:
- alien now strips custom (extra) version info when it should not (--bump0 -> --keep-version)
- rpmbuild creates builds with differing NT_GNU_BUILD_ID because of a lack of -ffile-prefix-map= and the use of "random" mktemp paths
- rpmbuild uses the build hostname and build timestamps
See https://github.com/ossobv/zfs-kmod-build-deb/blob/3ddbe5fb2f09fda9724b232e6f90903d524abed8/Dockerfile#L60-L81 and https://github.com/ossobv/zfs-kmod-build-deb/blob/3ddbe5fb2f09fda9724b232e6f90903d524abed8/dpkg-repackage-reproducible .
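The standard knobs for the last two items look roughly like this (a sketch; the actual wiring in the Dockerfile above differs):

```sh
# Pin embedded timestamps to the last commit, and strip the build directory
# from the objects so NT_GNU_BUILD_ID stays stable across rebuilds.
export SOURCE_DATE_EPOCH=$(git log -1 --format=%ct)
export KCFLAGS="-ffile-prefix-map=$PWD=."
```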
Best would probably be to create proper Debian packaging infra. Then I could also ensure that e.g. libzpool5 conflicts with libzpool5linux, which is likely going to be my next hurdle.
Thinking out loud...
In any case: this particular issue is fixed with 2.1.13. Thanks!
Walter Doekes, OSSO B.V.
Bad news.
The error reappeared in snapshot 2023-11-13 22:00Z. Same system, now with the 2.1.13 ZFS module.
Versions
# zfs version
zfs-2.1.5-1ubuntu6~22.04.1
zfs-kmod-2.1.13-1osso1
# uptime
09:16:14 up 10 days, 19:29, 2 users, load average: 1.31, 1.42, 1.53
# cat /sys/module/zfs/version
2.1.13-1osso1
We're still running the Ubuntu provided userspace tools (2.1.5), but we run the 2.1.13 kernel module kmod-zfs-5.15.0-88-generic_2.1.13-1osso1_amd64.deb:
https://github.com/ossobv/zfs-kmod-build-deb/releases/tag/zfs-2.1.13-1osso1
(built by ./buildone.sh jammy 5.15.0-88-generic osso1)
Errors
Error seen in zpool status:
# zpool status -v
pool: tank
state: ONLINE
...
scan: scrub repaired 0B in 01:01:44 with 0 errors on Fri Nov 3 13:27:42 2023
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
nvme-XXX ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
tank/mysql/redacted3/data@planb-20231113T2200Z:<0x0>
tank/mysql/redacted3/data@planb-20231113T2200Z:<0x20>
# zfs list -oname,creation -t snapshot -r tank/mysql/redacted3
NAME CREATION
tank/mysql/redacted3/data@planb-20231113T1900Z ma nov 13 20:00 2023
tank/mysql/redacted3/data@planb-20231113T2000Z ma nov 13 21:00 2023
tank/mysql/redacted3/data@planb-20231113T2100Z ma nov 13 22:00 2023
tank/mysql/redacted3/data@planb-20231113T2200Z ma nov 13 23:00 2023
tank/mysql/redacted3/data@planb-20231114T0124Z di nov 14 2:24 2023
tank/mysql/redacted3/data@planb-20231114T0300Z di nov 14 4:00 2023
tank/mysql/redacted3/data@planb-20231114T0500Z di nov 14 6:00 2023
tank/mysql/redacted3/data@planb-20231114T0700Z di nov 14 8:00 2023
tank/mysql/redacted3/log@planb-20231113T1900Z ma nov 13 20:01 2023
tank/mysql/redacted3/log@planb-20231113T2000Z ma nov 13 21:00 2023
tank/mysql/redacted3/log@planb-20231113T2100Z ma nov 13 22:00 2023
Followup
I'll try and replace the userland tools with the 2.1.13 builds and see if anything changes.
@rincebrain: if you have other (zdb?) tips for debugging in the mean time, I'm interested.
But the files were readable in the snapshot -- at least the ones I tried.
Error state:
errors: Permanent errors have been detected in the following files:
tank/mysql/redacted3/data@planb-20231113T2200Z:<0x0>
tank/mysql/redacted3/data@planb-20231113T2200Z:<0x20>
Removing the snapshot:
errors: Permanent errors have been detected in the following files:
<0x8416>:<0x0>
<0x8416>:<0x20>
Doing a scrub:
errors: No known data errors
... running 2.1.13 userland tools now. We'll see if the error reappears.
And now, with userland at 2.1.13 and module 2.1.13, after a reboot, there is a new data error.
# zfs version
zfs-2.1.13-1osso1
zfs-kmod-2.1.13-1osso1
# uname -r
5.15.0-88-generic
errors: Permanent errors have been detected in the following files:
tank/mysql/redacted/data@planb-20231126T1900Z:<0x0>
Interestingly, just after checking with zpool status -v (did the check cause it?), a second error appeared:
errors: Permanent errors have been detected in the following files:
tank/mysql/redacted/data@planb-20231126T1900Z:<0x0>
tank/mysql/redacted/data@planb-20231126T1900Z:<0x20>
dmesg has nothing (no disk/IO errors or anything). The storage seems fine too:
# nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 33 C (306 Kelvin)
available_spare : 99%
available_spare_threshold : 10%
percentage_used : 8%
endurance group critical warning summary: 0
data_units_read : 2.743.533.883
data_units_written : 1.743.454.820
host_read_commands : 251.349.221.948
host_write_commands : 46.896.760.515
controller_busy_time : 18.088
power_cycles : 47
power_on_hours : 26.930
unsafe_shutdowns : 43
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
(I do notice that ashift is 9 and the sector size is 512, which is not optimal for this device.)
(edit: added uname -r)
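For reference, checking that ashift/sector-size mismatch can be done like this (a sketch; device name per this report):

```sh
# Compare the pool's ashift (9 here, i.e. 512-byte allocations) with the
# LBA formats the NVMe namespace actually supports.
zpool get ashift tank
nvme id-ns /dev/nvme0n1 | grep -i lbaf
```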
(SSDs usually hide the performance impact of the actual IO size well enough that you should weigh the observed performance against the space efficiency that results, IMO.)
So, I would guess these are decryption errors, not actual read/write/checksum errors, since they're not showing up there.
My guess would be that there's some additional stupid caveat around the MAC handling that the patch you were originally applying to Ubuntu's tree works around - the catch being that said patch only papers over the problem when it encounters it; a different change is what stops it from screwing up in the first place. (cc #11294 #13709 #14161 63a26454ba43ca3af5020f25d2e59c7ee15bf13d e257bd481bb77181e475bf35292c1a4b0cb2c57a for prior art.) Of course, those are all for "I cannot unlock the dataset at all because of that error", not for files within it erroring...
If it's giving stat errors, I would guess that there could be a similar issue with not just the objset accounting data? I'm not really sure what we can do about that, though, because the reason we can just ignore it in the objset accounting data case is that we can just regenerate it.
If this is a backup target, what's the source running, in terms of ZFS version? What's the automation that's doing the send/recv?
2.1.x has useful counters, visible via grep . /proc/spl/kstat/kcf/*provider*, for when decryption fails - they were dropped in 2.2.x for some reason I don't really understand, but for now at least, you can use them to confirm that these are decryption errors.
I would suggest that you look at the zpool events output for those errors, and see what they have to say - in particular, you might also find it useful to look at the (git, so it can look at encrypted objects) zdb output for objects 0 and 0x20 (so 32) on that dataset that's erroring, and see if it reports anything interesting. (0 could actually mean 0 or somewhere in the metadata without more precision; 0x20 appears to be the system attribute node on most of the more recent datasets I've made and a normal directory object on some others, so it'd be good to find out what it is here)
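Concretely, those two suggestions would look something like this (a sketch; $KEY as used elsewhere in this thread, and zdb takes object numbers as trailing arguments, so 0x20 becomes 32):

```sh
# Inspect the recent error events, then dump objects 0 and 32 of the
# erroring snapshot with the git zdb (which can decrypt with -K).
zpool events -v tank
zdb -K "$KEY" -dddddd tank/mysql/redacted/data@planb-20231126T1900Z 0 32
```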
It also, thinking about it, wouldn't completely astonish me if those decryption errors were transient and went away after it decided to regenerate the accounting information - but that's just wild speculation. So if you scrub and they go away and don't come back, great; that's a nice cleanup that should probably be applied to how this is reported, and it's better than getting "???" for files.
If you're still seeing the ??? for some files, though, then we should probably dig even more into it.
Thanks for the replies :)
> If this is a backup target, what's the source running, in terms of ZFS version? What's the automation that's doing the send/recv?
This is a backup source. The target is running 2.1.5, but we're not actually getting any data there (when it fails), so I don't think that can be an issue. The automation consists of: zfs snapshot, zfs set planb:owner=..., zfs send -I. The data is sent unencrypted (not --raw, it's encrypted with a different key on the target).
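Reduced to a sketch, the automation does something like this (variable names, the planb:owner value and the target host/dataset are illustrative):

```sh
SRC=tank/mysql/redacted/data
PREV=planb-20231101T0400Z
NOW=$(date -u +planb-%Y%m%dT%H00Z)

zfs snapshot "$SRC@$NOW"
zfs set planb:owner=planb "$SRC@$NOW"
# Plain (non-raw) incremental send; the target re-encrypts with its own key.
zfs send -I "$SRC@$PREV" "$SRC@$NOW" | ssh backuphost zfs recv tank/backup/redacted/data
```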
> so if you scrub and they go away, and it doesn't come back, great
So far this only went away by removing the snapshot and doing a scrub afterwards.
> If you're still seeing the ??? for some files, though, then we should probably dig even more into it.
# cd /var/lib/mysql/redacted/.zfs/snapshot/planb-20231126T1900Z/
# ls -a
. ..
# cd ..
# ls -l
total 8
drwxr-xr-x 7 mysql mysql 21 nov 14 13:56 planb-20231126T0800Z
drwxr-xr-x 7 mysql mysql 21 nov 14 13:56 planb-20231126T1700Z
drwxr-xr-x 7 mysql mysql 21 nov 14 13:56 planb-20231126T1800Z
drwxr-xr-x 7 mysql mysql 21 nov 14 13:56 planb-20231126T1900Z
drwxr-xr-x 7 mysql mysql 21 nov 14 13:56 planb-20231126T2100Z
# ls planb-20231126T1800Z
aria_log.00000001 ddl.log galera.cache
...
# ls planb-20231126T2100Z
aria_log.00000001 ddl.log galera.cache
...
# ls planb-20231126T1900Z/
ls: cannot access 'planb-20231126T1900Z/aria_log.00000001': Input/output error
ls: cannot access 'planb-20231126T1900Z/aria_log_control': Input/output error
aria_log.00000001 ddl.log galera.cache
...
Weird. The files weren't there. And then they were, but were unreadable.
Still exactly 2 errors at this point.
# cat planb-20231126T1900Z/galera.cache >/dev/null
cat: planb-20231126T1900Z/galera.cache: Input/output error
But this caused a third error to appear. (And later a fourth.)
errors: Permanent errors have been detected in the following files:
tank/mysql/redacted/data@planb-20231126T1900Z:<0x0> # 1
tank/mysql/redacted/data@planb-20231126T1900Z:<0x20> # 2
tank/mysql/redacted/data@planb-20231126T1900Z:<0x21> # 4
tank/mysql/redacted/data@planb-20231126T1900Z:<0x382> # 3
Here are the counters:
# grep . /proc/spl/kstat/kcf/*provider*
2 1 0x01 4 1088 3737526914 1069440114633433
name type data
kcf_ops_total 4 5954800672
kcf_ops_passed 4 5954800635
kcf_ops_failed 4 37
kcf_ops_returned_busy 4 0
Not sure if these say they are decryption errors or not.
I'll see if I can get some zdb info tomorrow...
kcf_ops_failed is pronounced "I failed to decrypt something".
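Given that, a quick way to correlate a failing read with the counter (a sketch; paths from this report):

```sh
# Read the failed counter, trigger an EIO on a broken file, read it again;
# an increase confirms the EIO came from a decryption failure.
grep kcf_ops_failed /proc/spl/kstat/kcf/*provider*
cat /var/lib/mysql/redacted/.zfs/snapshot/planb-20231126T1900Z/galera.cache >/dev/null
grep kcf_ops_failed /proc/spl/kstat/kcf/*provider*
```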
# ls -l good-0x* bad-0x*
-rw-r--r-- 1 root root 14568 nov 27 13:22 bad-0x0
-rw-r--r-- 1 root root 1131 nov 27 13:22 bad-0x20
-rw-r--r-- 1 root root 1118 nov 27 13:22 bad-0x21
-rw-r--r-- 1 root root 553316 nov 27 13:22 bad-0x382
-rw-r--r-- 1 root root 14568 nov 27 13:22 good-0x0
-rw-r--r-- 1 root root 1131 nov 27 13:22 good-0x20
-rw-r--r-- 1 root root 1118 nov 27 13:22 good-0x21
-rw-r--r-- 1 root root 553316 nov 27 13:22 good-0x382
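These dumps were produced roughly like this (a sketch; 1800Z is the last good snapshot, 1900Z the failing one, and $((obj)) converts the hex IDs to the decimal object numbers zdb expects):

```sh
for obj in 0x0 0x20 0x21 0x382; do
    zdb -K "$KEY" -dbdbdbdbdbdb tank/mysql/redacted/data@planb-20231126T1800Z $((obj)) > good-$obj
    zdb -K "$KEY" -dbdbdbdbdbdb tank/mysql/redacted/data@planb-20231126T1900Z $((obj)) > bad-$obj
done
```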
Types of the aforementioned objects:
# grep Object bad-0x* -A1
bad-0x0: Object lvl iblk dblk dsize dnsize lsize %full type
bad-0x0- 0 6 128K 16K 910K 512 784K 15.05 DMU dnode (K=inherit) (Z=inherit=inherit)
--
bad-0x20: Object lvl iblk dblk dsize dnsize lsize %full type
bad-0x20- 32 1 128K 512 1K 512 512 100.00 SA master node (K=inherit) (Z=inherit=inherit)
--
bad-0x21: Object lvl iblk dblk dsize dnsize lsize %full type
bad-0x21- 33 1 128K 512 1K 512 512 100.00 ZFS delete queue (K=inherit) (Z=inherit=inherit)
--
bad-0x382: Object lvl iblk dblk dsize dnsize lsize %full type
bad-0x382- 898 4 128K 16K 10.5M 512 1.00G 2.83 ZFS plain file (K=inherit) (Z=inherit=inherit)
So, DMU dnode (if 0x0 really is 0x0), SA master node, ZFS delete queue and ZFS plain file respectively.
Very few differences between a good and a bad snapshot:
diff -U10
--- /dev/fd/63 2023-11-27 13:31:07.049708006 +0100
+++ /dev/fd/62 2023-11-27 13:31:07.049708006 +0100
@@ -1,23 +1,23 @@
Unlocked encryption root: tank/mysql
-Dataset tank/mysql/redacted/data@planb-20231126T1800Z [ZPL], ID 67719, cr_txg 15383768, 32.7M, 236 objects, rootbp DVA[0]=<0:2c5a0f94200:1000> DVA[1]=<0:1a4abdd8200:1000> [L0 DMU objset] fletcher4 uncompressed authenticated LE contiguous unique double size=1000L/1000P birth=15383310L/15383310P fill=236 cksum=000000272218b544:0000703a9a4e8e4a:00a9c470070c10f1:b2d3805de20d379e
+Dataset tank/mysql/redacted/data@planb-20231126T1900Z [ZPL], ID 67919, cr_txg 15384510, 32.7M, 236 objects, rootbp DVA[0]=<0:2c5a1906800:1000> DVA[1]=<0:1a4ac5f4200:1000> [L0 DMU objset] fletcher4 uncompressed authenticated LE contiguous unique double size=1000L/1000P birth=15384359L/15384359P fill=236 cksum=0000002477cce424:000067457efc5aaa:009aa63ca1cb117d:a1bc5cd4e7913680
Object lvl iblk dblk dsize dnsize lsize %full type
0 6 128K 16K 910K 512 784K 15.05 DMU dnode (K=inherit) (Z=inherit=inherit)
dnode flags: USED_BYTES
dnode maxblkid: 48
Indirect blocks:
- 0 L5 DVA[0]=<0:37cf5cf2800:400> DVA[1]=<0:32ee7c13200:400> [L5 DMU dnode] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/400P birth=15383310L/15383310P fill=236 cksum=0021883aba978620:08746ad2de7f84c8:27c9e69634657ef1:6c3cb47841c67153
- 0 L4 DVA[0]=<0:37cf5cf1c00:400> DVA[1]=<0:32ee7c12600:400> [L4 DMU dnode] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/400P birth=15383310L/15383310P fill=236 cksum=001fcc0ff14c63d7:07e71bba680fe649:c7072401c80680a7:a3d26ac26c905e81
- 0 L3 DVA[0]=<0:c0f6ac5c00:400> DVA[1]=<0:18e43275200:400> [L3 DMU dnode] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/400P birth=15383310L/15383310P fill=236 cksum=00207528df08d374:081c2e5858bea482:51ecef02902b8a1b:ae0266f02e9cb0ba
- 0 L2 DVA[0]=<0:c0f6ab6400:400> DVA[1]=<0:18e43265a00:400> [L2 DMU dnode] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/400P birth=15383310L/15383310P fill=236 cksum=0021580e1bc7ddf0:0864a97316f80a0d:97de067f8b082f74:dca97f8ddf677295
- 0 L1 DVA[0]=<0:37cf5ce5000:c00> DVA[1]=<0:32ee7c07e00:c00> [L1 DMU dnode] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/c00P birth=15383310L/15383310P fill=236 cksum=01e34853fecd8757:66806faa16ffadb5:49e66e57db640d6b:15825b09a3f5705f
+ 0 L5 DVA[0]=<0:37cf9665800:400> DVA[1]=<0:32eeab5ce00:400> [L5 DMU dnode] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/400P birth=15384359L/15384359P fill=236 cksum=002228352ca04564:08a5d17947b85597:0ccabe7d819edce6:95d5c21fc13dc12b
+ 0 L4 DVA[0]=<0:37cf9665000:400> DVA[1]=<0:32eeab5c600:400> [L4 DMU dnode] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/400P birth=15384359L/15384359P fill=236 cksum=001ff784e827a90f:07f4e1a60cbb427e:2a6ec58bb7496359:6a43c2255d8d9d1e
+ 0 L3 DVA[0]=<0:c0fce17c00:400> DVA[1]=<0:18e4802c800:400> [L3 DMU dnode] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/400P birth=15384359L/15384359P fill=236 cksum=002040bb5053a2f7:080c690334c5282d:d01d415fb012f8c6:59cc9c9b32be5868
+ 0 L2 DVA[0]=<0:c0fce17400:400> DVA[1]=<0:18e4802c000:400> [L2 DMU dnode] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/400P birth=15384359L/15384359P fill=236 cksum=00219cc1f99fa0a4:0879382731e5452e:1f4e8e1f7d7b8f0a:350412897ab2a388
+ 0 L1 DVA[0]=<0:37cf9663400:c00> DVA[1]=<0:32eeab5aa00:c00> [L1 DMU dnode] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/c00P birth=15384359L/15384359P fill=236 cksum=01e653ccbdfb3279:67848cf16194940f:cb6ec92779c9e6b9:b848528869df635b
0 L0 DVA[0]=<0:bad7832000:4000> DVA[1]=<0:32db11e8200:4000> salt=eb7c19fbb31a2896 iv=2f87cb9113bab786:6af1f83e [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=15372102L/15372102P fill=2 cksum=10a4b3a6b26d3f82:2fb64beaa71a7a76:410ee197fce45d3d:980172ad786a4bc0
4000 L0 DVA[0]=<0:27801d56000:4000> DVA[1]=<0:32c58286800:4000> salt=eb7c19fbb31a2896 iv=d6ddfd1c9e77e68e:efa270c7 [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=15162731L/15162731P fill=5 cksum=1fd388db45559956:30146d5cd13e9d0a:2baf18f02849834d:32c82535f5cd7400
8000 L0 HOLE [L0 DMU dnode] size=4000L birth=3119973L
c000 L0 HOLE [L0 DMU dnode] size=4000L birth=3119973L
10000 L0 DVA[0]=<0:27801d62200:4000> DVA[1]=<0:32c5ec2a000:4000> salt=eb7c19fbb31a2896 iv=10071c96d02a316f:b4258f21 [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=15162731L/15162731P fill=5 cksum=33b9bb555a2db2ea:034046a801e4c5a7:4f17e8987bdbb629:1bbcff71af7ab2a0
14000 L0 HOLE [L0 DMU dnode] size=4000L birth=11808645L
20000 L0 HOLE [L0 DMU dnode] size=4000L birth=15162730L
24000 L0 HOLE [L0 DMU dnode] size=4000L birth=12297274L
30000 L0 DVA[0]=<0:27801d5a000:4000> DVA[1]=<0:32c5ec22000:4000> salt=eb7c19fbb31a2896 iv=410cb0cb1980d636:fe97b940 [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=15162731L/15162731P fill=4 cksum=2d3a255cd9a8482c:c86e8448ddab9058:8f7f99229932921f:64ab33cc0b5f17b1
40000 L0 DVA[0]=<0:27801d6a200:4000> DVA[1]=<0:32c5ec32000:4000> salt=eb7c19fbb31a2896 iv=d591da11453a1bc5:ad40fa7f [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=15162731L/15162731P fill=6 cksum=74d4b0e567899c06:7c45f30e8fdbb7ab:0ab0494041483a4c:fc522d3e91bed75c
@@ -36,21 +36,21 @@
78000 L0 DVA[0]=<0:2ba1bd8f200:4000> DVA[1]=<0:3100b807200:4000> salt=e167adc1ca149ab1 iv=1b9ffcf72e79e38d:d14e642b [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=11810236L/11810236P fill=32 cksum=f6d80dd85118aec9:ab26df98d1f55f1b:76172cf67f8d08c1:14fbab103a4e05f8
7c000 L0 DVA[0]=<0:2ba1bd87200:4000> DVA[1]=<0:3100b7ede00:4000> salt=e167adc1ca149ab1 iv=d8f2a0ee9b90c397:3b2ff543 [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=11810236L/11810236P fill=32 cksum=f9293bc5f89a9c23:ce265c050ec77d32:7f92a701d169fb1e:1e544b185533d117
80000 L0 DVA[0]=<0:27801d95200:4000> DVA[1]=<0:32c6bd61000:4000> salt=eb7c19fbb31a2896 iv=2b9b2281220542ad:972dff7d [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=15162731L/15162731P fill=3 cksum=14449b508874abb7:41ebe129aacd0215:63f756d34543a249:5c0febe7d2a6d8fb
84000 L0 DVA[0]=<0:2ba1bd4f200:4000> DVA[1]=<0:3100b792e00:4000> salt=e167adc1ca149ab1 iv=3ef76e969849bc50:f323e49f [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=11810236L/11810236P fill=2 cksum=000cfae50a150c70:02ff8a844c971b68:7c7264f5fc09adf3:ef72083eb658e3cb
88000 L0 DVA[0]=<0:2ba1bd53200:4000> DVA[1]=<0:3100b796e00:4000> salt=e167adc1ca149ab1 iv=33385b1d4634dac8:36f1548f [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=11810236L/11810236P fill=1 cksum=145bd1260c2aac57:74b453df36aaa4d5:02044e7780267598:89f54c5a2213f6b7
90000 L0 HOLE [L0 DMU dnode] size=4000L birth=15162392L
98000 L0 HOLE [L0 DMU dnode] size=4000L birth=11810236L
9c000 L0 DVA[0]=<0:bad783a000:4000> DVA[1]=<0:32db11f0200:4000> salt=eb7c19fbb31a2896 iv=43a4e510a8d6031:d3bba4cd [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=15372102L/15372102P fill=6 cksum=0cd48fbe419155fb:30d1a9a051604b3b:8500cc17f5f66f43:b437ee146c0f5f1d
a0000 L0 DVA[0]=<0:31807f26400:4000> DVA[1]=<0:32ff4e12c00:4000> salt=eb7c19fbb31a2896 iv=c827d4d598c3867c:cedf771c [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=15290191L/15290191P fill=11 cksum=6185877ea2418bd5:222c929592d9eb00:92f939cd2a1b806b:7b3e6f157b2d4a92
a4000 L0 DVA[0]=<0:3480012ba00:4000> DVA[1]=<0:3100fc7bc00:4000> salt=e167adc1ca149ab1 iv=3a22658a1f4621b6:e8d06d08 [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=11821747L/11821747P fill=3 cksum=006fadbed3c91583:36fd432725977eae:1c0cc6a22bf83b8f:3b5eb89fe45e8377
- a8000 L0 DVA[0]=<0:37c74ab0000:4000> DVA[1]=<0:32dfd1e8c00:4000> salt=eb7c19fbb31a2896 iv=a24d26b5cf82ef5d:9bf7287e [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=15383310L/15383310P fill=4 cksum=0a33d7910401ebe8:1e4554f1ae028764:da75461794c13e87:fc0b65372d24be4f
+ a8000 L0 DVA[0]=<0:37c76dca400:4000> DVA[1]=<0:32e00101000:4000> salt=eb7c19fbb31a2896 iv=6773bb9233db20ee:a0fb63fd [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=15384359L/15384359P fill=4 cksum=0a72e97ee13adb57:eb572499fb52536b:c70e94cfc2af90f1:cb4154dbe903389e
ac000 L0 DVA[0]=<0:2ba146a8200:4000> DVA[1]=<0:31003a98c00:4000> salt=a16006a52ff4bcb3 iv=e4fb70423efcf77d:cd7ef7cd [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=11808929L/11808929P fill=1 cksum=0fc32f2a802f5a89:aa52acd4626fdd16:f13d225c237c09da:b1bd66bad85a01cb
b0000 L0 DVA[0]=<0:2ba1bd43200:4000> DVA[1]=<0:3100b784c00:4000> salt=e167adc1ca149ab1 iv=faff6510978c06e3:8718e667 [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=11810236L/11810236P fill=8 cksum=7697269badb4ad19:4f1b154bae0f134c:3215c021e33654e3:e35095d168275459
b4000 L0 HOLE [L0 DMU dnode] size=4000L birth=12297059L
c0000 L0 DVA[0]=<0:2ba1bd83200:4000> DVA[1]=<0:3100b7e9e00:4000> salt=e167adc1ca149ab1 iv=fb6d2b4d0adaeb99:64cac0a8 [L0 DMU dnode] fletcher4 uncompressed encrypted LE contiguous unique double size=4000L/4000P birth=11810236L/11810236P fill=17 cksum=e0c180a28ac75f62:64da2fdc10b0ff74:72e85a0c499ed3b0:5cd03397a5666fe6
segment [0000000000000200, 0000000000000400) size 512
segment [0000000000000a00, 0000000000000c00) size 512
segment [0000000000004000, 0000000000004a00) size 2.50K
segment [0000000000011000, 0000000000011a00) size 2.50K
segment [0000000000030000, 0000000000030200) size 512
@@ -89,51 +89,51 @@
segment [00000000000a7800, 00000000000a7e00) size 1.50K
segment [00000000000a9800, 00000000000a9a00) size 512
segment [00000000000ab000, 00000000000ab200) size 512
segment [00000000000aba00, 00000000000abe00) size 1K
segment [00000000000acc00, 00000000000ace00) size 512
segment [00000000000b0400, 00000000000b1200) size 3.50K
segment [00000000000b2600, 00000000000b2800) size 512
segment [00000000000c0000, 00000000000c2200) size 8.50K
Unlocked encryption root: tank/mysql
-Dataset tank/mysql/redacted/data@planb-20231126T1800Z [ZPL], ID 67719, cr_txg 15383768, 32.7M, 236 objects, rootbp DVA[0]=<0:2c5a0f94200:1000> DVA[1]=<0:1a4abdd8200:1000> [L0 DMU objset] fletcher4 uncompressed authenticated LE contiguous unique double size=1000L/1000P birth=15383310L/15383310P fill=236 cksum=000000272218b544:0000703a9a4e8e4a:00a9c470070c10f1:b2d3805de20d379e
+Dataset tank/mysql/redacted/data@planb-20231126T1900Z [ZPL], ID 67919, cr_txg 15384510, 32.7M, 236 objects, rootbp DVA[0]=<0:2c5a1906800:1000> DVA[1]=<0:1a4ac5f4200:1000> [L0 DMU objset] fletcher4 uncompressed authenticated LE contiguous unique double size=1000L/1000P birth=15384359L/15384359P fill=236 cksum=0000002477cce424:000067457efc5aaa:009aa63ca1cb117d:a1bc5cd4e7913680
Object lvl iblk dblk dsize dnsize lsize %full type
32 1 128K 512 1K 512 512 100.00 SA master node (K=inherit) (Z=inherit=inherit)
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
microzap: 512 bytes, 2 entries
REGISTRY = 35
LAYOUTS = 36
Indirect blocks:
0 L0 DVA[0]=<0:100009a600:200> DVA[1]=<0:140009a600:200> salt=cea7d00cc51104d3 iv=7b5d5ca08bc9b078:b22797 [L0 SA master node] fletcher4 uncompressed encrypted LE contiguous unique double size=200L/200P birth=69L/69P fill=1 cksum=0002d8a2050b18ee:005e8c5c4267f84c:4d78b9600dfa1e49:5e4a20d97b02bb7d
segment [0000000000000000, 0000000000000200) size 512
Unlocked encryption root: tank/mysql
-Dataset tank/mysql/redacted/data@planb-20231126T1800Z [ZPL], ID 67719, cr_txg 15383768, 32.7M, 236 objects, rootbp DVA[0]=<0:2c5a0f94200:1000> DVA[1]=<0:1a4abdd8200:1000> [L0 DMU objset] fletcher4 uncompressed authenticated LE contiguous unique double size=1000L/1000P birth=15383310L/15383310P fill=236 cksum=000000272218b544:0000703a9a4e8e4a:00a9c470070c10f1:b2d3805de20d379e
+Dataset tank/mysql/redacted/data@planb-20231126T1900Z [ZPL], ID 67919, cr_txg 15384510, 32.7M, 236 objects, rootbp DVA[0]=<0:2c5a1906800:1000> DVA[1]=<0:1a4ac5f4200:1000> [L0 DMU objset] fletcher4 uncompressed authenticated LE contiguous unique double size=1000L/1000P birth=15384359L/15384359P fill=236 cksum=0000002477cce424:000067457efc5aaa:009aa63ca1cb117d:a1bc5cd4e7913680
Object lvl iblk dblk dsize dnsize lsize %full type
33 1 128K 512 1K 512 512 100.00 ZFS delete queue (K=inherit) (Z=inherit=inherit)
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
microzap: 512 bytes, 0 entries
Indirect blocks:
0 L0 DVA[0]=<0:14802d99200:200> DVA[1]=<0:1a400aa6400:200> salt=eb7c19fbb31a2896 iv=de39e0c5ca8190b8:42588b57 [L0 ZFS delete queue] fletcher4 uncompressed encrypted LE contiguous unique double size=200L/200P birth=15162731L/15162731P fill=1 cksum=0003130907e2f727:0063bc01482517f9:8d6472e84320d255:ddb20b4bdf312488
segment [0000000000000000, 0000000000000200) size 512
Unlocked encryption root: tank/mysql
-Dataset tank/mysql/redacted/data@planb-20231126T1800Z [ZPL], ID 67719, cr_txg 15383768, 32.7M, 236 objects, rootbp DVA[0]=<0:2c5a0f94200:1000> DVA[1]=<0:1a4abdd8200:1000> [L0 DMU objset] fletcher4 uncompressed authenticated LE contiguous unique double size=1000L/1000P birth=15383310L/15383310P fill=236 cksum=000000272218b544:0000703a9a4e8e4a:00a9c470070c10f1:b2d3805de20d379e
+Dataset tank/mysql/redacted/data@planb-20231126T1900Z [ZPL], ID 67919, cr_txg 15384510, 32.7M, 236 objects, rootbp DVA[0]=<0:2c5a1906800:1000> DVA[1]=<0:1a4ac5f4200:1000> [L0 DMU objset] fletcher4 uncompressed authenticated LE contiguous unique double size=1000L/1000P birth=15384359L/15384359P fill=236 cksum=0000002477cce424:000067457efc5aaa:009aa63ca1cb117d:a1bc5cd4e7913680
Object lvl iblk dblk dsize dnsize lsize %full type
898 4 128K 16K 10.5M 512 1.00G 2.83 ZFS plain file (K=inherit) (Z=inherit=inherit)
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 65536
path /galera.cache
uid 111
gid 120
atime Thu May 11 15:52:34 2023
I did find the following line in /proc/spl/kstat/zfs/dbgmsg:
2023-11-26T22:35:19+0100 (1701034519) zfs_ctldir.c:1139:zfsctl_snapshot_mount(): Unable to automount /var/lib/mysql/redacted/.zfs/snapshot/planb-20231126T1900Z error=256
That was when doing the ls -a yesterday, which turned up nothing.
The aria_log.00000001 and aria_log_control which showed ??? look like this in zdb:
Object lvl iblk dblk dsize dnsize lsize %full type
34 1 128K 1.50K 1K 512 1.50K 100.00 ZFS directory (K=inherit) (Z=inherit=inherit)
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
path /
uid 111
gid 120
atime Tue Jul 20 23:05:23 2021
mtime Tue Nov 14 13:56:56 2023
ctime Tue Nov 14 13:56:56 2023
crtime Tue Jul 20 23:05:23 2021
gen 69
mode 40755
size 21
parent 34
links 7
pflags 840800000144
microzap: 1536 bytes, 19 entries
ddl.log = 1273 (type: Regular File)
ibtmp1 = 512 (type: Regular File)
ib_buffer_pool = 896 (type: Regular File)
aria_log.00000001 = 1373 (type: Regular File)
redacted = 1038 (type: Directory)
galera.cache = 898 (type: Regular File)
performance_schema = 934 (type: Directory)
mysql = 1040 (type: Directory)
test = 1037 (type: Directory)
galera.pid = 514 (type: Regular File)
multi-master.info = 660 (type: Regular File)
sys = 936 (type: Directory)
gvwstate.dat = 384 (type: Regular File)
aria_log_control = 1374 (type: Regular File)
ddl_recovery.log = 640 (type: Regular File)
xtrabackup_info = 1382 (type: Regular File)
mysql_upgrade_info = 513 (type: Regular File)
grastate.dat = 5 (type: Regular File)
ibdata1 = 1294 (type: Regular File)
Indirect blocks:
0 L0 DVA[0]=<0:14802d99400:200> DVA[1]=<0:1a400aa6600:200> salt=eb7c19fbb31a2896 iv=91ea3e5bc650043f:146ab2be [L0 ZFS directory] fletcher4 lz4 encrypted LE contiguous unique double size=600L/200P birth=15162731L/15162731P fill=1 cksum=0002c75bc4814940:005986c858f339f9:d63621ecb3d69fed:557039941fa509af
segment [0000000000000000, 0000000000000600) size 1.50K
Object lvl iblk dblk dsize dnsize lsize %full type
1373 2 128K 16K 312K 512 1.45M 100.00 ZFS plain file (K=inherit) (Z=inherit=inherit)
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 92
path /aria_log.00000001
uid 111
gid 120
atime Thu May 11 15:55:12 2023
mtime Sun Nov 26 19:49:34 2023
ctime Sun Nov 26 19:49:34 2023
crtime Thu May 11 15:55:12 2023
gen 11808755
mode 100660
size 1515520
parent 34
links 1
pflags 840800000004
Indirect blocks:
0 L1 DVA[0]=<0:37d39d9f600:1a00> DVA[1]=<0:32fcf924400:1a00> [L1 ZFS plain file] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/1a00P birth=15384359L/15384359P fill=93 cksum=133591c4d8e18cd7:08819389ec10b94c:46255c22acab03b7:243bd9c3b057da1f
0 L0 DVA[0]=<0:2b9ee7f8800:600> salt=a16006a52ff4bcb3 iv=d60f57e4331171fb:39c57726 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/600P birth=11808755L/11808755P fill=1 cksum=00480df48ce2376f:1b36ffdb1bbc8173:24b90b64cde4f631:5f58e161356986f0
4000 L0 DVA[0]=<0:2ba25bfb800:800> salt=a16006a52ff4bcb3 iv=ba410c2136c80647:1ad3fe9c [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=00ae75b70020849a:580a60172c11fdc6:eb3fb2b50a981203:a740c033085fe422
8000 L0 DVA[0]=<0:2ba25bfc000:800> salt=a16006a52ff4bcb3 iv=43ac0f67162fc3c3:6df46420 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=00b16fcf35454983:59a76676ec6c67f4:326ef34f738dfd38:ab2ab61815713ebc
... truncated ...
Object lvl iblk dblk dsize dnsize lsize %full type
1374 1 128K 512 512 512 512 100.00 ZFS plain file (K=inherit) (Z=inherit=inherit)
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
path /aria_log_control
uid 111
gid 120
atime Thu May 11 15:55:12 2023
mtime Sun Nov 26 19:49:34 2023
ctime Sun Nov 26 19:49:34 2023
crtime Thu May 11 15:55:12 2023
gen 11808755
mode 100660
size 52
parent 34
links 1
pflags 840800000004
Indirect blocks:
0 L0 DVA[0]=<0:234e715dc00:200> salt=eb7c19fbb31a2896 iv=7d98a4386a558b60:fd266813 [L0 ZFS plain file] fletcher4 uncompressed encrypted LE contiguous unique single size=200L/200P birth=15384359L/15384359P fill=1 cksum=0002c2484e02f9e1:005a1eeabbf1c802:0e2f9fce11c32d51:0f35cf920387c23c
segment [0000000000000000, 0000000000000200) size 512
The diff of objects 34, 1373 and 1374 between the previous snapshot and the failing one:
diff -U10 good-34-1373-1374 bad-34-1373-1374
--- good-34-1373-1374 2023-11-27 13:42:57.408596765 +0100
+++ bad-34-1373-1374 2023-11-27 13:43:05.480674224 +0100
@@ -1,12 +1,12 @@
Unlocked encryption root: tank/mysql
-Dataset tank/mysql/redacted/data@planb-20231126T1800Z [ZPL], ID 67719, cr_txg 15383768, 32.7M, 236 objects, rootbp DVA[0]=<0:2c5a0f94200:1000> DVA[1]=<0:1a4abdd8200:1000> [L0 DMU objset] fletcher4 uncompressed authenticated LE contiguous unique double size=1000L/1000P birth=15383310L/15383310P fill=236 cksum=000000272218b544:0000703a9a4e8e4a:00a9c470070c10f1:b2d3805de20d379e
+Dataset tank/mysql/redacted/data@planb-20231126T1900Z [ZPL], ID 67919, cr_txg 15384510, 32.7M, 236 objects, rootbp DVA[0]=<0:2c5a1906800:1000> DVA[1]=<0:1a4ac5f4200:1000> [L0 DMU objset] fletcher4 uncompressed authenticated LE contiguous unique double size=1000L/1000P birth=15384359L/15384359P fill=236 cksum=0000002477cce424:000067457efc5aaa:009aa63ca1cb117d:a1bc5cd4e7913680
Object lvl iblk dblk dsize dnsize lsize %full type
34 1 128K 1.50K 1K 512 1.50K 100.00 ZFS directory (K=inherit) (Z=inherit=inherit)
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
path /
uid 111
gid 120
atime Tue Jul 20 23:05:23 2021
@@ -47,31 +47,31 @@
Object lvl iblk dblk dsize dnsize lsize %full type
1373 2 128K 16K 312K 512 1.45M 100.00 ZFS plain file (K=inherit) (Z=inherit=inherit)
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 92
path /aria_log.00000001
uid 111
gid 120
atime Thu May 11 15:55:12 2023
- mtime Sun Nov 26 18:23:34 2023
- ctime Sun Nov 26 18:23:34 2023
+ mtime Sun Nov 26 19:49:34 2023
+ ctime Sun Nov 26 19:49:34 2023
crtime Thu May 11 15:55:12 2023
gen 11808755
mode 100660
size 1515520
parent 34
links 1
pflags 840800000004
Indirect blocks:
- 0 L1 DVA[0]=<0:37d31f40c00:1a00> DVA[1]=<0:32fc83dc800:1a00> [L1 ZFS plain file] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/1a00P birth=15383310L/15383310P fill=93 cksum=133607619cd6fdff:08a161c5edcbeefd:4c07da2c5dca0bef:35f35ecb744ba729
+ 0 L1 DVA[0]=<0:37d39d9f600:1a00> DVA[1]=<0:32fcf924400:1a00> [L1 ZFS plain file] fletcher4 lz4 indirect-MAC LE contiguous unique double size=20000L/1a00P birth=15384359L/15384359P fill=93 cksum=133591c4d8e18cd7:08819389ec10b94c:46255c22acab03b7:243bd9c3b057da1f
0 L0 DVA[0]=<0:2b9ee7f8800:600> salt=a16006a52ff4bcb3 iv=d60f57e4331171fb:39c57726 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/600P birth=11808755L/11808755P fill=1 cksum=00480df48ce2376f:1b36ffdb1bbc8173:24b90b64cde4f631:5f58e161356986f0
4000 L0 DVA[0]=<0:2ba25bfb800:800> salt=a16006a52ff4bcb3 iv=ba410c2136c80647:1ad3fe9c [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=00ae75b70020849a:580a60172c11fdc6:eb3fb2b50a981203:a740c033085fe422
8000 L0 DVA[0]=<0:2ba25bfc000:800> salt=a16006a52ff4bcb3 iv=43ac0f67162fc3c3:6df46420 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=00b16fcf35454983:59a76676ec6c67f4:326ef34f738dfd38:ab2ab61815713ebc
c000 L0 DVA[0]=<0:2ba25bfc800:800> salt=a16006a52ff4bcb3 iv=a4027b64af32edc1:a7515d1d [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=009ed1f0d7172e8f:4e64d1fd08f36d39:57d2811b6c4a0210:a85a4819cfa90bb0
10000 L0 DVA[0]=<0:2ba25bfd000:800> salt=a16006a52ff4bcb3 iv=2edbdfa21cec0373:9f7a0520 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=00a64ee21be65790:5444901bfad56f2c:4879b5fdff176ed9:c4fe1e178293eb9b
14000 L0 DVA[0]=<0:2ba25bfd800:800> salt=a16006a52ff4bcb3 iv=b0fb7c2f54c7b99c:35b4774c [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=00b24533eda62b71:59d82b47fd32f89d:716d6124ac70325d:5501b55854791282
18000 L0 DVA[0]=<0:2ba25bfe000:800> salt=a16006a52ff4bcb3 iv=17649f0cfdc368c:94d62662 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=00aeff4bb9d45b59:5816188f23601001:5de8b76cdebc271c:697be53dd091db8f
1c000 L0 DVA[0]=<0:2ba25bfe800:800> salt=a16006a52ff4bcb3 iv=2eca68132f3dd2e6:6fc27721 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=00ad84ac5a814deb:575b14d5f326d816:1fed7df226dd783f:2cd06b2f61696c41
20000 L0 DVA[0]=<0:2ba25bff000:800> salt=a16006a52ff4bcb3 iv=fb961cb6649ed0b6:42131580 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=00b345ca6892ac7a:5af9626884c6549e:825576dfaaa646ff:c0e7d5d2c0acd8b0
24000 L0 DVA[0]=<0:2ba25bff800:800> salt=a16006a52ff4bcb3 iv=330e80414cec721a:97c0dd92 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/800P birth=11808755L/11808755P fill=1 cksum=00b02eb060f04c1c:586116933ea7c981:2782951e3b9a9eeb:ea843e208a9fa808
@@ -150,37 +150,37 @@
148000 L0 DVA[0]=<0:231bb75e600:1000> salt=607938b3d7d2a46b iv=a62d7b915f5583c2:3ef02b83 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/1000P birth=13544554L/13544554P fill=1 cksum=054e8f8750812113:4cc93994b2e6fecb:07a2038218962aef:f6651dd7d3ddde3c
14c000 L0 DVA[0]=<0:19d04de3a00:1000> salt=607938b3d7d2a46b iv=edca6375aceb2365:fd291a2 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/1000P birth=13742433L/13742433P fill=1 cksum=058ff61cbfc36e0e:993a454f2e3d1bfc:775688ddb2aec077:f8469b9ef0e0ef9e
150000 L0 DVA[0]=<0:150aebde200:1000> salt=607938b3d7d2a46b iv=f5590f43345cfe7d:9ddceb20 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/1000P birth=13943785L/13943785P fill=1 cksum=0550feef027bf5b7:564ab218ff6d02b7:f38e697cb6f6f9d1:34ddbc74c64f6995
154000 L0 DVA[0]=<0:324961ade00:1000> salt=4eb17d7d320f80a6 iv=fc8bf219a60534e3:a70af3f4 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/1000P birth=14143325L/14143325P fill=1 cksum=05346adaa24fb29c:26178dac597fc763:ee4d7ef4fb09179e:d5e5f0a601d78350
158000 L0 DVA[0]=<0:10766cd8400:1000> salt=f29c9e6b56b8e079 iv=59dc7c914b856505:5911697a [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/1000P birth=14340298L/14340298P fill=1 cksum=056e876ca4608d24:7f9bcdd539aaeca9:d095464796d349b7:550323666cf0c738
15c000 L0 DVA[0]=<0:ba10339a00:1000> salt=d582100159ee9f0c iv=419e2797b51ba352:53746861 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/1000P birth=14539888L/14539888P fill=1 cksum=05642a5a65c17e3b:6782f249bbfe55e4:78a2096ff4f65099:34268817543d4e51
160000 L0 DVA[0]=<0:2f057e9c400:1000> salt=c71d1cf36b9e52e6 iv=d386310c37b1abc8:cbf260ba [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/1000P birth=14734863L/14734863P fill=1 cksum=05761de15549f904:71ee2a11c84534a5:648c434f4c55c12b:89c1418f99d3f488
164000 L0 DVA[0]=<0:203813b8000:1000> salt=c71d1cf36b9e52e6 iv=5b1f342fe3052d26:fff300b6 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/1000P birth=14930857L/14930857P fill=1 cksum=0562d507d6ec98a0:5fbee88d2f4670f6:71dfaf54a377baff:65d86468ac351f04
168000 L0 DVA[0]=<0:3617507cc00:1000> salt=4c76819ef0bdfda5 iv=8c88ff91baf9f46a:13538638 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/1000P birth=15131803L/15131803P fill=1 cksum=052c81ab79eea7f6:220a7d16fa3fa704:93daf99b08399ca3:ca86683634a585c6
16c000 L0 DVA[0]=<0:229e2486600:1000> salt=eb7c19fbb31a2896 iv=39bfc684e4fa962e:1bce27f [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/1000P birth=15328972L/15328972P fill=1 cksum=0549743d53716b13:466a76fb9c6d638b:77a654cbcd50cfb8:050f0c328bd54410
- 170000 L0 DVA[0]=<0:234e4479e00:600> salt=eb7c19fbb31a2896 iv=bdbcb75cb602c68b:ffbc7dee [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/600P birth=15383310L/15383310P fill=1 cksum=0046d27c23bc84de:1ac97de673eac332:c85e8e83663efed9:f6d658c6697cdf9d
+ 170000 L0 DVA[0]=<0:234e715d600:600> salt=eb7c19fbb31a2896 iv=1615aa2a2d5564ad:d8aec9a2 [L0 ZFS plain file] fletcher4 lz4 encrypted LE contiguous unique single size=4000L/600P birth=15384359L/15384359P fill=1 cksum=00472c143a8a15ea:1a8d27931c546a6c:341e967cc061f060:21da747cf9e82cb7
segment [0000000000000000, 0000000000174000) size 1.45M
Object lvl iblk dblk dsize dnsize lsize %full type
1374 1 128K 512 512 512 512 100.00 ZFS plain file (K=inherit) (Z=inherit=inherit)
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
path /aria_log_control
uid 111
gid 120
atime Thu May 11 15:55:12 2023
- mtime Sun Nov 26 18:23:34 2023
- ctime Sun Nov 26 18:23:34 2023
+ mtime Sun Nov 26 19:49:34 2023
+ ctime Sun Nov 26 19:49:34 2023
crtime Thu May 11 15:55:12 2023
gen 11808755
mode 100660
size 52
parent 34
links 1
pflags 840800000004
Indirect blocks:
- 0 L0 DVA[0]=<0:234e447a400:200> salt=eb7c19fbb31a2896 iv=501f8bf6a9a38b03:84da206a [L0 ZFS plain file] fletcher4 uncompressed encrypted LE contiguous unique single size=200L/200P birth=15383310L/15383310P fill=1 cksum=0002a4e371958b27:0057772bed0ab651:f66ca0c460379c96:e91fe27cbaf1fbe4
+ 0 L0 DVA[0]=<0:234e715dc00:200> salt=eb7c19fbb31a2896 iv=7d98a4386a558b60:fd266813 [L0 ZFS plain file] fletcher4 uncompressed encrypted LE contiguous unique single size=200L/200P birth=15384359L/15384359P fill=1 cksum=0002c2484e02f9e1:005a1eeabbf1c802:0e2f9fce11c32d51:0f35cf920387c23c
segment [0000000000000000, 0000000000000200) size 512
Today I have more question marks/immediately failing files:
# ls -l
ls: cannot access 'ddl.log': Input/output error
ls: cannot access 'ib_buffer_pool': Input/output error
ls: cannot access 'aria_log.00000001': Input/output error
ls: cannot access 'galera.cache': Input/output error
ls: cannot access 'aria_log_control': Input/output error
total 2286
-????????? ? ? ? ? ? aria_log.00000001
-????????? ? ? ? ? ? aria_log_control
-????????? ? ? ? ? ? ddl.log
-rw-rw---- 1 mysql mysql 16384 nov 14 13:56 ddl_recovery.log
-????????? ? ? ? ? ? galera.cache
-rw-rw---- 1 mysql mysql 7 nov 14 13:56 galera.pid
...
# ls -l galera.*
ls: cannot access 'galera.cache': Input/output error
-rw-rw---- 1 mysql mysql 7 nov 14 13:56 galera.pid
# cat galera.pid
cat: galera.pid: Input/output error
Object lvl iblk dblk dsize dnsize lsize %full type
514 1 128K 512 512 512 512 100.00 ZFS plain file (K=inherit) (Z=inherit=inherit)
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
path /galera.pid
uid 111
gid 120
atime Tue Nov 14 13:56:56 2023
mtime Tue Nov 14 13:56:56 2023
ctime Tue Nov 14 13:56:56 2023
crtime Tue Nov 14 13:56:56 2023
gen 15162731
mode 100660
size 7
parent 34
links 1
pflags 840800000004
Indirect blocks:
0 L0 DVA[0]=<0:14802d99600:200> salt=eb7c19fbb31a2896 iv=cc4fd4a49efb67fa:28fa32e6 [L0 ZFS plain file] fletcher4 uncompressed encrypted LE contiguous unique single size=200L/200P birth=15162731L/15162731P fill=1 cksum=0002abee1ae25470:0058d8bdb904c79e:a1cd2474ef3ddaba:9846bf62da08af08
segment [0000000000000000, 0000000000000200) size 512
Reading with zdb -r does seem to work:
# ~osso/openzfs-zfs.git/zdb -K $KEY -dbdbdbdbdbdb tank/mysql/redacted/data@planb-20231126T1900Z -r galera.pid /root/galera.pid
Unlocked encryption root: tank/mysql
Unlocked encryption root: tank/mysql
Copying object 514 to file /root/galera.pid
Object 514 is 7 bytes
# hd /root/galera.pid
00000000 31 35 34 31 35 33 0a |154153.|
00000007
In fact, the aria_log_control that was modified between the good (1800Z) and bad (1900Z) snapshots is also readable with zdb -r.
That's about all I can gather from this right now. Let me know if you'd like to see anything else.
My guess remains that it's failing to decrypt the SA and thus refusing you access to the information on the object, but it's not obvious to me why it would be doing that. (zdb, of course, doesn't care about that anyway.)
What's the server hardware that this is in?
I don't immediately recall any bugs with a race in the native encryption code, but, who knows.
> What's the server hardware that this is in?
Supermicro X11SCE-F, Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz (6 cores), 4x32GB Micron DDR4 (18ADF4G72AZ-2G6B2), INTEL SSDPE2KX040T8 4TB nvme.
If you have suggestions on debug/tracing code to add to 2.1.13, I can certainly add patches.
A snapshot of a second dataset broke today. I removed both snapshots, so the backups are running again. That means there's nothing to poke at until the next error pops up.
Checking in.
In November I had corrupt snapshots on:
- 2023-11-01 (zfs-2.1.5)
- 2023-11-13 (zfs-2.1.13)
- 2023-11-26
- 2023-11-28
- 2023-11-29
- 2023-11-30 (2 corrupt snapshots)
In December (2023-12-01 10:03:02), I updated to ZFS 2.1.14-1osso0:
- no corrupt snapshots yet
No idea if it could be related to the fixes in 2.1.14. I do hope it is and that the snapshots remain working from now on.
I would naively expect it to be unrelated, as I don't immediately see how the fix for #15526 would have impacted this, and if it were #15395 I wouldn't expect the snapshots to have data readable from zdb, but I could be wrong.
Okay. Well, that theory was fun while it lasted :)
errors: Permanent errors have been detected in the following files:
tank/mysql/redacted/log@planb-20231219T0600Z:<0x0>
For different reasons I'll be disabling hourly snapshots on this host. Let me know if you have anything I can work with, then I'll gladly restart something to try and reproduce.
(/proc/spl/kstat/zfs/dbgmsg kept)
Maybe relevant: when no snapshots were made in between, I (still) had to scrub twice to get rid of the error after deleting the failing snapshot.
That's expected - the way the error list in zpool status works, it keeps the previous list after a scrub and only removes entries once they haven't cropped up for two scrubs. I think the original idea was that if you have something flaky, it's useful to remember where it was wrong before, even if it read fine this time.
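So the full clear-out sequence for one of these snapshots ends up as (a sketch; -w waits for the scrub to finish and can be replaced by polling zpool status on versions without it):

```sh
# Two scrubs are needed before the stale entries leave the error list.
zfs destroy tank/mysql/redacted/log@planb-20231219T0600Z
zpool scrub -w tank
zpool scrub -w tank
zpool status tank    # errors: No known data errors
```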
Today I saw https://www.phoronix.com/news/OpenZFS-Encrypt-Corrupt and this looks very similar to #12014:
- the source snapshots gets corrupt, but the receiving end never gets (the) corrupt snapshots
- removing the corrupt snapshots and doing a bunch of scrubbing removes the errors
- problems started with zfs 2.1.4 (ubuntu jammy)
- problems are only observed where a relatively high send/recv frequency is done (in our case, an hourly script that did a handful of datasets in sequence)