zed excessive logging with class=config_sync
This appears to be an exact duplicate of the closed issue #6667. I have no ability to re-open that issue so based on this guidance (https://stackoverflow.com/questions/21333654/how-to-re-open-an-issue-in-github) I'm opening a new issue and referencing the old one.
System information
| Type | Version/Name |
|---|---|
| Distribution Name | CentOS 7 |
| Distribution Version | 7.4.1708 (Core) |
| Linux Kernel | 3.10.0-693.11.6.el7.x86_64 |
| Architecture | x86_64 |
| ZFS Version | 0.7.5-1 |
| SPL Version | 0.7.5-1 |
Describe the problem you're observing
"zed" is spamming syslog with messages like:
Feb 5 03:39:03 hostname zed: eid=<incrementing number> class=config_sync pool_guid=0xCA08E3DA44C8B6D6
The total volume is pretty impressive, here's last week's count:
# grep -c zed messages-20180205
1154710
Describe how to reproduce the problem
It started after I rebooted the server on Jan 17. That was the first reboot after I had copied a number of ZFS filesystems from another server onto this one. It was also the first reboot after I had installed some HP hardware tools onto this DL360 server.
Jan 17 10:34:39 servername yum[15213]: Installed: hpsmh-7.6.0-11.x86_64
Jan 17 10:34:49 servername yum[15213]: Installed: hpdiags-10.60.2109-2176.x86_64
Jan 17 10:54:52 servername yum[20732]: Installed: hponcfg-4.6.0-0.x86_64
Include any warning/errors/backtraces from the system logs
No backtraces, but the errors are of the same form as previously reported. Here are the very first ones with some context during the boot:
Jan 17 10:56:43 hostname systemd: Started Initialize the iWARP/InfiniBand/RDMA stack in the kernel.
Jan 17 10:56:45 hostname kernel: hpdiagstc:21831 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Jan 17 10:56:45 hostname kernel: hpdiagstc:21848 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Jan 17 10:56:45 hostname kernel: hpdiagstc:21865 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Jan 17 10:56:45 hostname kernel: sdd: sdd1 sdd9
Jan 17 10:56:45 hostname kernel: sde: sde1 sde9
Jan 17 10:56:46 hostname kernel: sdd: sdd1 sdd9
Jan 17 10:56:46 hostname kernel: sde: sde1 sde9
Jan 17 10:56:46 hostname kernel: hpdiagstc:21865 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Jan 17 10:56:47 hostname kernel: sdd: sdd1 sdd9
Jan 17 10:56:47 hostname kernel: sde: sde1 sde9
Jan 17 10:56:47 hostname zed: eid=42 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
Jan 17 10:56:47 hostname zed: eid=43 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
I traced the initiation of this error spam to starting the HP server diagnostics via the System Management Homepage. When started, I see the following in the logs:
Feb 5 15:41:41 hostname kernel: hpdiagstc:19405 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Feb 5 15:41:41 hostname kernel: hpdiagstc:19414 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Feb 5 15:41:42 hostname kernel: hpdiagstc:19439 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Feb 5 15:41:43 hostname kernel: hpdiagstc:19439 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Feb 5 15:41:44 hostname zed: eid=17 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
Feb 5 15:41:44 hostname zed: eid=18 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
Feb 5 15:41:45 hostname zed: eid=19 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
... and so on ...
A reboot of the server seems to clear the problem, at least temporarily.
@sbonds thanks for creating a new issue for this. The config_sync event with an increasing eid suggests something on the system is forcing a configuration update. This should primarily happen during import or device addition/removal, not during normal operation. A few questions:
- Are there other events mixed in with the config_sync ones?
- Does stopping the ZED halt the creation of new events?
Are there other events mixed in with the config_sync ones?
No.
Does stopping the ZED halt the creation of new events?
Yes. A restart also stops all future (non-queued) events.
Feb 5 18:55:25 hostname systemd: Started ZFS Event Daemon (zed).
Feb 5 18:55:25 hostname systemd: Starting ZFS Event Daemon (zed)...
Feb 5 18:55:25 hostname zed: ZFS Event Daemon 0.7.5-1 (PID 25510)
Feb 5 18:55:25 hostname zed: Processing events since eid=168
Feb 5 18:55:25 hostname zed: eid=169 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
Feb 5 18:55:25 hostname zed: eid=170 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
... no additional log after those two, presumably queued, messages ...
I also did some poking around to determine if this was an HP tools bug or if the HP diags just kicked something off that created the zed cascade. It seems like the latter. I can also start the same cascade with a simple udevadm trigger command that doesn't involve the tools at all.
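For reference, a trigger along these lines is enough to set it off on my system (the exact options here are illustrative; any command that replays change uevents for the block devices should do):
# udevadm trigger --subsystem-match=block --action=change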
I have had the same issue. CentOS 7, ZFS 0.7.13.
May 7 14:36:49 virtual zed: eid=66353 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:49 virtual zed: eid=66354 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:50 virtual zed: eid=66355 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:50 virtual zed: eid=66356 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:50 virtual zed: eid=66357 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:50 virtual zed: eid=66358 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:51 virtual zed: eid=66359 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:51 virtual zed: eid=66360 class=config_sync pool_guid=0xBC0B31523322D839
Restarting zed seems to have cleared it. If it reoccurs, I'll do a better job of capturing data.
My server has been doing the same thing for a year+; restarting the zfs-zed service stops the nonsense for ~1 hour or so. It pegs a CPU and uses a bunch of disk I/O.
I've scheduled an hourly "service zfs-zed restart" which alleviates but does not fix this.
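For anyone who wants the same band-aid, a root crontab entry along these lines would do it (the schedule and path are illustrative; on systemd-only setups systemctl restart zfs-zed works as well):
0 * * * * /usr/sbin/service zfs-zed restart >/dev/null 2>&1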
This issue is still present in Debian buster + Proxmox. Zed version: zfs-0.8.3-pve1
Still present on Ubuntu, and it's also creating about a 20% CPU load on my machine. Restarting the service fixes it. Zed version: zfs-zed/now 0.8.4-1~18.04.york0 amd64
@kdar one work-around is to disable autoexpand on the pool
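Concretely, the workaround amounts to something like the following, where <pool> is a placeholder for the pool name:
# zpool set autoexpand=off <pool>
# zpool get autoexpand <pool>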
@don-brady Disabled it and restarted the service. Still getting messages like:
Oct 02 21:38:28 vault zed[4174495]: eid=9959921 class=config_sync pool_guid=0xB6D5D39105D5F808
Hi! I caught one more situation where ZED starts spamming. In my case it happened when attaching a drive to a pool (single -> mirror).
ZED was also actively reading/writing on the drives in the mirror:
# zpool status -L
pool: pool1
state: ONLINE
scan: resilvered 792K in 0h0m with 0 errors on Thu Dec 10 19:43:20 2020
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
errors: No known data errors
# dstat -anr --disk-util -D sda,sdb 10
----total-cpu-usage---- --dsk/sda-----dsk/sdb-- -net/total- ---paging-- ---system-- -net/total- ---io/sda------io/sdb-- sda--sdb-
usr sys idl wai hiq siq| read writ: read writ| recv send| in out | int csw | recv send| read writ: read writ|util:util
0 0 100 0 0 0| 65k 208k: 165k 208k| 0 0 | 0 0 |5770 4932 | 0 0 |0.96 7.67 :7.12 6.41 |0.66:0.76
0 0 99 0 0 0|4828k 14M: 12M 14M|1337k 1204k| 0 0 | 36k 48k|1337k 1204k|71.8 597 : 531 486 |46.8:56.1
0 1 99 0 0 0|4423k 13M: 11M 13M|1331k 1275k| 0 0 | 34k 44k|1331k 1275k|65.4 545 : 489 443 |51.1:60.2
...
0 1 99 0 0 0|7552k 23M: 19M 23M|1147k 1086k| 0 0 | 51k 73k|1151k 1104k| 112 946 : 837 765 |38.9:46.0
0 1 99 0 0 0|8132k 24M: 20M 24M|1290k 1261k| 0 0 | 51k 77k|1249k 1220k| 121 991 : 895 791 |37.6:46.9
0 0 100 0 0 0|2818k 8450k:7223k 8450k|1187k 1143k| 0 0 | 22k 31k|1187k 1143k|41.3 343 : 312 274 |13.0:16.3
0 0 100 0 0 0| 0 0 : 0 0 |1214k 1247k| 0 0 |5718 4746 |1214k 1247k| 0 0 : 0 0 | 0: 0
0 0 100 0 0 0| 0 0 : 0 0 |1126k 1122k| 0 0 |5533 4605 |1126k 1122k| 0 0 : 0 0 | 0: 0
0 0 100 0 0 0| 0 0 : 0 0 |1246k 1233k| 0 0 |5618 4711 |1246k 1233k| 0 0 : 0 0 | 0: 0
After restarting ZED, all that I/O stopped (see the last three dstat samples).
Is there any sort of logging, as verbose as need be, that can be enabled so I can help track this down?
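For what it's worth, the full event payloads (not just the one-line syslog summary) can be dumped and followed with zpool events; this is standard ZFS tooling, so it may or may not reveal what is kicking the loop off. There is also an all-debug.sh zedlet shipped with zed (it shows up later in this thread) whose enablement is distro-specific.
# zpool events -v | tail -n 60   # full nvpair detail for recent events
# zpool events -f                # follow new events as they arrive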
It seems to generally start up after any sort of "big" update, but it's not 100% consistent and occasionally it just decides to happen. It only started once I "upgraded" my zpool to enable all the features, because that seemed like a good idea at the time and I'm not planning on going "backwards" to SunOS or an older Debian. I've done nothing other than set up the pool and enable the various features.
I would say the rate is about once a week and the server is only rebooted when necessary for updates.
Restarting the zed service always fixes the problem.
My setup is as follows and extremely vanilla (ZFS pool + SMB share and PLEX server, single user):
| Type | Version/Name |
|---|---|
| Distribution Name | Debian Stable (Buster + backports) |
| Distribution Version | 10 |
| Linux Kernel | 5.10.0-0.bpo.4-amd64 |
| Architecture | x86_64 |
| ZFS/ZED Version | 2.0.3-1~bpo10+1 |
| SPL Version | 2.0.3-1~bpo10+1 |
zpool status:
  pool: tank0
 state: ONLINE
  scan: scrub repaired 0B in 01:47:51 with 0 errors on Sun Apr 4 05:47:53 2021
config:
NAME STATE READ WRITE CKSUM
tank0 ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
wwn-0x500003987bc02157 ONLINE 0 0 0
wwn-0x5000039a6b800076 ONLINE 0 0 0
wwn-0x5000039a6b800079 ONLINE 0 0 0
wwn-0x5000039a6b880065 ONLINE 0 0 0
wwn-0x5000039a6ba00341 ONLINE 0 0 0
wwn-0x5000039a5c70022f ONLINE 0 0 0
wwn-0x5000cca249d0b1de ONLINE 0 0 0
wwn-0x5000cca269d65227 ONLINE 0 0 0
errors: No known data errors
Also have seen this on Ubuntu 20.04. In our case it was a backup server that just receives hourly snapshots from a production pool. So most of the time the pools are idle. Except I noticed one of them wasn't because the pool is on AWS sc1 volumes and I noticed that their "burst balance" was declining when we weren't running anything. zpool iostat confirmed there was I/O to the pool and iotop implicated zed. The messages in syslog lead me here. "systemctl restart zed" stopped the I/O and messages in this case.
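For anyone who wants to confirm the same thing, the checks boil down to something like this (pool name and interval are placeholders):
# zpool iostat -v <pool> 5                        # per-vdev I/O on an otherwise idle pool
# iotop -o -b -n 3 | grep -i zed                  # is zed the process doing the I/O?
# journalctl -t zed --since "1 hour ago" | grep -c config_sync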
It looks like it may have been triggered in this case after the pool was expanded when it ran out of space. I.e., the pool filled up, I expanded the EBS volumes, then (eventually, since autoexpand doesn't seem to always work) got the pool to expand. The class=config_sync messages seem to have started around the time I was doing that work.
I just saw this yesterday on zfs 2.1.1 on ubuntu 20.04. Every second was getting the config_sync log messages. A small sample:
syslog.1:Oct 16 23:32:10 Home-Server zed: eid=36362 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:10 Home-Server zed: eid=36363 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:10 Home-Server zed: eid=36364 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:11 Home-Server zed: eid=36365 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:11 Home-Server zed: eid=36366 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:11 Home-Server zed: eid=36367 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:11 Home-Server zed: eid=36368 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:12 Home-Server zed: eid=36369 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:12 Home-Server zed: eid=36370 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:13 Home-Server zed: eid=36371 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:13 Home-Server zed: eid=36372 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:13 Home-Server zed: eid=36373 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:13 Home-Server zed: eid=36374 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:14 Home-Server zed: eid=36375 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:14 Home-Server zed: eid=36376 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:14 Home-Server zed: eid=36377 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:14 Home-Server zed: eid=36378 class=config_sync pool='Raid'
Restarting zed stopped the issue for me. I've resorted to cron.hourly to restart it. Not using HP hardware, no events like attaching a device, etc.; just normal operation, and it (seemingly randomly) starts happening. I was not able to figure out what starts it, but it never goes away without a zed restart.
I just came back from vacation to find the same thing happening on my system. I can't tell when the problem started because zed messages have completely filled the journal to the point that only 2 days' worth of logs are present.
journalctl -l | grep zed | wc -l
368015
I have a weekly cron job that monitors the # LBAs written to my SLOG SSD so that I can tell if it's approaching the rated TBW lifespan. In a typical idle week, this drive will write about 2GB/day. Since 12/16, it looks like ~500GB have been written which is a huge amount given the system should have been idle as I was on vacation.
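(For context, a check like the following is one way to read that counter; the smartctl attribute names vary by vendor and the device paths here are placeholders:)
# smartctl -A /dev/sdX | grep -i Total_LBAs_Written        # common SATA SSD attribute
# smartctl -A /dev/nvme0n1 | grep -i 'Data Units Written'  # NVMe equivalent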
I too am experiencing this issue.
journalctl -l | grep zed | wc -l
2639766
Ubuntu 21.10, Linux 5.13.0-27-generic on x86_64
zfs-2.0.6-1ubuntu2
zfs-kmod-2.0.6-1ubuntu2
For me, this issue comes back on reboot, but restarting the zed service stops it indefinitely. However, I've found that whenever I open webmin, the issue immediately comes back.
I have the same issue, #13070
Feb 06 19:20:53 shareserver zed[6892]: eid=36 class=config_sync pool='xpool3'
Feb 06 19:20:53 shareserver zed[6891]: eid=37 class=config_sync pool='xpool3'
Feb 06 19:20:54 shareserver zed[7545]: eid=38 class=config_sync pool='xpool3'
Feb 06 19:20:54 shareserver zed[7559]: eid=39 class=config_sync pool='xpool3'
Feb 06 19:20:55 shareserver zed[7940]: eid=40 class=config_sync pool='xpool1'
Feb 06 19:20:55 shareserver zed[7947]: eid=41 class=config_sync pool='xpool1'
Feb 06 19:20:57 shareserver zed[8619]: eid=42 class=config_sync pool='xpool2'
Feb 06 19:20:57 shareserver zed[8620]: eid=43 class=config_sync pool='xpool2'
Feb 06 19:20:58 shareserver zed[8850]: eid=44 class=config_sync pool='xpool3'
Feb 06 19:20:58 shareserver zed[8851]: eid=45 class=config_sync pool='xpool3'
Feb 06 19:20:59 shareserver zed[9080]: eid=46 class=config_sync pool='xpool3'
Feb 06 19:20:59 shareserver zed[9081]: eid=47 class=config_sync pool='xpool3'
After startup, I see these messages for 10 minutes straight. After that, everything drops to 0 MB/s for the following hours. It happens only at startup.
@behlendorf can you please take a look at this? This strange behaviour seems very old.
https://github.com/openzfs/zfs/issues/7366#issuecomment-377401824
At this link there is the solution, at least for me.
Setting autoexpand to off on all pools resolved the problem.
journalctl | grep zed
Feb 06 22:10:52 shareserver systemd[1]: Started ZFS Event Daemon (zed).
Feb 06 22:10:52 shareserver zed[3386]: ZFS Event Daemon 2.1.2-1 (PID 3386)
Feb 06 22:10:52 shareserver zed[3386]: Processing events since eid=0
Feb 06 22:10:52 shareserver zed[3409]: eid=2 class=config_sync pool='xpool1'
Feb 06 22:10:52 shareserver zed[3412]: eid=3 class=pool_import pool='xpool1'
Feb 06 22:10:52 shareserver zed[3430]: eid=5 class=config_sync pool='xpool1'
Feb 06 22:10:52 shareserver zed[3439]: eid=9 class=config_sync pool='xpool2'
Feb 06 22:10:52 shareserver zed[3441]: eid=10 class=pool_import pool='xpool2'
Feb 06 22:10:52 shareserver zed[3454]: eid=12 class=config_sync pool='xpool2'
Feb 06 22:10:52 shareserver zed[3461]: eid=14 class=config_sync pool='xpool3'
Feb 06 22:10:52 shareserver zed[3468]: eid=15 class=pool_import pool='xpool3'
Feb 06 22:10:52 shareserver zed[3476]: eid=17 class=config_sync pool='xpool3'
Feb 06 22:10:59 shareserver openvpn[4867]: Sun Feb 6 22:10:59 2022 Diffie-Hellman initialized with 2048 bit key
Feb 06 22:10:59 shareserver openvpn[4867]: Sun Feb 6 22:10:59 2022 Outgoing Control Channel Encryption: Cipher 'AES-256-CTR' initialized with 256 bit key
Feb 06 22:10:59 shareserver openvpn[4867]: Sun Feb 6 22:10:59 2022 Incoming Control Channel Encryption: Cipher 'AES-256-CTR' initialized with 256 bit key
Feb 06 22:10:59 shareserver kernel: Guest personality initialized and is inactive
Feb 06 22:10:59 shareserver kernel: Initialized host personality
The problem seems to still be here. What is very strange is that after shutting down the system and starting it again, the autoexpand feature was still on even though it had been disabled.
@xgiovio I added the following line to my root crontab to automatically restart zed 30 seconds after rebooting. It's the best option I've found until there's a permanent fix. Plus, no need to disable autoexpand or anything else.
@reboot sleep 30 && systemctl restart zfs-zed
@derekakelly thanks, but I really think there is something related to the autoexpand setting and the import cache. Yesterday I set autoexpand to off on all pools and checked again with zpool get: they all had autoexpand off. I rebooted, the config_sync messages disappeared, and I verified again that autoexpand was off. ALL FINE.
Today I started the system and noticed the config_sync messages again. I checked the autoexpand feature on the pools and it was back on for all of them. So I disabled autoexpand again, restarted zed, and then disabled autoexpand once more to be sure.
I rebooted the server, and after a full power-off and start the autoexpand feature is now off. No zed config_sync errors.
From what I understand, there may be some bug involving zed, autoexpand, and import from the cache file.
I hope some of the main devs can take a deep look into it.
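For anyone who wants to check the same thing, this is roughly what I mean (pool name is a placeholder; the cachefile path assumes the default location, and I don't know whether rewriting the cache actually prevents the setting from coming back):
# zpool get autoexpand,cachefile <pool>
# zpool set autoexpand=off <pool>
# zpool set cachefile=/etc/zfs/zpool.cache <pool>
# ... reboot, then verify ...
# zpool get autoexpand <pool>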
I've seen a 25-minute config_sync event flood after autoexpanding a zpool. Perhaps this is related; it can probably be reproduced this way:
- qemu/kvm VM, recently updated to Debian bullseye, (it is still using sysvinit-core)
- backing block device is a LVM LV on the hypervisor
- created 820 GiB zpool "archive" on the full device /dev/vde, zfs has created two partitions, vde1 and vde9
- destroyed the zpool and re-created it with a different mountpoint
- zfs set acltype=posixacl archive; zfs set xattr=sa archive
- copied ~550 GiB file system to it
- installed zfs-auto-snapshot, disabled frequent and hourly snapshots, created an initial snapshot
Proceeded to try online resize of the zpool:
- checked autoexpand, it was off
- zpool set autoexpand=on archive
- lvextend-ed the LV on the hypervisor to 840 GiB
- virsh blockresize ...
- while watching dmesg on the guest:
  [113829.223812] virtio_blk virtio6: [vde] new size: 1761607680 512-byte logical blocks (902 GB/840 GiB)
  [113829.224248] vde: detected capacity change from 880468295680 to 901943132160
- zpool list
  NAME     SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
  archive  816G  526G   290G  -        -         0%    64%  1.00x  ONLINE  -
- zpool online archive /dev/vde (took a few seconds)
- zpool list
  NAME     SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
  archive  832G  526G   306G  -        -         0%    63%  1.00x  ONLINE  -
- the events start to appear in syslog and continue with a frequency of approx. 2/s:
  zed[28996]: zed_disk_event:
  zed[28996]: #011class: EC_dev_status
  zed[28996]: #011subclass: dev_dle
  zed[28996]: #011dev_name: /dev/vde1
  zed[28996]: #011path: /devices/pci0000:00/0000:00:09.0/virtio6/block/vde/vde1
  zed[28996]: #011phys_path: pci-0000:00:09.0
  zed[28996]: #011dev_size: 901932646400
  zed[28996]: #011pool_guid: 14094888156864081133
  zed[28996]: #011vdev_guid: 14550771222278676910
  zed[28996]: zfs_slm_event: EC_dev_status.dev_dle
  zed[28996]: zfsdle_vdev_online: searching for '14550771222278676910' in 'archive'
  zed[28996]: Invoking "all-syslog.sh" eid=331 pid=6939
  zed: eid=331 class=config_sync pool='archive'
  zed[28996]: Finished "all-syslog.sh" eid=331 pid=6939 exit=0
Started rsync to compare the initial snapshot to the source FS:
- rsync -aHAX --delete -i -n /mnt/src/ /srv/archive/.zfs/snapshot/snap_initial-2022-03-05-1533
A few minutes later I noticed the config_sync event flood with around 2 events per second while rsync was still running:
zed: eid=1337 class=config_sync pool='archive'
The event flood continued and lasted altogether around 25 minutes, then:
kernel: [115475.716128] vde: vde1 vde9
zed[28996]: zed_disk_event:
zed[28996]: #011class: EC_dev_remove
zed[28996]: #011subclass: disk
zed[28996]: #011dev_name: /dev/vde1
zed[28996]: #011path: /devices/pci0000:00/0000:00:09.0/virtio6/block/vde/vde1
zed[28996]: #011phys_path: pci-0000:00:09.0
zed[28996]: #011dev_size: 901932646400
zed[28996]: #011pool_guid: 14094888156864081133
zed[28996]: #011vdev_guid: 14550771222278676910
zed[28996]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
zed[28996]: zed_disk_event:
zed[28996]: #011class: EC_dev_add
zed[28996]: #011subclass: disk
zed[28996]: #011dev_name: /dev/vde1
zed[28996]: #011path: /devices/pci0000:00/0000:00:09.0/virtio6/block/vde/vde1
zed[28996]: #011phys_path: pci-0000:00:09.0
zed[28996]: #011dev_size: 901932646400
zed[28996]: #011pool_guid: 14094888156864081133
zed[28996]: #011vdev_guid: 14550771222278676910
zed[28996]: Invoking "all-syslog.sh" eid=2112 pid=31927
zed: eid=2112 class=config_sync pool='archive'
zed[28996]: Finished "all-syslog.sh" eid=2112 pid=31927 exit=0
zed[28996]: zfsdle_vdev_online: setting device 'vde' to ONLINE state in pool 'archive': 0
zed[28996]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
zed[28996]: Invoking "all-syslog.sh" eid=2113 pid=31929
zed[28996]: zfs_slm_event: EC_dev_add.disk
zed: eid=2113 class=config_sync pool='archive'
zed[28996]: Finished "all-syslog.sh" eid=2113 pid=31929 exit=0
It seems that at the end zfs repartitioned the device, and then the flood stopped.
After the resize, I'm turning autoexpand off.
Just had the same issue, had to reboot the server to make zed stop doing config_sync continuously. From the daemon.log file:
Apr 8 23:15:03 server01 systemd[1]: Starting Snapshot ZFS Pool...
Apr 8 23:15:03 server01 sanoid[339261]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:03 server01 sanoid[339261]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:03 server01 sanoid[339261]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:03 server01 sanoid[339261]: ignored key 'recursive' from 'zfs_scratch' template.
Apr 8 23:15:03 server01 sanoid[339261]: INFO: cache expired - updating from zfs list.
Apr 8 23:15:03 server01 sanoid[339261]: INFO: taking snapshots...
Apr 8 23:15:03 server01 systemd[1]: sanoid.service: Succeeded.
Apr 8 23:15:03 server01 systemd[1]: Finished Snapshot ZFS Pool.
Apr 8 23:15:04 server01 systemd[1]: Starting Cleanup ZFS Pool...
Apr 8 23:15:04 server01 sanoid[339269]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:04 server01 sanoid[339269]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:04 server01 sanoid[339269]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:04 server01 sanoid[339269]: ignored key 'recursive' from 'zfs_scratch' template.
Apr 8 23:15:04 server01 sanoid[339269]: INFO: pruning snapshots...
Apr 8 23:15:04 server01 systemd[1]: sanoid-prune.service: Succeeded.
Apr 8 23:15:04 server01 systemd[1]: Finished Cleanup ZFS Pool.
Apr 8 23:28:29 server01 zed: eid=163 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=164 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=165 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=166 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=167 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=168 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=169 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=170 class=config_sync pool='zfs-scratch'
Apr 8 23:28:30 server01 zed: eid=171 class=config_sync pool='zfs-scratch'
Apr 8 23:28:30 server01 zed: eid=172 class=config_sync pool='zfs-scratch'
Apr 8 23:28:30 server01 zed: eid=173 class=config_sync pool='zfs-scratch'
Apr 8 23:28:30 server01 zed: eid=174 class=config_sync pool='zfs-scratch'
Apr 8 23:28:30 server01 zed: eid=175 class=config_sync pool='disk-hgst1'
Apr 8 23:28:30 server01 zed: eid=176 class=config_sync pool='disk-hgst1'
Apr 8 23:28:31 server01 zed: eid=177 class=config_sync pool='zfs-scratch'
Apr 8 23:28:31 server01 zed: eid=178 class=config_sync pool='zfs-scratch'
Apr 8 23:28:31 server01 zed: eid=179 class=config_sync pool='disk-stiron1'
Apr 8 23:28:31 server01 zed: eid=180 class=config_sync pool='disk-stiron1'
Apr 8 23:28:32 server01 zed: eid=181 class=config_sync pool='disk-stiron2'
Apr 8 23:28:32 server01 zed: eid=182 class=config_sync pool='disk-stiron2'
Apr 8 23:28:32 server01 zed: eid=183 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=184 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=185 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=186 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=187 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=188 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=189 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=190 class=config_sync pool='zfs-scratch'
Apr 8 23:28:33 server01 zed: eid=191 class=config_sync pool='disk-hgst1'
Apr 8 23:28:33 server01 zed: eid=192 class=config_sync pool='disk-hgst1'
Apr 8 23:28:33 server01 zed: eid=193 class=config_sync pool='zfs-scratch'
Apr 8 23:28:33 server01 zed: eid=194 class=config_sync pool='zfs-scratch'
Apr 8 23:28:33 server01 zed: eid=195 class=config_sync pool='disk-stiron1'
Apr 8 23:28:33 server01 zed: eid=196 class=config_sync pool='disk-stiron1'
Apr 8 23:28:34 server01 zed: eid=197 class=config_sync pool='disk-stiron2'
Apr 8 23:28:34 server01 zed: eid=198 class=config_sync pool='disk-stiron2'
Apr 8 23:28:34 server01 zed: eid=199 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=200 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=201 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=202 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=203 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=204 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=205 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=206 class=config_sync pool='zfs-scratch'
Apr 8 23:28:35 server01 zed: eid=207 class=config_sync pool='disk-hgst1'
Apr 8 23:28:35 server01 zed: eid=208 class=config_sync pool='disk-hgst1'
..... (truncated due to size)
Apr 9 10:52:23 server01 zed: eid=201335 class=config_sync pool='disk-stiron1'
Apr 9 10:52:23 server01 zed: eid=201336 class=config_sync pool='disk-stiron1'
Apr 9 10:52:24 server01 zed: eid=201337 class=config_sync pool='disk-stiron1'
Apr 9 10:52:24 server01 zed: eid=201338 class=config_sync pool='disk-stiron1'
Apr 9 10:52:24 server01 zed: eid=201339 class=config_sync pool='disk-stiron1'
Apr 9 10:52:24 server01 zed: eid=201340 class=config_sync pool='disk-stiron1'
Apr 9 10:52:25 server01 zed: eid=201341 class=config_sync pool='disk-stiron1'
Apr 9 10:52:25 server01 zed: eid=201342 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201343 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201344 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201345 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201346 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201347 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201348 class=config_sync pool='disk-stiron1'
Apr 9 10:52:27 server01 zed: eid=201349 class=config_sync pool='disk-stiron1'
Apr 9 10:52:27 server01 zed: eid=201350 class=config_sync pool='disk-stiron1'
Apr 9 10:52:27 server01 zed: eid=201351 class=config_sync pool='disk-stiron1'
Apr 9 10:52:27 server01 zed: eid=201352 class=config_sync pool='disk-stiron1'
Apr 9 10:52:28 server01 zed: eid=201353 class=config_sync pool='disk-stiron1'
Apr 9 10:52:28 server01 zed: eid=201354 class=config_sync pool='disk-stiron1'
It kept doing config_sync for hours until I had to reboot the server. This also produces a fair amount of read operations on the pools: https://i.imgur.com/mykKkRH.png
Same issue here. Ubuntu 20.04.4 LTS
zfs-0.8.3-1ubuntu12.14
zfs-kmod-0.8.3-1ubuntu12.14
Extract from journalctl:
Jul 19 17:09:01 hockney zed[599246]: eid=384 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:01 hockney zed[599300]: eid=385 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:03 hockney zed[599456]: eid=386 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:03 hockney zed[599466]: eid=387 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:05 hockney zed[599612]: eid=388 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:05 hockney zed[599622]: eid=389 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:07 hockney zed[599765]: eid=390 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:07 hockney zed[599778]: eid=391 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:09 hockney zed[599906]: eid=392 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:09 hockney zed[599916]: eid=393 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:11 hockney zed[600090]: eid=394 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:11 hockney zed[600098]: eid=395 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:12 hockney zed[600264]: eid=396 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:13 hockney zed[600277]: eid=397 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:14 hockney zed[600407]: eid=398 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:14 hockney zed[600417]: eid=399 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:15 hockney zed[600548]: eid=400 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:16 hockney zed[600571]: eid=401 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:18 hockney zed[600720]: eid=402 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:18 hockney zed[600730]: eid=403 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:19 hockney zed[600877]: eid=404 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:20 hockney zed[600887]: eid=405 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:21 hockney zed[601034]: eid=406 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:21 hockney zed[601044]: eid=407 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:23 hockney zed[601201]: eid=408 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:23 hockney zed[601211]: eid=409 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:25 hockney zed[601356]: eid=410 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:25 hockney zed[601366]: eid=411 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:27 hockney zed[601515]: eid=412 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:27 hockney zed[601525]: eid=413 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:28 hockney zed[601676]: eid=414 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:28 hockney zed[601686]: eid=415 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:30 hockney zed[601836]: eid=416 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:30 hockney zed[601846]: eid=417 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:32 hockney zed[602004]: eid=418 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:32 hockney zed[602014]: eid=419 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:34 hockney zed[602163]: eid=420 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:34 hockney zed[602173]: eid=421 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:36 hockney zed[602310]: eid=422 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:36 hockney zed[602320]: eid=423 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:38 hockney zed[602466]: eid=424 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:38 hockney zed[602476]: eid=425 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:39 hockney zed[602631]: eid=426 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:40 hockney zed[602642]: eid=427 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:41 hockney zed[602764]: eid=428 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:41 hockney zed[602774]: eid=429 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:43 hockney zed[602925]: eid=430 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:43 hockney zed[602935]: eid=431 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:45 hockney zed[603090]: eid=432 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:45 hockney zed[603103]: eid=433 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:47 hockney zed[603246]: eid=434 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:47 hockney zed[603256]: eid=435 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:48 hockney zed[603395]: eid=436 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:49 hockney zed[603397]: eid=437 class=config_sync pool_guid=0x86C368E087026C2F
...
Issue began immediately after running sudo zpool set autoexpand=on pool1. Journalctl spam every 30mins with multiple instances as per above.
After running sudo zpool set autoexpand=off pool1 and rebooting server, log spamming has stopped.
This issue seems to have been around since at least 2017, eg https://www.reddit.com/r/zfs/comments/71nw43/zfsol_zed_syslog_spam/
As mentioned in https://github.com/openzfs/zfs/issues/7366#issuecomment-377400619, the simplest reproducer is to call sg_logs -t <dev> on a vdev with autoexpand=on and zed running. If autoexpand=off then it won't work. Example:
$ sudo ./zpool status
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
sda ONLINE 0 0 0
$ sudo sg_logs -t /dev/sda
Linux scsi_debug 0191
Current temperature = 38 C
Reference temperature = 65 C
$ sudo ./zpool events
TIME CLASS
$ sudo ./zpool set autoexpand=on tank
$ sudo sg_logs -t /dev/sda
Linux scsi_debug 0191
Current temperature = 38 C
Reference temperature = 65 C
$ sudo ./zpool events
TIME CLASS
Jul 29 2022 14:25:21.186928957 sysevent.fs.zfs.history_event
Jul 29 2022 14:25:25.703928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:25.764928957 sysevent.fs.zfs.config_sync
A minute or two later I noticed that running through these steps kicked off a "config_sync storm":
TIME CLASS
Jul 29 2022 14:25:21.186928957 sysevent.fs.zfs.history_event
Jul 29 2022 14:25:25.703928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:25.764928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:26.424928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:26.451928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:27.056928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:27.083928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:27.813928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:27.914928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:28.521928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:28.548928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:28.892928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:29.190928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:29.794928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:29.821928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:30.494928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:30.521928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:31.127928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:31.155928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:31.815928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:31.842928957 sysevent.fs.zfs.config_sync
zed was spamming these logs:
zfsdle_vdev_online: setting device 'sda' to ONLINE state in pool 'tank': 0
zfs_slm_event: EC_dev_status.dev_dle
Invoking "all-debug.sh" eid=709 pid=24785
Invoking "all-syslog.sh" eid=709 pid=24786
zfsdle_vdev_online: searching for '15968317179963686959' in 'tank'
Finished "all-syslog.sh" eid=709 pid=24786 time=0.009596s exit=0
Finished "all-debug.sh" eid=709 pid=24785 time=0.024603s exit=0
zed_udev_monitor: skip /dev/sda since it has a gpt partition already
zed_udev_monitor: /dev/sda9 sectors 16384 < 131072 (minimum)
zed_disk_event:
class: EC_dev_status
subclass: dev_dle
dev_name: /dev/sda1
path: /devices/pseudo_0/adapter0/host2/target2:0:0/2:0:0:0/block/sda/sda1
devid: scsi-1Linux_scsi_debug_6000-part1
phys_path: /dev/disk/by-uuid/16200864169707950020
dev_size: 199229440
pool_guid: 16200864169707950020
vdev_guid: 15968317179963686959
Invoking "all-debug.sh" eid=710 pid=24841
Invoking "all-syslog.sh" eid=710 pid=24842
Finished "all-syslog.sh" eid=710 pid=24842 time=0.010857s exit=0
Finished "all-debug.sh" eid=710 pid=24841 time=0.025998s exit=0
zfsdle_vdev_online: setting device 'sda' to ONLINE state in pool 'tank': 0
Invoking "all-debug.sh" eid=711 pid=24853
zfs_slm_event: EC_dev_status.dev_dle
Invoking "all-syslog.sh" eid=711 pid=24854
zfsdle_vdev_online: searching for '15968317179963686959' in 'tank'
Finished "all-syslog.sh" eid=711 pid=24854 time=0.009782s exit=0
Finished "all-debug.sh" eid=711 pid=24853 time=0.024482s exit=0
zed_udev_monitor: skip /dev/sda since it has a gpt partition already
zed_udev_monitor: /dev/sda9 sectors 16384 < 131072 (minimum)
zed_disk_event:
class: EC_dev_status
subclass: dev_dle
dev_name: /dev/sda1
path: /devices/pseudo_0/adapter0/host2/target2:0:0/2:0:0:0/block/sda/sda1
devid: scsi-1Linux_scsi_debug_6000-part1
phys_path: /dev/disk/by-uuid/16200864169707950020
dev_size: 199229440
pool_guid: 16200864169707950020
vdev_guid: 15968317179963686959
Invoking "all-debug.sh" eid=712 pid=24911
Invoking "all-syslog.sh" eid=712 pid=24912
Finished "all-syslog.sh" eid=712 pid=24912 time=0.011311s exit=0
Finished "all-debug.sh" eid=712 pid=24911 time=0.026689s exit=0
So maybe the "dev status change" event is causing zed or something else to do an operation on the vdev partition, which in turn kicks off a new "dev status change" udev event, causing the feedback loop to continue. For reference, I ran the reproducer on Fedora 36 running master.
Anyway, now that I have a reproducer I can look into a fix.
So maybe the "dev status change" event is causing zed or something else to do an operation on the vdev partition, which in turn kicks off a new "dev status change" udev event, causing the feedback loop to continue
Yep, that was the case:
<disk change event>
zfsdle_vdev_online()
zpool_vdev_online()
zpool_relabel_disk()
open(vdev)
close(vdev) <--- generates new disk change event
The easiest fix is to compare the vdev's "original size" (vs->vs_pspace) to the new size after the change event. If the vdev is the same size, then just ignore the event.
The problem persists.
Linux 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
zfs-2.0.3-9
zfs-kmod-2.0.3-9
After installing webmin or restarting webmin, the syslog is spammed with:
zed: eid=360 class=config_sync pool='zfsdata'
Interestingly, for the second pool there are no such entries. After restarting zfs-zed, the messages stop appearing.
Setting autoexpand=off also fixes the problem.
I can confirm this still happens with my system as well.
I have mitigated my issue at boot with a crontab.
@reboot sleep 30 && systemctl restart zfs-zed
However, whenever I open webmin, the issue persists until I first close webmin and then restart zed.
Ubuntu Linux 22.04.2
Same here with:
root@vm-postgresql64:/home/user# uname -a
Linux vm-postgresql64 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@vm-postgresql64:/home/user# zfs version
zfs-2.1.5-1ubuntu6~22.04.1
zfs-kmod-2.1.5-1ubuntu6~22.04.1
root@vm-postgresql64:/home/user# journalctl -l | grep zed | wc -l
584134
After 6 hours of "class=config_sync" events, the pool became unavailable:
Aug 24 05:12:06 vm-postgresql64 zed[391989]: eid=588210 class=config_sync pool='tank'
Aug 24 05:12:06 vm-postgresql64 kernel: sdc: sdc1 sdc9
Aug 24 05:12:06 vm-postgresql64 zed[392006]: eid=588212 class=statechange pool='tank' vdev=sdc1 vdev_state=UNAVAIL
Aug 24 05:12:06 vm-postgresql64 zed[392013]: eid=588211 class=vdev.unknown pool='tank' vdev=sdc1
Aug 24 05:12:06 vm-postgresql64 kernel: WARNING: Pool 'tank' has encountered an uncorrectable I/O failure and has been suspended.
Aug 24 05:12:06 vm-postgresql64 zed[392051]: eid=588213 class=io_failure pool='tank'
After reboot all is running well...