
ZFS 2.1.6 on Monterey 100% CPU usage when eSATA disks drop out

[Open] kneutron opened this issue 1 year ago · 1 comment

System information

Type                  Version/Name
Distribution Name     macOS Monterey
Distribution Version  12.7.1
macOS Kernel          21.6.0
Architecture          AMD64 / Mac Pro ("trashcan") late 2013, 3.5 GHz 6-core Intel Xeon E5, 32 GB RAM
ZFS Version           zfs-macOS-2.1.6-1 / zfs-kmod-2.1.6-1
SPL Version           n/a

Describe the problem you're observing

PROBOX 4-bay enclosure (a several-years-older revision) connected via eSATA through an expansion dock over a Thunderbolt 2 cable, holding 3 disks: 2x 4TB Seagate NAS and 1x 6TB Toshiba NAS, all 3.5-inch spinners.

Disks dropped out during a large chown operation on a dataset, caused by a bad eSATA cable. All CPU cores went to 100%. The zed log had entries but no timestamps; one entry documented the disks dropping out:

[ Excerpt from zed log ]
Invoking "all-syslog.sh" eid=29 pid=768
Invoking "snapshot_mount.sh" eid=29 pid=769
Diagnosis Engine: discarding 'resource.fs.zfs.snapshot_mount for vdev 0
Finished "all-syslog.sh" eid=29 pid=768 time=0.007513s exit=0
Finished "snapshot_mount.sh" eid=29 pid=769 time=0.022611s exit=127
Invoking "all-syslog.sh" eid=30 pid=777

Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 11771030784008253581
Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Retire Agent: matched vdev 11771030784008253581

Finished "all-syslog.sh" eid=30 pid=777 time=0.008458s exit=0
Invoking "all-syslog.sh" eid=31 pid=779
Invoking "data-notify.sh" eid=31 pid=780
Finished "data-notify.sh" eid=31 pid=780 time=0.007834s exit=3
Invoking "all-syslog.sh" eid=32 pid=783
Invoking "data-notify.sh" eid=32 pid=784
Finished "all-syslog.sh" eid=31 pid=779 time=0.014180s exit=0
Finished "data-notify.sh" eid=32 pid=784 time=0.008389s exit=3
Invoking "all-syslog.sh" eid=33 pid=786
Invoking "data-notify.sh" eid=33 pid=788
Finished "all-syslog.sh" eid=32 pid=783 time=0.014474s exit=0
Finished "data-notify.sh" eid=33 pid=788 time=0.007542s exit=3
Missed 1 events

All 3 zpools (each a single disk, no mirrors or raidz) went into the suspended state, but zfs/zpool commands still responded. The system was able to shut down OK from the command line without a hard reset, recovered/rebooted nicely with a replacement eSATA cable, and the chown completed OK. The system has run normally since with the good cable. I have not run a scrub yet.
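For anyone who hits the same suspended state, a rough recovery sketch (the pool name tank is a placeholder; zpool clear only helps once the device is reachable again, and a suspended pool may still need a reboot):

    # Show only unhealthy pools; a suspended pool reports state: SUSPENDED
    zpool status -x

    # After the cable/enclosure is back, ask ZFS to resume I/O on the pool
    sudo zpool clear tank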

Describe how to reproduce the problem

I was going to toss the bad eSATA cable, but I can hold onto it and ship it for testing if you like. Whether you want the Probox as well is up to you.

Include any warning/errors/backtraces from the system logs

Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 11771030784008253581
Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Retire Agent: matched vdev 11771030784008253581

The zfs zed error log is currently ~15MB, which is why I posted an excerpt. I can provide the full log if necessary.

Expected behavior: timestamps in the zed log, entries documenting when the disk(s) dropped out and any corrective action the zed daemon took, and CPU cores not redlining at 100%.
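As a stopgap until zed writes timestamps itself, one workaround is to stamp lines as they arrive while following the log; a minimal sketch, assuming the log lives at /var/log/zed.log (adjust the path for your install). Note this records arrival time at the terminal, not the true event time:

    # Follow the zed log and prefix each line with the local time
    tail -f /var/log/zed.log | while IFS= read -r line; do
      printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done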

— kneutron, Nov 10 '23 01:11

The best thing would be to run sudo spindump when it has happened; it will show which thread(s) are busy and what they are doing.
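For reference, a minimal invocation sketch; the positional arguments and default report location under /tmp are from my reading of the macOS man page, so verify locally with man spindump:

    # Sample all processes for the default 10 seconds;
    # the report is written under /tmp by default
    sudo spindump

    # Or target a busy process by name for 15 seconds, e.g. zed
    sudo spindump zed 15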

— lundman, Nov 10 '23 01:11