Performance degradation and zpool import failure after power loss when pool capacity exceeds 90%
Environment
- Operating System: [Ubuntu 24.04]
- OpenZFS Version: [2.3.3, master branch]
- Hardware: [HBA unraid model, HDDs]
Issue 1: Performance Degradation
When my ZFS pool reaches approximately 90% capacity, I experience significant performance degradation, particularly with sync writes and general I/O operations. This appears to be a known behavior, but it may be an area for improvement or optimization within OpenZFS.
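For reference, a quick way to see how full and how fragmented the pool actually is (the pool name tank is a placeholder):
# overall fill level and fragmentation of the pool
zpool list -o name,size,allocated,capacity,fragmentation tank
# per-vdev breakdown
zpool list -v tank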
Issue 2: zpool import failure after power loss
I encountered a critical issue: after an unexpected power outage, my zpool import -f -F -m command failed. Importing the pool read-only did work. This occurred while the pool capacity was above 90%, which is concerning, as ZFS is designed for data integrity.
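For reference, the import that failed and the read-only import that did work were along these lines (pool name is a placeholder):
# failing import: force, attempt recovery rewind, allow missing log devices
zpool import -f -F -m tank
# read-only import, which did succeed
zpool import -o readonly=on -f tank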
Steps to reproduce (for Issue 2):
- Fill a ZFS pool beyond 90% capacity.
- Perform I/O operations.
- Simulate an unexpected power loss.
- Attempt to run zpool import (a rough command sketch follows this list).
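A rough sketch of these steps, assuming a test pool tank mounted at /tank (names and the power-cut method are illustrative):
# 1. fill the pool beyond 90% capacity
dd if=/dev/urandom of=/tank/fill.bin bs=1M      # stop once zpool list shows CAP > 90%
# 2. keep I/O running, including sync writes
dd if=/dev/urandom of=/tank/io.bin bs=4k oflag=sync &
# 3. cut power to the machine (hard power-off, not a clean shutdown)
# 4. after reboot, attempt the import
zpool import -f -F -m tank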
Expected behavior: The pool should import successfully with minimal or no data loss, as one would expect from ZFS.
Actual behavior: The zpool import -f -F -m command fails, preventing access to the data.
I am looking for guidance, potential workarounds, or confirmation if this is a known bug.
While very high pool utilization can lead to higher fragmentation and thus lower performance, what you describe is not the general case. It might even be a combination of many factors, but you provide no data for diagnostics: how the import failed, any messages on the console or in dbgmsg, etc. The pool might have been corrupted earlier, and the reboot was just the occasion to notice it. There are reasons for the recommendations of ECC RAM and reliable hardware.
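For reference, the kind of data being asked for can be gathered with standard tooling on Linux (paths are the usual OpenZFS locations):
# kernel messages from the failed import
dmesg | grep -i zfs
# OpenZFS internal debug log
cat /proc/spl/kstat/zfs/dbgmsg
# make sure the debug log is enabled before retrying the import
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable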
The zpool import -f -F -m command got stuck indefinitely, and top showed the system load at its maximum. After waiting more than ten hours, the hard drive activity lights stopped flashing, and dmesg reported a ZFS-related panic (the exact message is no longer available, because the zpool has since been destroyed and the situation cannot be reproduced).
The hardware is a Dell R730 with a hardware RAID card and 64 GB of ECC RAM; the zpool is less than 30 TB.
I also moved the pool to a more powerful server and tried importing it there; every attempt failed.
Would you have a screenshot of the panic? Maybe then we would have something to talk about.
That seems to be the first thing that happens:
Nov 26 15:28:13 pve1 kernel: WARNING: ZFS read log block error 6, dataset mos, seq 0x40e42dd
Nov 26 15:28:15 pve1 kernel: WARNING: ZFS read log block error 6, dataset mos, seq 0xd02
Nov 26 15:28:16 pve1 kernel: WARNING: ZFS read log block error 6, dataset fly, seq 0x40e42dd
Nov 26 15:28:16 pve1 kernel: WARNING: ZFS read log block error 6, dataset fly/share, seq 0xd02
After that there are hung ZFS tasks like:
Nov 26 15:32:18 pve1 kernel: INFO: task zpool:9593 blocked for more than 122 seconds.
...
Nov 26 15:32:18 pve1 kernel: INFO: task txg_sync:11307 blocked for more than 122 seconds.
...
Nov 26 15:34:20 pve1 kernel: INFO: task zpool:9593 blocked for more than 245 seconds.
which culminates in a panic:
Nov 26 15:35:13 pve1 kernel: PANIC: zfs: adding existent segment to range tree (offset=1c367610000 size=2000)
and further hung tasks until the system is shut down.
So the root cause seems to be the read log block errors. I don't know what those mean, just saying ;-)
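For what it's worth, the usual last-resort path for the "adding existent segment to range tree" panic is to turn that class of panic into a warning with the zfs_recover tunable and import read-only, so data can at least be copied off. This is a recovery sketch under those assumptions, not a fix, and these tunables are meant only as a last resort (pool name is a placeholder):
# turn zfs_panic_recover() assertions (such as the range-tree one) into warnings
echo 1 > /sys/module/zfs/parameters/zfs_recover
# optionally skip ZIL replay, which is where the "read log block error" messages come from
echo 1 > /sys/module/zfs/parameters/zil_replay_disable
# import read-only and copy the data elsewhere
zpool import -o readonly=on -f tank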