zfs_frag
zfs_frag wrongly sees benign local reshuffling as fragmentation
When a file is written sequentially, ZFS will, behind the scenes, not quite write the file's blocks in their logical order; blocks randomly swap places, but only locally (with nearby neighbors, not far away across the file). This is explained in openzfs/zfs#7110 (which, ironically, was reported by someone who was also writing a fragmentation reporting utility).
For example:
```
# dd if=/dev/zero of=/tmp/testfrag bs=1 count=1 oseek=100M
# zpool create testfrag /tmp/testfrag
# dd if=/dev/urandom of=/testfrag/file bs=128K count=8
# zdb -ddddd testfrag/ 0:-1:f | tee /tmp/zdb.txt
Indirect blocks:
0 L1 0:111400:400 20000L/400P F=8 B=8/8 cksum=0000009e86b4573c:0000455b84b3a1e3:00135bb3584c3975:04402f380eecbd11
0 L0 0:31400:20000 20000L/20000P F=1 B=8/8 cksum=00003fee9d75d44e:0ff78600ec03cf68:f004d6165ce2241b:877bf82c62e5ec13
20000 L0 0:11000:20000 20000L/20000P F=1 B=8/8 cksum=00003fed6858db2f:0ffff8ab5c496d24:eda87e38e8590749:4bfe71b0109f2366
40000 L0 0:91400:20000 20000L/20000P F=1 B=8/8 cksum=00003fae90951dd5:0ff7e56157ee433e:030d532f36d941ab:a3616d30d3d433a8
60000 L0 0:71400:20000 20000L/20000P F=1 B=8/8 cksum=0000402fc914e79e:1005670d8b398e52:240b8ee9a6c940bf:096195fe8566901f
80000 L0 0:51400:20000 20000L/20000P F=1 B=8/8 cksum=00003feb8c5d6ba2:0ffa6a1b820ab615:1cf80f0f84399b1c:f8a6418eca50b4dd
a0000 L0 0:b1400:20000 20000L/20000P F=1 B=8/8 cksum=00003fb156cede28:0fe9263b6c72837f:b1d1f6599fc77130:45562531b40dd3be
c0000 L0 0:d1400:20000 20000L/20000P F=1 B=8/8 cksum=0000402eb436a467:10040c7427c6c098:24c7ff03f4c4af2f:31b03a9dc6163deb
e0000 L0 0:f1400:20000 20000L/20000P F=1 B=8/8 cksum=00004009d5939207:0ff6941e8edf4ec7:170e71396c2cfe2e:5619b42348900086
```
The blocks were clearly written out of order, as shown by the disk addresses (the middle field of each vdev:offset:size triple): logical offset 0 landed at disk offset 0x31400, while the next block, logical offset 0x20000, landed earlier at 0x11000.
The reason this is not a real problem in practice is that the reshuffling only happens locally (blocks are not slung far away across the disk). When the file is read sequentially (e.g. by the prefetcher), the I/O scheduler will merge the slightly out-of-order read requests back into one big sequential read, so there is no actual performance degradation.
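To see just how local the reshuffling is, one can parse the L0 lines of the zdb output and measure how far each block lands from where a strictly in-order writer would have put it. This is a minimal sketch for illustration (the regex and the displacement metric are my own constructs, not part of zfs_frag):

```python
import re

# Matches the L0 lines of the zdb output above, e.g.:
#   20000 L0 0:11000:20000 20000L/20000P F=1 B=8/8 cksum=...
L0_RE = re.compile(r"^\s*([0-9a-f]+)\s+L0\s+\d+:([0-9a-f]+):([0-9a-f]+)\s")

def parse_l0_blocks(zdb_text):
    """Return (logical_offset, disk_offset, asize) tuples, in file order."""
    return [tuple(int(g, 16) for g in m.groups())
            for m in map(L0_RE.match, zdb_text.splitlines()) if m]

def max_displacement(blocks):
    """Largest distance (in bytes) between a block's actual disk position
    and where a strictly sequential writer would have placed it."""
    base = min(disk for _, disk, _ in blocks)
    return max(abs((disk - base) - logical) for logical, disk, _ in blocks)
```

Run over the /tmp/zdb.txt from the session above, this reports a maximum displacement of 0x40400 bytes, i.e. about two 128K blocks: no block strays more than a couple of neighbors away from its ideal position.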
The problem is that zfs_frag does not see it that way:
```
$ python3 zfs_frag.py /tmp/zdb.txt
There are 1 files.
There are 8 blocks and 7 fragment blocks.
There are 5 fragmented blocks 71.43%.
There are 2 contiguous blocks 28.57%.
Name: /file Blocks: 7 Fragmentation 71.43%
```
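Those percentages are exactly what a purely adjacency-based count produces: a transition between two logically consecutive blocks counts as contiguous only if the second block starts exactly where the first one ends on disk. Here is a minimal sketch of that heuristic (my reconstruction from the output above, not zfs_frag's actual code), using the (logical, disk, asize) tuples from the previous sketch:

```python
def naive_fragmentation(blocks):
    """Count transitions where the next block does not start exactly at
    the previous block's on-disk end (disk_offset + asize)."""
    pairs = list(zip(blocks, blocks[1:]))
    fragmented = sum(1 for (_, d1, s1), (_, d2, _) in pairs if d2 != d1 + s1)
    return fragmented, len(pairs)

# The eight L0 blocks from the zdb dump above, as (logical, disk, asize):
SAMPLE = [(0x00000, 0x31400, 0x20000), (0x20000, 0x11000, 0x20000),
          (0x40000, 0x91400, 0x20000), (0x60000, 0x71400, 0x20000),
          (0x80000, 0x51400, 0x20000), (0xa0000, 0xb1400, 0x20000),
          (0xc0000, 0xd1400, 0x20000), (0xe0000, 0xf1400, 0x20000)]

print(naive_fragmentation(SAMPLE))  # (5, 7) -> 5/7 = 71.43%, as reported
```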
Now, it is technically correct to say that the file is fragmented, but in practice it is not fragmented in a way that actually matters. This makes the output of zfs_frag somewhat misleading: most files will be reported as highly fragmented.
Ideally, zfs_frag should be made smart enough to ignore benign local reshuffling of blocks. For example, it could simulate the behavior of the ZFS prefetcher and only report fragmentation if the scheduler would likely have to seek when the file is read sequentially.
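One way to approximate that without modeling the prefetcher in full: only count a transition as a seek when the next block lands outside a small window around the previous block's on-disk end, since the scheduler can merge or cheaply reorder requests within such a window. A sketch of the idea (the four-block window is an arbitrary assumption, not a tuned value):

```python
def locality_aware_fragmentation(blocks, window_blocks=4):
    """Like naive_fragmentation, but tolerate local reshuffling.

    A transition only counts as a seek if the next block starts more than
    window_blocks block-sizes away from the previous block's on-disk end.
    A real implementation might instead derive the window from the ZFS
    prefetch distance or the I/O scheduler's merge window.
    """
    seeks = transitions = 0
    for (_, d1, s1), (_, d2, _) in zip(blocks, blocks[1:]):
        transitions += 1
        if abs(d2 - (d1 + s1)) > window_blocks * s1:
            seeks += 1
    return seeks, transitions

print(locality_aware_fragmentation(SAMPLE))  # (0, 7) with SAMPLE from above
```

On the eight blocks from the example, this reports zero seeks, matching the observation that the file would still be read back in one sequential pass.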