
zfs_frag wrongly sees benign local reshuffling as fragmentation

Open · dechamps opened this issue 6 months ago · 0 comments

When writing a file sequentially, behind the scenes ZFS will not quite write the file blocks in their original order; blocks will randomly swap places but only locally (not far away across the file). This is explained in openzfs/zfs#7110 (which, ironically, was reported by someone also writing a fragmentation reporting utility).

For example:

# dd if=/dev/zero of=/tmp/testfrag bs=1 count=1 oseek=100M
# zpool create testfrag /tmp/testfrag
# dd if=/dev/urandom of=/testfrag/file bs=128K count=8
# zdb -ddddd testfrag/ 0:-1:f | tee /tmp/zdb.txt
Indirect blocks:
               0 L1  0:111400:400 20000L/400P F=8 B=8/8 cksum=0000009e86b4573c:0000455b84b3a1e3:00135bb3584c3975:04402f380eecbd11
               0  L0 0:31400:20000 20000L/20000P F=1 B=8/8 cksum=00003fee9d75d44e:0ff78600ec03cf68:f004d6165ce2241b:877bf82c62e5ec13
           20000  L0 0:11000:20000 20000L/20000P F=1 B=8/8 cksum=00003fed6858db2f:0ffff8ab5c496d24:eda87e38e8590749:4bfe71b0109f2366
           40000  L0 0:91400:20000 20000L/20000P F=1 B=8/8 cksum=00003fae90951dd5:0ff7e56157ee433e:030d532f36d941ab:a3616d30d3d433a8
           60000  L0 0:71400:20000 20000L/20000P F=1 B=8/8 cksum=0000402fc914e79e:1005670d8b398e52:240b8ee9a6c940bf:096195fe8566901f
           80000  L0 0:51400:20000 20000L/20000P F=1 B=8/8 cksum=00003feb8c5d6ba2:0ffa6a1b820ab615:1cf80f0f84399b1c:f8a6418eca50b4dd
           a0000  L0 0:b1400:20000 20000L/20000P F=1 B=8/8 cksum=00003fb156cede28:0fe9263b6c72837f:b1d1f6599fc77130:45562531b40dd3be
           c0000  L0 0:d1400:20000 20000L/20000P F=1 B=8/8 cksum=0000402eb436a467:10040c7427c6c098:24c7ff03f4c4af2f:31b03a9dc6163deb
           e0000  L0 0:f1400:20000 20000L/20000P F=1 B=8/8 cksum=00004009d5939207:0ff6941e8edf4ec7:170e71396c2cfe2e:5619b42348900086

The blocks were clearly written out of order, as shown by the block addresses.
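To see this concretely, one can pull the (file offset, disk offset) pairs out of the zdb dump and compare the two orders. A minimal sketch (the regex and helper names are mine, not part of zfs_frag; the L0 lines are the ones from the zdb output above):

```python
import re

# The eight L0 lines from the zdb output above:
# file offset, then the DVA as vdev:offset:size, all hex.
ZDB_L0 = """\
        0  L0 0:31400:20000 20000L/20000P F=1 B=8/8
    20000  L0 0:11000:20000 20000L/20000P F=1 B=8/8
    40000  L0 0:91400:20000 20000L/20000P F=1 B=8/8
    60000  L0 0:71400:20000 20000L/20000P F=1 B=8/8
    80000  L0 0:51400:20000 20000L/20000P F=1 B=8/8
    a0000  L0 0:b1400:20000 20000L/20000P F=1 B=8/8
    c0000  L0 0:d1400:20000 20000L/20000P F=1 B=8/8
    e0000  L0 0:f1400:20000 20000L/20000P F=1 B=8/8
"""

L0_RE = re.compile(r"^\s*([0-9a-f]+)\s+L0\s+\d+:([0-9a-f]+):([0-9a-f]+)\b")

def parse_l0(text):
    """Return a list of (file_offset, disk_offset, size) per L0 line."""
    out = []
    for line in text.splitlines():
        m = L0_RE.match(line)
        if m:
            out.append(tuple(int(g, 16) for g in m.groups()))
    return out

blocks = parse_l0(ZDB_L0)
# In file order the disk offsets jump around, but once sorted they are
# nearly contiguous: the largest gap between consecutive 0x20000-byte
# extents in this dump is only 0x400 bytes.
disk = sorted(off for _, off, _ in blocks)
gaps = [disk[i + 1] - (disk[i] + 0x20000) for i in range(len(disk) - 1)]
```

So the blocks form one almost-contiguous region on disk; only their order within that region was shuffled.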

This is not a real problem in practice because the reshuffling only happens locally: blocks are not slung far away across the disk. When the file is read sequentially (e.g. by the prefetcher), the I/O scheduler merges the slightly out-of-order read requests back into one big sequential read, so no actual performance degradation occurs.

The problem is, zfs_frag does not see it that way:

$ python3 zfs_frag.py /tmp/zdb.txt 
There are 1 files.
There are 8 blocks and 7 fragment blocks.
There are 5 fragmented blocks 71.43%.
There are 2 contiguous blocks 28.57%.
Name: /file Blocks: 7 Fragmentation 71.43%

Now, while it is technically correct to say that the file is fragmented, it is not fragmented in a way that actually matters. This makes the output of zfs_frag somewhat misleading, and means most files will be reported as highly fragmented.

Ideally, zfs_frag should be made smart enough to ignore benign local reshuffling of blocks. For example, it could simulate the behavior of the ZFS prefetcher and only report fragmentation if it seems likely that the scheduler would actually have to seek when the file is read sequentially.
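One possible heuristic, sketched below: walk the blocks in file order, track the furthest disk position read so far, and count a seek only when the next block lands outside a read-ahead window of that position. The function name and the 1 MiB window are assumptions of this sketch, not real ZFS tunables or existing zfs_frag code:

```python
def count_real_seeks(blocks, window=1 << 20):
    """Estimate the seeks a sequential reader would actually incur.

    blocks: iterable of (file_offset, disk_offset, size) tuples.
    window: how far the scheduler/prefetcher is assumed to merge reads
            across (a guess, not a real ZFS tunable).

    Blocks are visited in file order; a block landing within `window`
    bytes of the furthest disk position covered so far is assumed to
    be absorbed by read-ahead and costs no seek.
    """
    seeks = 0
    frontier = None  # furthest disk byte covered so far
    for _file_off, disk_off, size in sorted(blocks):
        if frontier is not None and abs(disk_off - frontier) > window:
            seeks += 1
        end = disk_off + size
        frontier = end if frontier is None else max(frontier, end)
    return seeks

# The locally reshuffled example file above: 8 x 128K blocks whose
# disk offsets are swapped only within a ~1M neighbourhood.
local = [(0x00000, 0x31400, 0x20000), (0x20000, 0x11000, 0x20000),
         (0x40000, 0x91400, 0x20000), (0x60000, 0x71400, 0x20000),
         (0x80000, 0x51400, 0x20000), (0xa0000, 0xb1400, 0x20000),
         (0xc0000, 0xd1400, 0x20000), (0xe0000, 0xf1400, 0x20000)]

# Hypothetical genuine fragmentation: second block slung 256M away.
far = [(0x00000, 0x31400, 0x20000), (0x20000, 0x10000000, 0x20000)]
```

Under this heuristic the example file counts zero real seeks, while the genuinely fragmented case counts one, which is much closer to the behavior a sequential reader would actually see.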

dechamps · Aug 21 '24 18:08