
Support recordsize that's a multiple of 4k but not a power of two.

strigeus opened this issue 6 months ago • 9 comments

Describe the feature you would like to see added to OpenZFS

Support recordsize that's a multiple of 4k but not a power of two.

How will this feature improve OpenZFS?

When using raidz2 with a suboptimal number of disks and a smallish recordsize such as 128k, the space lost to raidz2 block padding can reach 5-10%, which is especially noticeable when storing relatively large incompressible files.

If we could set recordsize to an optimal number (say 120k or 144k depending on the # of disks), the raidz2 padding overhead would become close to 0%.
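For reference, the padding overhead can be estimated with a small helper that mirrors (to my understanding) the allocation logic of vdev_raidz_asize() in OpenZFS. This is only an illustrative sketch of mine, not code from the tree:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative sketch of raidz allocation: data sectors, plus nparity
 * parity sectors per row of (ndisks - nparity) data columns, rounded up
 * to a multiple of (nparity + 1). Not code from the OpenZFS tree.
 */
static uint64_t
raidz_alloc_sectors(uint64_t psize, uint64_t ndisks, uint64_t nparity,
    uint64_t ashift)
{
	uint64_t asize = (psize + (1ULL << ashift) - 1) >> ashift; /* data */
	uint64_t ndata = ndisks - nparity;

	asize += nparity * ((asize + ndata - 1) / ndata);            /* parity */
	asize = ((asize + nparity) / (nparity + 1)) * (nparity + 1); /* padding */
	return (asize);
}

int
main(void)
{
	uint64_t recordsizes[] = { 128 * 1024, 120 * 1024 };

	for (int i = 0; i < 2; i++) {
		/* 7-wide raidz2, ashift=12 (4k sectors) */
		uint64_t alloc = raidz_alloc_sectors(recordsizes[i], 7, 2, 12) << 12;
		double ideal = (double)recordsizes[i] * 7 / 5; /* data + parity only */

		printf("recordsize %3llu KiB -> allocated %3llu KiB (overhead %.2f%%)\n",
		    (unsigned long long)(recordsizes[i] / 1024),
		    (unsigned long long)(alloc / 1024),
		    ((double)alloc / ideal - 1.0) * 100.0);
	}
	return (0);
}
```

On a 7-wide raidz2 with ashift=12 this works out to roughly 7% overhead at 128 KiB and ~0% at 120 KiB.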

Additional context

Obviously it should only be enabled on datasets where the possible performance penalty of a non-power-of-two recordsize is low.

strigeus avatar Jun 04 '25 05:06 strigeus

Block and record sizes inside OpenZFS are nearly always stored and manipulated as bit shifts, not absolute sizes, and those shifts are stored on disk as well. That alone would make this change pretty much a non-starter.
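(To illustrate the point with a toy snippet of my own, not OpenZFS code: a shift can only round-trip power-of-two sizes, so something like 120 KiB has no exact shift-based representation.)

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy illustration: a bit shift can only represent power-of-two sizes. */
static bool
blocksize_to_shift(uint64_t size, uint8_t *shift)
{
	if (size == 0 || (size & (size - 1)) != 0)
		return (false);	/* 122880 (120 KiB): no exact shift exists */

	uint8_t s = 0;
	while ((1ULL << s) < size)
		s++;
	*shift = s;		/* 131072 (128 KiB) -> 17, since 1 << 17 == 131072 */
	return (true);
}
```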

robn avatar Jun 04 '25 05:06 robn

I don't know a lot of ZFS internals, but I do notice some uses of dn_datablkshift. However, there don't seem to be that many, and perhaps changing them could be worth 5-10% in space savings.

Are there many other such shifts besides those in dnode/dmu etc.?

strigeus avatar Jun 04 '25 05:06 strigeus

On disk they appear to be stored in multiples of 512 bytes (LSIZE) - or am I missing something there?

And also this in dnode_phys: uint16_t dn_datablkszsec; /* data block size in 512b sectors */
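Worked out for the two sizes being discussed (my own arithmetic, based only on the field quoted above):

```c
/*
 * dn_datablkszsec holds the block size in 512-byte sectors:
 *
 *   128 KiB = 131072 B -> 131072 / 512 = 256 sectors (also exactly 1 << 17)
 *   120 KiB = 122880 B -> 122880 / 512 = 240 sectors (no exact 1 << n)
 *
 * So the 512-byte-sector encoding itself could express 120 KiB; the obstacle
 * described above is the shift-based handling elsewhere in the code.
 */
```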

strigeus avatar Jun 04 '25 05:06 strigeus

I am not saying it is completely impossible, but it may be quite a messy change. On the other hand, I am not sure it is really productive to go that way, optimizing for small recordsizes on wide RAIDZ2. Sure, we could possibly improve space efficiency, but at that point we would end up with tons of 4KB I/Os per leaf vdev. It would thrash either the disks or the I/O aggregation code in the ZFS I/O scheduler. If you want to store small blocks, use narrower vdevs, so that each disk receives at least 16KB. At that point space efficiency is no longer a problem.

amotin avatar Jun 04 '25 13:06 amotin

If you want to store small blocks, use narrower vdevs, so that each disk receives at least 16KB. At that point space efficiency is no longer a problem.

Not sure I follow. If I use raidz2 with 7 disks and a 128 kB recordsize, each disk receives around 24 kB (which is more than 16 kB), yet the space overhead is 7.14% according to the "ZFS overhead calc" spreadsheet. So space efficiency is not simply a matter of each disk receiving 16 kB or more.

Had the recordsize instead been 120 kB, the space overhead would drop from 7.14% to 0.0%.
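For reference, the sector-level breakdown behind those two numbers (my own arithmetic, assuming ashift=12, consistent with the spreadsheet):

```c
/*
 * 7-wide raidz2, ashift=12: each row holds 5 data + 2 parity sectors,
 * and the allocation is padded to a multiple of (nparity + 1) = 3 sectors.
 *
 * 128 KiB record = 32 data sectors
 *   rows  = ceil(32 / 5) = 7   -> 14 parity sectors
 *   total = 32 + 14 = 46       -> padded to 48 sectors = 192 KiB
 *   ideal = 32 * 7 / 5 = 44.8  -> 48 / 44.8 - 1 = 7.14% overhead
 *
 * 120 KiB record = 30 data sectors
 *   rows  = 30 / 5 = 6 (exact) -> 12 parity sectors
 *   total = 30 + 12 = 42       -> already a multiple of 3 = 168 KiB
 *   ideal = 30 * 7 / 5 = 42    -> 0.0% overhead
 */
```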

strigeus avatar Jun 04 '25 15:06 strigeus

So space efficiency is not simply a matter of each disk receiving 16 kB or more.

It is related. The more data each disk gets, the less the fullness of the last row matters.

amotin avatar Jun 04 '25 15:06 amotin

On a 7-disk raidz2, a 120 kB recordsize instead of a 128 kB recordsize does not result in a massive increase in 4k writes.
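Rough per-disk I/O counts for the two cases (again my own arithmetic, ashift=12):

```c
/*
 * 7-wide raidz2, ashift=12:
 *
 *   128 KiB record -> 48 allocated sectors over 7 disks ~ 6-7 x 4k per disk
 *   120 KiB record -> 42 allocated sectors over 7 disks  = 6   x 4k per disk
 *
 * The per-disk write pattern is essentially the same, so the smaller
 * recordsize does not by itself explode the number of 4k I/Os.
 */
```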

strigeus avatar Jun 04 '25 18:06 strigeus

But for about 7% of space I would not care enough to go patching the code or fine-tuning settings, especially since enabling any compression would make the whole exercise pointless anyway.

amotin avatar Jun 04 '25 18:06 amotin

I get that for many workloads a 7% space loss isn't worth the code complexity or tuning - especially with compression. But at the same time there are plenty of other use cases with data that can't be compressed further, and at scale that 7% can translate into tens of terabytes of waste.

strigeus avatar Jun 04 '25 19:06 strigeus