Feature Request: special_small_blocks improvements
special_small_blocks is causing me no small amount of heartburn as a ZFS administrator on a few deployments. Specifically, when I lower recordsize on a filesystem, or create a descendant filesystem with a smaller recordsize than the parent, I have to remember to lower special_small_blocks as well. Moreover, when I lower special_small_blocks, I have to remember its own rules.
Intuitively, when I see a filesystem with recordsize=128K and special_small_blocks=128K, I interpret this to mean that blocks smaller than the recordsize will go to special vdev(s), but this is not the case. special_small_blocks has to be less than recordsize or all blocks will go to special vdev(s). To me this clearly violates the principle of least astonishment.
(N.B. The complement of the above is also true for a ZFS deployment where the goal is to have most blocks go to special vdev(s) but for the exceptions. Lots of rules to remember and opportunities to make errors and have blocks go to data vdev(s) instead.)
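To make the footgun concrete, here is how the current rules play out today with real commands (the pool and dataset names are only placeholders):

```sh
# Current behavior: a block is sent to the special vdev(s) when its size is
# <= special_small_blocks, so equal values capture every block in the dataset.
zfs set recordsize=128K tank/projects
zfs set special_small_blocks=128K tank/projects   # all blocks now go to special vdev(s)

# The same trap via inheritance: the child inherits special_small_blocks=128K
# from the parent, but with recordsize=16K every block it writes qualifies.
zfs create -o recordsize=16K tank/projects/scratch
```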
Maybe it's just me, but I doubt it. This is conjecture, but if a seasoned ZFS, LVM2, and GPFS administrator such as myself is shooting himself in the foot on a regular basis, then I can't imagine what's happening in the wild.
Initially, I felt that the implementation of the special_small_blocks dataset property should be changed from "less than or equal to" to "less than". This would have been a moderately breaking change for some, but fortunately it would just mean fewer blocks going to special vdev(s) until administrative intervention. However, I know that could be a tough pill to swallow, and I have a better idea that addresses my concerns without changing the meaning of special_small_blocks.
Instead, I'd like to propose a new dataset property special_small_blocks_behavior that can take the following values:
- `normal`: current behavior (and the default value for this new property)
- `auto`: blocks smaller than `recordsize` go to special vdev(s)
- `all`: all blocks go to special vdev(s)
In all cases, the current special_small_blocks property will act as a limit on the largest block size permitted to go to special vdev(s). As you can see this neatly aligns with the current behavior when special_small_blocks_behavior is set to "normal." Furthermore, special_small_blocks must be raised from the default value of zero (0) in order for the other behaviors to take effect.
At pool creation time this opens up a very tidy solution that will make the lives of many administrators much easier as they create and update datasets with different record sizes over time:
zpool create tank \
-O recordsize=128K \
-O special_small_blocks_behavior=auto \
-O special_small_blocks=128K
Now administrators can create datasets with the inherited recordsize, datasets with recordsize=32K, datasets with recordsize=1M, etc. and in all cases get sensible behavior.
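Under the proposal this might look like the following; note that `special_small_blocks_behavior` does not exist today, and the dataset names are only illustrative:

```sh
# Hypothetical: special_small_blocks_behavior=auto is inherited from the pool
# root created above, with special_small_blocks=128K acting as the upper cap.
zfs create -o recordsize=32K tank/postgres   # blocks smaller than 32K -> special vdev(s)
zfs create -o recordsize=1M  tank/media      # blocks smaller than 1M, up to the 128K cap
zfs create tank/home                         # inherits recordsize=128K; blocks < 128K -> special
```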
Administrators that want just about everything to go to special vdev(s) without having to remember all the rules for it six months down the line now have a solution as well:
zpool create tank \
-O recordsize=16K \
-O special_small_blocks_behavior=all \
-O special_small_blocks=128K
And the current behavior continues to work as expected:
zpool create tank \
-O recordsize=128K \
-O special_small_blocks=32K
There are probably other ways to address my concerns, and I'm doing a lot of hand-waving here on the implementation, so thank you in advance for considering my proposal. I look forward to hearing thoughts from others.
Is there a use-case for sending all records to the special vdev? Because surely you'd effectively be ignoring all standard devices in the pool, such that the only data located on them is either data written before the setting was changed or data written after the special vdev becomes full?
I was just wondering if a simpler solution might be to issue some kind of warning when special_small_blocks and recordsize are the same? This wouldn't prevent it from being setup this way if desired, but would hopefully catch cases where a user accidentally changes one setting without changing the other?
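Until something like that exists, a mismatch can at least be spotted by hand; this is just a sketch using the existing `zfs get` options (pool name is illustrative):

```sh
# List both properties recursively and look for datasets where
# special_small_blocks >= recordsize (the "send everything" case).
zfs get -r -t filesystem -o name,property,value recordsize,special_small_blocks tank
```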
Well, creating a new settings variable makes what will actually happen less clear, not more.
I suggest adding the ability to add other values to special_small_blocks:
- Percent values: 0.1-99.0% of the recordsize
- Factor values: 1-32768x, i.e. the data stored in a record expressed as a factor of the pool sector size
- Keyword(s):
  - `not_max_recordsize`: any record whose size is not equal to the configured `recordsize`
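Rough sketches of what those value forms could look like; none of this syntax exists today, and `tank/ds` is just a placeholder:

```sh
# Hypothetical value forms for special_small_blocks (not implemented):
zfs set special_small_blocks=50% tank/ds                 # percent of the dataset's recordsize
zfs set special_small_blocks=8x tank/ds                  # factor of the pool sector size (e.g. 8 * 4K = 32K)
zfs set special_small_blocks=not_max_recordsize tank/ds  # any record smaller than the configured recordsize
```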
@Haravikk wrote
I was just wondering if a simpler solution might be to issue some kind of warning when `special_small_blocks` and `recordsize` are the same?
Yeah, when setting recordsize and special_small_blocks (also with the new values), it should be checked that special_small_blocks is >= the pool sector size and < recordsize.
I suggest adding the ability to add other values to special_small_blocks:
Yes, or something like special_small_blocks=32K, special_small_blocks=32K,auto, and special_small_blocks=32K,all would get the job done. Either way. As a fsadmin I simply need another knob to turn.
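Spelled out, that compound form could look something like this (again, purely hypothetical syntax on a placeholder dataset):

```sh
zfs set special_small_blocks=32K tank/ds        # current meaning: plain size threshold
zfs set special_small_blocks=32K,auto tank/ds   # cap at 32K, but track each dataset's recordsize
zfs set special_small_blocks=32K,all tank/ds    # cap at 32K and send everything up to it to special
```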
Is there a use-case for sending all records to the special vdev?
I can't be the only person using a "dummy" data vdev and a huge special vdev to work around the limitations of ZFS on fast flash...
I don't have a use case for putting everything on special, but I can imagine there's one file system where you need faster performance and want to put it entirely on flash. I would definitely like to be able to set special_small_blocks to 128K-1 or something like that.
How sensitive is performance to the block size? I have a general-purpose file server used by CS students and researchers. Is it likely that setting recordsize to 256K so I can set special_small_blocks to 128K would affect performance?
@clhedrick you may just create additional pool on flash and mount it wherever you want. You'll have all the benefits and even better flexibility.
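A minimal sketch of that separate-pool workaround; the device names and mountpoint are only placeholders:

```sh
# Build a dedicated all-flash pool and mount its root dataset where the fast
# data should live; it shares nothing with the main pool's vdevs.
zpool create -O mountpoint=/fast fastpool mirror nvme0n1 nvme1n1
```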
Any way this will get fixed? Having to set special_small_blocks to half the recordsize is definitely suboptimal.
@clhedrick you may just create additional pool on flash and mount it wherever you want. You'll have all the benefits and even better flexibility.
If there are no real-world restrictions regarding money, space in the server, ..., you have a point.
But for normal humans there might well be the use case that the admin wants to put a dataset (like a database) onto the available fast media, while at the same time being able to offload ZFS metadata onto the same physical media, without having to decide beforehand on a fixed and final distribution of the available space between these two types of data.