solo5
solo5 copied to clipboard
block: add support for discard
The idea is to support operations compatible with mirage-block-unix's discard:
discard sector n signals that the n sectors starting at sector are no longer needed and the contents may be discarded. Note the contents may not actually be deleted: this is not a "secure erase".
Which I would update to say this (matching the Unix implementation):
discard sector n signals that the n sectors starting at sector are no longer needed and the contents may be discarded. Future reads will return zeroes. Note that the contents may not actually be impossible to recover: this is not a "secure erase".
Yes. You probably want to look at #325 first though.
Since this interface should be sector aligned, should we break consistency and take offsets and length as sectors? Would it simplify the checking the spt backend would need to do in case the implementation is specialized there (re https://github.com/Solo5/solo5/issues/325#issuecomment-465483894)?
That depends on what the backend implementation needs to do. As I've never needed to invoke a discard
-type op from Linux, I don't know.
The Solo5 interface I would prefer is:
solo5_result_t solo5_block_discard(solo5_off_t offset, size_t size)
with the usual restrictions (offset
must be block-size-aligned, size
must be a multiple of block-size).
If you find out what the backend(s) actually need(s) to do to implement a discard (i.e. at least the hvt
tender and spt
bindings) I'll be able to tell you more.
With spt bindings on Linux, this will use fallocate with FALLOC_FL_PUNCH_HOLE
and fallback to zeroing manually if the filesystem doesn't support it.
I'm not sure about tenders/hvt because that should just validate and pass params?
I've also looked at virtio (spec amendment); it would use the VIRTIO_BLK_T_WRITE_ZEROES command with the unmap bit set (with some fallbacks which are easy to add). The offsets and lengths are expressed in 512-byte units, but the alignment requirement may be greater.
(just to document the virtio fallbacks while the spec is fresh on my mind: if VIRTIO_BLK_F_WRITE_ZEROES is not supported, do the whole thing manually. Otherwise, call VIRTIO_BLK_T_WRITE_ZEROES with unmap set to whether VIRTIO_BLK_F_DISCARD is supported)
With spt bindings on Linux, this will use fallocate with FALLOC_FL_PUNCH_HOLE and fallback to zeroing manually if the filesystem doesn't support it.
That man page does not actually mention anything about the call issuing a discard
(or TRIM) to the hardware. Do you have a source for this?
Anyway, so the Linux backends need to issue a fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, offset, len)
call. For spt, the seccomp filter needs to ensure that SYS_fallocate
can only be called with that particular value of mode
. If I understand the behaviour as documented correctly, the filter should not need to check offset
or len
.
For hvt, you need to create a hypercall and back it with that implementation.
Before you actually start on this, can you check with the *BSD folks you have around at the retreat what the equivalent calls are there?
Oh. One more thing, just re-reading the manpage. It only talks about filesystem-backed fds. How do we do the equivalent if the fd
is pointing directly to a block device (this is a valid and important use-case!).
Linux also implements it for block devices: See blkdev_fallocate in https://github.com/torvalds/linux/blob/master/fs/block_dev.c
At any rate, the logic for this will be the same regardless of what the backing store is; if ENOTSUP is returned, write zeroes manually.
I don't think this exists on BSD per this recent thread: https://freebsd-arch.freebsd.narkive.com/JKZnzc4p/hole-punching-trim-etc But since we need fallback logic regardless, that is not too problematic.
That man page does not actually mention anything about the call issuing a discard (or TRIM) to the hardware. Do you have a source for this?
That's implementation and configuration (mount options, actual device capabilities) dependent. It's also an optimization and an implementation detail.
At any rate, the logic for this will be the same regardless of what the backing store is; if ENOTSUP is returned, write zeroes manually.
Right, just make sure that you do that in a single pwrite()
call, that way you won't have to modify the pwrite
part of the seccomp filter for spt.
@g2p Moving your questions from Slack here, as I'd like the design discussion to be in one place:
What would I use to allocate a buffer of zeroes from the bindings side? [...] mem_ialloc_pages + memset?
Correct.
The buffer size would be bounded, but to a fairly large value. I'm planning on using 1M for transfer efficiency, or the block size, whichever is greater.
I don't think keeping a 1M buffer around is an acceptable cost to pay just for the sake of being able to emulate discard.
Stepping back a bit, do we need to emulate it at all? Why? In my mind, discard is just an "optimization" for SSDs and/or thinly provisioned storage. If the host doesn't support it, can't it just be a no-op?
I need it to actually ensure later reads return zeroes. If there's an error code that can unambiguously mean ENOTSUP, I'd be okay moving any fallback emulation to a layer higher than Solo5. Or an out parameter. Unfortunately it can't be done with a block_info flag as some of the implementations are based on ENOTSUP return codes, which means you don't know ahead of time.
With further thought, I'd be good with flags that ask for zeroing or not, in addition to the out parameter and higher-level fallback logic mentioned above. I'm not sure if I would expose the additional flags to the mirage interface. They might be good for future-proofing, I guess, but will trigger breakage on the mirage-block-unix interface.
I need it to actually ensure later reads return zeroes.
Ok. It seems to me then that the simplest possible interface at the Solo5 layer is that mentioned in my previous comment (https://github.com/Solo5/solo5/issues/329#issuecomment-470601006), with the addition of a SOLO5_R_EOPNOTSUPP
to solo5_result_t
, which solo5_block_discard()
should return if the host/implementation does not support the operation.
In other words, no emulation, and the application flow becomes:
r = solo5_block_discard(offset, size);
if (r == SOLO5_R_EOPNOTSUPP) {
/* It is now the application's responsibility to fall back to writing zeroes to (offset, size), if it desires that behaviour */
}
I'd be good with flags that ask for zeroing or not,
I'm not sure what you mean by that, but if you mean adding a "is discard supported" flag to the struct returned by solo5_block_info()
, I don't think there is anything the application gains in the above case?
I like the EOPNOTSUPP approach very much, will do that.
Re an explicit "ask for zeroing" flag (as opposed to discarding and having uninitialized data appear), it's not something I plan to use, and after looking at the complexity of implementing both code paths in virtio, I don't think it's worth it. We'll either offer safe and fast zeroing, or let the application deal.
Sounds good to me.
I've started testing my virtio implementation, but it looks like Qemu doesn't support discard for virtio-blk, only for scsi. I don't know of another supported VMM on Linux so I'll drop that code for now. Still making progress on the spt/hvt backends.
crosvm has support for the feature: https://chromium.googlesource.com/chromiumos/platform/crosvm/+/7621d910f56ff85400b252f88fdef324a1cc13d6%5E%21/#F0 via https://bugs.chromium.org/p/chromium/issues/detail?id=850998 .
Not sure how hard it would be to integrate.
I've built and run crosvm, but I only get messages from early boot, I don't think it can run Solo5's kernel. So the virtio bits will have to be dropped for now.