solo5 block: add support for discard

The idea is to support operations compatible with mirage-block-unix's discard:

discard sector n signals that the n sectors starting at sector are no longer needed and the contents may be discarded. Note the contents may not actually be deleted: this is not a "secure erase".

Which I would update to say this (matching the Unix implementation):

discard sector n signals that the n sectors starting at sector are no longer needed and the contents may be discarded. Future reads will return zeroes. Note that the contents may not actually be impossible to recover: this is not a "secure erase".

Mar 07 '19 14:03 g2p

Yes. You probably want to look at #325 first though.

Mar 07 '19 15:03 mato

Since this interface should be sector aligned, should we break consistency and take offsets and length as sectors? Would it simplify the checking the spt backend would need to do in case the implementation is specialized there (re https://github.com/Solo5/solo5/issues/325#issuecomment-465483894)?

Mar 07 '19 16:03 g2p

That depends on what the backend implementation needs to do. As I've never needed to invoke a discard-type op from Linux, I don't know.

The Solo5 interface I would prefer is:

solo5_result_t solo5_block_discard(solo5_off_t offset, size_t size)

with the usual restrictions (offset must be block-size-aligned, size must be a multiple of block-size).

If you find out what the backend(s) actually need(s) to do to implement a discard (i.e. at least the hvt tender and spt bindings) I'll be able to tell you more.

Mar 07 '19 16:03 mato

With spt bindings on Linux, this will use fallocate with FALLOC_FL_PUNCH_HOLE and fallback to zeroing manually if the filesystem doesn't support it.

I'm not sure about tenders/hvt because that should just validate and pass params?

I've also looked at virtio (spec amendment); it would use the VIRTIO_BLK_T_WRITE_ZEROES command with the unmap bit set (with some fallbacks which are easy to add). The offsets and lengths are expressed in 512-byte units, but the alignment requirement may be greater.

Mar 07 '19 17:03 g2p

(just to document the virtio fallbacks while the spec is fresh on my mind: if VIRTIO_BLK_F_WRITE_ZEROES is not supported, do the whole thing manually. Otherwise, call VIRTIO_BLK_T_WRITE_ZEROES with unmap set to whether VIRTIO_BLK_F_DISCARD is supported)

Mar 07 '19 18:03 g2p

With spt bindings on Linux, this will use fallocate with FALLOC_FL_PUNCH_HOLE and fallback to zeroing manually if the filesystem doesn't support it.

That man page does not actually mention anything about the call issuing a discard (or TRIM) to the hardware. Do you have a source for this?

Anyway, so the Linux backends need to issue a fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, offset, len) call. For spt, the seccomp filter needs to ensure that SYS_fallocate can only be called with that particular value of mode. If I understand the behaviour as documented correctly, the filter should not need to check offset or len.

For hvt, you need to create a hypercall and back it with that implementation.

Before you actually start on this, can you check with the *BSD folks you have around at the retreat what the equivalent calls are there?

Mar 08 '19 09:03 mato

Oh. One more thing, just re-reading the manpage. It only talks about filesystem-backed fds. How do we do the equivalent if the fd is pointing directly to a block device (this is a valid and important use-case!).

Mar 08 '19 09:03 mato

Linux also implements it for block devices: See blkdev_fallocate in https://github.com/torvalds/linux/blob/master/fs/block_dev.c

At any rate, the logic for this will be the same regardless of what the backing store is; if ENOTSUP is returned, write zeroes manually.

Mar 08 '19 10:03 g2p

I don't think this exists on BSD per this recent thread: https://freebsd-arch.freebsd.narkive.com/JKZnzc4p/hole-punching-trim-etc But since we need fallback logic regardless, that is not too problematic.

Mar 08 '19 10:03 g2p

That man page does not actually mention anything about the call issuing a discard (or TRIM) to the hardware. Do you have a source for this?

That's implementation and configuration (mount options, actual device capabilities) dependent. It's also an optimization and an implementation detail.

Mar 08 '19 10:03 g2p

At any rate, the logic for this will be the same regardless of what the backing store is; if ENOTSUP is returned, write zeroes manually.

Right, just make sure that you do that in a single pwrite() call, that way you won't have to modify the pwrite part of the seccomp filter for spt.

Mar 08 '19 16:03 mato

@g2p Moving your questions from Slack here, as I'd like the design discussion to be in one place:

What would I use to allocate a buffer of zeroes from the bindings side? [...] mem_ialloc_pages + memset?

Correct.

The buffer size would be bounded, but to a fairly large value. I'm planning on using 1M for transfer efficiency, or the block size, whichever is greater.

I don't think keeping a 1M buffer around is an acceptable cost to pay just for the sake of being able to emulate discard.

Stepping back a bit, do we need to emulate it at all? Why? In my mind, discard is just an "optimization" for SSDs and/or thinly provisioned storage. If the host doesn't support it, can't it just be a no-op?

Mar 09 '19 12:03 mato

I need it to actually ensure later reads return zeroes. If there's an error code that can unambiguously mean ENOTSUP, I'd be okay moving any fallback emulation to a layer higher than Solo5. Or an out parameter. Unfortunately it can't be done with a block_info flag as some of the implementations are based on ENOTSUP return codes, which means you don't know ahead of time.

Mar 09 '19 12:03 g2p

With further thought, I'd be good with flags that ask for zeroing or not, in addition to the out parameter and higher-level fallback logic mentioned above. I'm not sure if I would expose the additional flags to the mirage interface. They might be good for future-proofing, I guess, but will trigger breakage on the mirage-block-unix interface.

Mar 09 '19 14:03 g2p

I need it to actually ensure later reads return zeroes.

Ok. It seems to me then that the simplest possible interface at the Solo5 layer is that mentioned in my previous comment (https://github.com/Solo5/solo5/issues/329#issuecomment-470601006), with the addition of a SOLO5_R_EOPNOTSUPP to solo5_result_t, which solo5_block_discard() should return if the host/implementation does not support the operation.

In other words, no emulation, and the application flow becomes:

r = solo5_block_discard(offset, size);
if (r == SOLO5_R_EOPNOTSUPP) {
    /* It is now the application's responsibility to fall back to writing zeroes to (offset, size), if it desires that behaviour */
}

I'd be good with flags that ask for zeroing or not,

I'm not sure what you mean by that, but if you mean adding a "is discard supported" flag to the struct returned by solo5_block_info(), I don't think there is anything the application gains in the above case?

Mar 10 '19 13:03 mato

I like the EOPNOTSUPP approach very much, will do that.

Re an explicit "ask for zeroing" flag (as opposed to discarding and having uninitialized data appear), it's not something I plan to use, and after looking at the complexity of implementing both code paths in virtio, I don't think it's worth it. We'll either offer safe and fast zeroing, or let the application deal.

Mar 10 '19 13:03 g2p

Sounds good to me.

Mar 10 '19 13:03 mato

I've started testing my virtio implementation, but it looks like Qemu doesn't support discard for virtio-blk, only for scsi. I don't know of another supported VMM on Linux so I'll drop that code for now. Still making progress on the spt/hvt backends.

Mar 10 '19 15:03 g2p

crosvm has support for the feature: https://chromium.googlesource.com/chromiumos/platform/crosvm/+/7621d910f56ff85400b252f88fdef324a1cc13d6%5E%21/#F0 via https://bugs.chromium.org/p/chromium/issues/detail?id=850998 .

Not sure how hard it would be to integrate.

Mar 10 '19 16:03 g2p

I've built and run crosvm, but I only get messages from early boot, I don't think it can run Solo5's kernel. So the virtio bits will have to be dropped for now.

Mar 10 '19 17:03 g2p

solo5 solo5 copied to clipboard

block: add support for discard

solo5
solo5 copied to clipboard