solo5 Enable block operations for more than 1 block

This discussion started in PR #315

The issue is that the solo5 interface is limited to a block at a time and the block is too small. And the problem with that is that it's inefficient as it usually involves too many exits for hvt, or too many syscalls for spt. The following experiment quantifies that. This is the time in seconds to read a 1GB file sequentially for an increasing block size on both hvt and spt (*):

block		hvt		spt
512		5.690s		0.463s
1024		2.881s		0.473s
4096		0.776s		0.171s
8192		0.434s		0.126s
16384		0.259s		0.103s

spt is already pretty fast at 512, but it can be 4x faster by increasing the block size to 8192. This experiment's point is not to increase the block size necessarily, but to allow for multiblock requests instead. FS are already trying to perform IO in larger blocks (usually 4096 Bytes): it would be nice to not split them.

(*) details of the experiment. This is the unikernel: https://gist.github.com/ricarkol/b66c899edd96fd7f8fb2fbaeabad0694#file-solo5-blksize-test-c. This is how the blk size was changed: https://gist.github.com/ricarkol/b66c899edd96fd7f8fb2fbaeabad0694#file-blk-size-diff. Used an Ubuntu 18.04.1 running on an Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz. The disk was stored as a file on an SSD formatted with ext4 (caches were not dropped before each experiment).

Feb 19 '19 19:02 ricarkol

The issue is that the solo5 interface is limited to a block at a time and the block is too small. And the problem with that is that it's inefficient as it usually involves too many exits for hvt, or too many syscalls for spt. The following experiment quantifies that. [...]

The following comments from #315 are also relevant to this discussion:

@ricarkol https://github.com/Solo5/solo5/pull/315#issuecomment-457950289
@cfcs https://github.com/Solo5/solo5/pull/315#issuecomment-458069807

Regarding atomicity guarantees and the "ideal block size" at the solo5_block API level: I'd have to study the problem in depth, so I can't give any informed opinion right now.

So, concentrating on the core of the issue which could be immediately addressed, which is allowing block I/O in multiples of the block size per request: In order to do that, the following needs to be done:

hvt, spt: If the solo5 virtual block device is backed by a file on the host, we need to ensure that the guest cannot write beyond the end of the file. I.e. in any pwrite(fd, buf, count, offset) ensure that (offset + count) <= capacity holds, accounting for overflow(!).
- for hvt, this is straightforward as the tender can do the check in the hypercall implementation
- for spt, I don't think this kind of filter can be expressed with libseccomp, but I may be wrong (google around for code using libseccomp ...). If not, it would imply dropping libseccomp, writing and installing the BPF filter manually which is a fair chunk of work.
virtio: AFAIR our virtio-block code does not support writes of >1 sector, so it would need to be updated to do that, and tested.

Summary: We can enable I/O at >1 block, aligned to the block size at the block layer, but the above points need to be resolved. Regarding atomicity guarantees, I don't think enabling this would make things any "worse" (or less ill-defined) than they are now.

However, someone needs to do the above work. I don't have time for this in the foreseeable future, but am happy to review patches.

Feb 20 '19 08:02 mato

@mato In #341 I just did what you proposed and dropped to BPF. The solution I have there seems to pass basic tests.

Mar 26 '19 07:03 minad

I've thought about this a bit more, especially the implications for spt's seccomp filter in the light of multiple device support. As mentioned previously, the following

in any pwrite(fd, buf, count, offset) ensure that (offset + count) <= capacity holds, accounting for overflow

cannot be expressed using libseccomp.

However, we can express rules that verify:

(count <= X), where X is a multiple of the block size AND
(offset <= (capacity - block_size)).

So, if we were to define the "maximum I/O unit" as, say, X = 16kB, then the most a malicious unikernel could write beyond/extend the end of the file by would be (X - block_size) bytes. While "annoying", this is not exactly a fatal security or DoS issue.

@ricarkol What do you think? What I'm essentially saying is that we punt on trying for a 100% correct solution for now. (See also my comments about hand-written BPF at https://github.com/Solo5/solo5/pull/352#issuecomment-505037870 )

Jun 24 '19 14:06 mato

@mato I think that sounds like a reasonable solution. I'm not too fond of the idea of having to deal with BPF macro code here either.

Some other semi-related points:

Upstreaming a patch to libseccomp (src/gen_bpf.c looks ok-ish to modify) to let it take other syscall arguments as "data"/operands would be useful for a lot of people, and would solve this problem too.
For Linux (where this is relevant for SPT) there's setrlimit(RLIMIT_FSIZE, ..), which doesn't deal with individual files, but can be used to set an upper bound for all accessed files.

Jun 24 '19 18:06 cfcs

@mato, sounds good. And x=16kb seems like a good "maximum I/O unit". Additionally, we could also waste the last 16kbs, and have the second rule as:

(offset <= (capacity - X)).

@cfcs, changing libseccomp would be even better.

Jun 24 '19 18:06 ricarkol

@ricarkol making the accessible amount smaller works well for some cases (preventing the allocation of extra extents; preventing attempts to write past block device limits which will result in an error anyway), but it works very poorly in others (particularly when the unikernel expects an 8K (<=16k) device and finds itself unable to write anything at all).

I think the latter is less intuitive than the former and likely to result in more bug reports, even if this is clearly a trade-off, and the proper solution would be to generate the correct BPF code that we all agree should be there.

Jun 25 '19 00:06 cfcs

@ricarkol:

I agree with @cfcs that "wasting" the last X kB seems like asking for trouble, that's very counter-intuitive. Also, I expect the seccomp issue to be fixed properly at some future point, so would not want to then change the behaviour.

Regarding the actual value of X (maximum I/O unit): I think we should actually make it part of the API (by adding it to struct solo5_block_info and the block device properties). This means that we can allow implementations (bindings) to implement different limits, which in the short term will be useful at least for virtio (where we'd have to re-work the code a bit to allow writes of >1 block). In the longer-term, I can also foresee cases where we would want to have some implementation-defined limit in place.

Regarding the actual "maximum I/O unit" value for hvt and spt, any preferences? Based on your experiments, 4k or 8k seems reasonable.

Jun 25 '19 15:06 mato

@g2p Any comments on this, especially regarding an optimum "maximum I/O unit" size for wodan?

Jun 25 '19 15:06 mato

Wodan would ideally use a much larger IO size of up to 4MiB (the size of a SSD's erase block). I'd be fine with wasting any unaligned end to the device, because Wodan wouldn't use it.

Jun 26 '19 17:06 g2p

@g2p The thing is, as things stand today all our APIs involve copying. So, ignoring the issues with seccomp, a large (e.g. 4MiB) block size is not practical as it would need at least that much fixed buffer space in the Solo5 layer (i.e. per unikernel). If you want to pack large numbers of unikernels on a single server, those numbers would quickly add up. Now, 512 bytes is obviously too small, which is why I suggested a compromise of 4k or 8k.

Jun 27 '19 11:06 mato

#528 improve the situation where, at least, we have an argument which allows us to specify how large the chunk is. However, as @mato pointed out, a question remains about the hvt and the spt tenders where a check must be done about a possible out of bounds access of the block-size. Currently, even with #528, we don't do such check - at least, specially for spt.

I merged #528 because it unlocks some performance issues about our access to block devices. But I will not consider this issue as solved. Moreover, we definitely should go deeper now on this side because it breaks a bit the conservative design of Solo5.

Nov 04 '22 15:11 dinosaure

Hi, I think with #528 we're doing great. The seccomp rules setup allow to pread64/pwrite64 only on the specific file descriptor (that was never questioned), with the length (ARG2) being the block_basic.block_size, and the offset (ARG3) being <= capacity - block_size. Thus we cannot ever read beyond the end of the file (esp. with the additional check in block_attach that the file is actually a multiple of the block size).

Nov 04 '22 15:11 hannesm

solo5 solo5 copied to clipboard

Enable block operations for more than 1 block

solo5
solo5 copied to clipboard