Use `writev` for writing guest outgoing packets to the tap device
Avoid extra memory copy when writing to tap device in virtio-net TX path
Currently, when performing TX, a network frame might exist scattered across various memory regions described by a `DescriptorChain` that we receive from the guest. In order to write the frame to the tap device, we first perform a number of copies from these memory buffers into a single buffer, which we then write to the tap file descriptor.
This PR changes that to use scatter-gather I/O via the `writev` system call, which avoids the intermediate memory copies.
Fixes #420
Description of Changes
The main blocker for using the `writev` system call in our device is that we need to be able to inspect the outgoing frame to check whether it is destined for `mmds`. This requires us to read at least the frame headers, which makes the code quite convoluted.
In order to do this in a clean way, I introduce an `IoVec` struct which is essentially a `Vec` of `std::io::IoSlice`. The struct can be created from a `DescriptorChain` (without performing any copies) and can be used directly to perform the `writev` system call.
Moreover, it provides methods for reading ranges of bytes from it (which at the moment perform a copy); this functionality is used to inspect the ranges corresponding to the destination addresses, in order to check whether the frame is destined for `mmds`.
- [ ] This functionality can be added in `rust-vmm`.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
PR Checklist
[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]
- [x] All commits in this PR are signed (`git commit -s`).
- [x] The issue which led to this PR has a clear conclusion.
- [x] This PR follows the solution outlined in the related issue.
- [x] The description of changes is clear and encompassing.
- [x] Any required documentation changes (code and docs) are included in this PR.
- [x] Any newly added `unsafe` code is properly documented.
- [x] Any API changes follow the Runbook for Firecracker API changes.
- [x] Any user-facing changes are mentioned in `CHANGELOG.md`.
- [x] All added/changed functionality is tested.
Leaving here for reference some initial performance measurements for the TX path. The baseline is the latest main. The measurements are the percentage improvement compared to the baseline, i.e. `(new_version - baseline) / baseline * 100.0`.
| #VMs | Bandwidth (%) | VMM thread CPU load (%) | kworker CPU load* (%) | Bandwidth per CPU load (%) |
|---|---|---|---|---|
| 1 | 15.97 | -3.70 | 50.00 | 19.29 |
| 2 | 13.38 | -5.52 | 0.00 | 19.35 |
| 3 | 14.14 | -2.87 | 40.00 | 16.19 |
| 4 | 16.94 | -3.40 | 12.50 | 20.12 |
| 5 | 13.94 | -3.70 | 0.00 | 18.08 |
| 6 | 12.41 | -4.08 | 9.09 | 16.94 |
| 7 | 11.29 | -6.16 | 25.00 | 17.90 |
| 8 | 7.08 | -8.54 | 23.08 | 16.49 |
| 9 | 7.31 | -10.81 | -6.67 | 20.18 |
| 10 | 1.85 | -12.10 | 6.25 | 15.59 |
| 11 | 1.56 | -13.50 | -5.26 | 17.08 |
| 12 | 0.51 | -13.86 | 23.53 | 15.76 |
*kworker load numbers include a lot of noise, since we are measuring the load of all kworker threads on the system.
I also tried modifying the RX path to use `readv`. This required changing a bit the order in which we do things in the RX path. The order roughly goes like:
1. We get an event from the tap fd.
2. We try to find an available `DescriptorChain` from the RX queue.
3. We use `readv` to read from the tap directly into the `DescriptorChain`.
4. Go to step 2, until reading from the tap returns `EAGAIN` (or an error occurs).
The implementation works (here is a prototype); however, the performance degrades significantly.
I tried to track down the reason: I took flame graphs of executions with the test implementation and compared them against the version with only vectored writes enabled (this PR).
The flame graph when using only `writev`:
and the one when using both `writev` and `readv`:
In the second case, the majority of the time for receiving one packet (`Net::read_from_mmds_or_tap`) is consumed by `DescriptorChain::checked_new`. This is not true in the former case.
This made me realize that in the original code we do not parse the whole chain if it is not needed. For example, if we are receiving an ICMP frame (54 bytes long), we will parse only one descriptor in the chain. With the new workflow we do not know the size of the packet in advance (the tap does not provide `fseek` or anything similar), so we parse the whole chain when getting the descriptors to write into. These descriptor chains are >= 64K bytes long with our current configuration and consist of many descriptors, which causes the increased time spent in `DescriptorChain::checked_new`.
At the moment, I do not see any way to avoid this issue. The only workaround I could think of is asking the tap how many bytes are available for reading, but that functionality does not seem to exist.
We will come back to this when we have some bandwidth.
@aghecenco thanks for the review!!
Regarding:
> nit - I'm also sort of in favor of an error type here, given that later on the None branch translates into an error
I'm still in favour of the current variant, though I don't have (very very) strong feelings about it. My rationale is that the `IoVecBuffer` is independent of the device emulation. The fact that it is an error for the device emulation not to be able to read enough bytes is not an error in the `IoVecBuffer` code.
Also, the reason why I didn't make it an error to find 0 bytes in a buffer is probably due to my Unix conditioning :)
If you guys (@aghecenco, @mattschlebusch) have a strong preference on this, I will change it.
I used @zulinx86's tool to quantify the performance change introduced by this PR on all the platforms we support.
TCP throughput results:

| kernel | m5d (Skylake) | m5d (Cascade Lake) | m6i | m6gd |
|---|---|---|---|---|
| 4.14 | 3.1% | 4.4% | 7.6% | 13.5% |
| 5.10 | 3.7% | 5.2% | 7.3% | 16.5% |
Network latency results:

| kernel | m5d (Skylake) | m5d (Cascade Lake) | m6i | m6gd |
|---|---|---|---|---|
| 4.14 | -9.8% | -1.8% | -0.6% | 0.0% |
| 5.10 | 0.7% | 1.0% | 0.5% | 2.9% |