
Use `writev` for writing guest outgoing packets to the tap device

Open · bchalios opened this pull request 2 years ago

Avoid extra memory copy when writing to tap device in virtio-net TX path

Currently, when performing TX, a network frame may be scattered across various memory regions described by the DescriptorChain we receive from the guest. In order to write the frame to the tap device, we first copy those memory buffers into a single contiguous buffer, which we then write to the Tap file descriptor.

This PR changes that: it uses scatter-gather I/O via the writev system call, which avoids the intermediate memory copies.

Fixes #420

Description of Changes

The main blocker for using the writev system call in our device is that we need to be able to inspect the outgoing frame to check whether it is destined for MMDS. This requires reading at least the frame headers, which makes the code quite convoluted.

To do this in a clean way, I introduce an IoVec struct, which is essentially a Vec of std::io::IoSlice. The struct can be created from a DescriptorChain (without performing any copies) and can be used directly to perform the writev system call.

Moreover, it provides methods for reading ranges of bytes from it (which, at the moment, perform a copy). This functionality is used to inspect the range corresponding to the destination address, in order to check whether the frame is destined for MMDS.
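
For illustration, here is a minimal, self-contained sketch of the general idea behind such a buffer, using the libc crate. This is not the code added by this PR: the type name, the `from_regions` constructor, and the raw `(pointer, length)` region list are simplifications standing in for walking the actual DescriptorChain.

```rust
use std::io;
use std::os::unix::io::RawFd;

/// Illustrative wrapper over the scattered guest memory regions of a TX frame.
struct IoVecBuffer {
    iovecs: Vec<libc::iovec>,
}

impl IoVecBuffer {
    /// Build the iovec list from raw (pointer, length) pairs, e.g. collected
    /// while walking a DescriptorChain. No data is copied here.
    ///
    /// SAFETY: callers must guarantee the regions stay valid and mapped for
    /// the lifetime of the buffer (true for guest memory while the VM runs).
    unsafe fn from_regions(regions: &[(*const u8, usize)]) -> Self {
        let iovecs = regions
            .iter()
            .map(|&(base, len)| libc::iovec {
                iov_base: base as *mut libc::c_void,
                iov_len: len,
            })
            .collect();
        IoVecBuffer { iovecs }
    }

    /// Write the whole scattered frame to the tap fd with a single writev(2),
    /// avoiding the copy into an intermediate linear buffer.
    fn write_to(&self, tap_fd: RawFd) -> io::Result<usize> {
        // SAFETY: the iovecs describe valid, live memory (see from_regions).
        let ret =
            unsafe { libc::writev(tap_fd, self.iovecs.as_ptr(), self.iovecs.len() as i32) };
        if ret < 0 {
            Err(io::Error::last_os_error())
        } else {
            Ok(ret as usize)
        }
    }
}
```

With such a buffer, checking whether the frame targets MMDS only requires copying the few header bytes out of the first regions, while the bulk of the payload is never copied.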

  • [ ] This functionality can be added in rust-vmm.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

PR Checklist

[Author TODO: Meet these criteria.] [Reviewer TODO: Verify that these criteria are met. Request changes if not.]

  • [x] All commits in this PR are signed (git commit -s).
  • [x] The issue which led to this PR has a clear conclusion.
  • [x] This PR follows the solution outlined in the related issue.
  • [x] The description of changes is clear and encompassing.
  • [x] Any required documentation changes (code and docs) are included in this PR.
  • [x] Any newly added unsafe code is properly documented.
  • [x] Any API changes follow the Runbook for Firecracker API changes.
  • [x] Any user-facing changes are mentioned in CHANGELOG.md.
  • [x] All added/changed functionality is tested.

bchalios · Apr 12 '22 12:04

Leaving here for reference some initial performance measurements for the TX path. The baseline is the latest main. The measurements are the percentage improvement compared to the baseline, i.e. (new_version - baseline) / baseline * 100.0; for example, a bandwidth value of 15.97 means 15.97% more bandwidth than on main.

| #VMs | Bandwidth (%) | VMM thread CPU load (%) | kworker CPU load* (%) | Bandwidth per CPU load (%) |
|------|---------------|-------------------------|-----------------------|----------------------------|
| 1    | 15.97         | -3.70                   | 50.00                 | 19.29                      |
| 2    | 13.38         | -5.52                   | 0.00                  | 19.35                      |
| 3    | 14.14         | -2.87                   | 40.00                 | 16.19                      |
| 4    | 16.94         | -3.40                   | 12.50                 | 20.12                      |
| 5    | 13.94         | -3.70                   | 0.00                  | 18.08                      |
| 6    | 12.41         | -4.08                   | 9.09                  | 16.94                      |
| 7    | 11.29         | -6.16                   | 25.00                 | 17.90                      |
| 8    | 7.08          | -8.54                   | 23.08                 | 16.49                      |
| 9    | 7.31          | -10.81                  | -6.67                 | 20.18                      |
| 10   | 1.85          | -12.10                  | 6.25                  | 15.59                      |
| 11   | 1.56          | -13.50                  | -5.26                 | 17.08                      |
| 12   | 0.51          | -13.86                  | 23.53                 | 15.76                      |

*kworker load numbers include a lot of noise, since we are measuring the load of all kworker threads on the system.

bchalios · Apr 19 '22 11:04

I also tried modifying the RX path in order to use readv.

This required slightly changing the order in which we do things in the RX path. The new order roughly goes like this (a syscall-level sketch follows the list):

  1. We get an event from the tap fd
  2. We try to find an available DescriptorChain from the RX queue
  3. We use readv to read from the tap directly into the buffers of the DescriptorChain
  4. Go to step 2, until reading from the tap returns EAGAIN (or an error occurs)
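
For reference, here is a rough, self-contained sketch of what step 3 and the EAGAIN check in step 4 look like at the syscall level, using the libc crate and assuming a non-blocking tap fd. This is not the prototype's actual code; the raw `(pointer, length)` region list stands in for the buffers of an RX DescriptorChain.

```rust
use std::io;
use std::os::unix::io::RawFd;

/// Read one packet from a non-blocking tap fd directly into scattered,
/// writable buffers. Returns Ok(None) when the tap is drained (EAGAIN).
///
/// Assumption: `regions` point to valid, writable memory that outlives the
/// call, as guest memory does while the VM is running.
fn readv_from_tap(tap_fd: RawFd, regions: &[(*mut u8, usize)]) -> io::Result<Option<usize>> {
    let iovecs: Vec<libc::iovec> = regions
        .iter()
        .map(|&(base, len)| libc::iovec {
            iov_base: base as *mut libc::c_void,
            iov_len: len,
        })
        .collect();

    // SAFETY: the iovecs describe valid writable memory (see above).
    let ret = unsafe { libc::readv(tap_fd, iovecs.as_ptr(), iovecs.len() as i32) };
    if ret >= 0 {
        return Ok(Some(ret as usize));
    }
    let err = io::Error::last_os_error();
    if err.kind() == io::ErrorKind::WouldBlock {
        // Step 4: nothing left to read, go back to waiting for tap events.
        Ok(None)
    } else {
        Err(err)
    }
}
```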

The implementation works (here is a prototype); however, the performance degrades significantly.

I tried to track down the reason: I took flamegraphs of executions with the test implementation and compared them against the version with only vectored writes enabled (this PR).

The flame graph when using only writev: tx_iovec

and the one when using both writev and readv: rx_iovec

In the second case the majority of the time for receiving one packet (Net::read_from_mmds_or_tap) is consumed by DescriptorChain::checked_new. This is not true in the former case.

This made me realize that, in the original code, we do not parse the whole chain if it is not needed. For example, if we are receiving an ICMP frame (54 bytes long), we will parse only one descriptor in the chain. With the new workflow we do not know the size of the packet in advance (the tap does not provide an fseek-like mechanism), so we parse the whole chain when getting the descriptors to write into. With our current configuration these descriptor chains are >= 64K bytes long and consist of many descriptors, which causes the increased time spent in DescriptorChain::checked_new.

At the moment, I do not see a way to avoid this issue. The only workaround I could think of is asking the tap how many bytes are available for reading, but that functionality doesn't seem to exist.

bchalios · May 05 '22 13:05

We will come back to this when we have some bandwidth.

dianpopa · Sep 27 '22 08:09

@aghecenco thanks for the review!!

Regarding:

> nit - I'm also sort of in favor of an error type here, given that later on the None branch translates into an error

I'm still in favour of the current variant, though I don't have (very, very) strong feelings about it. My rationale is that the IoVecBuffer is independent of the device emulation. The fact that it is an error for the device emulation to not be able to read enough bytes is not an error in the IoVecBuffer code.

Also, the reason why I didn't make it an Error to find 0 bytes in a buffer is probably due to my Unix conditioning :)

If you guys (@aghecenco, @mattschlebusch) have a strong preference on this, I will change it.

bchalios · Nov 07 '22 09:11

I used @zulinx86's tool to quantify the performance change introduced by this PR on all the platforms we support.

TCP throughput results:

| kernel | m5d (Skylake) | m5d (Cascade Lake) | m6i  | m6gd  |
|--------|---------------|--------------------|------|-------|
| 4.14   | 3.1%          | 4.4%               | 7.6% | 13.5% |
| 5.10   | 3.7%          | 5.2%               | 7.3% | 16.5% |

Network latency results:

| kernel | m5d (Skylake) | m5d (Cascade Lake) | m6i   | m6gd |
|--------|---------------|--------------------|-------|------|
| 4.14   | -9.8%         | -1.8%              | -0.6% | 0.0% |
| 5.10   | 0.7%          | 1.0%               | 0.5%  | 2.9% |

bchalios · Dec 12 '22 09:12