firecracker
Investigate snapshot restore performance on host kernel >= 5.4
The snapshot restore operation is slower on AMD than on Intel.
We should investigate why and either fix the root cause or change the test_snapshot_resume_latency test to check different threshold values for AMD and Intel.
This is not related to AMD; it seems to be related to the host kernel version. On 5.4 I'm getting restore times of around 30 ms. On 4.19 I'm getting restore times of around 4 ms. I'm getting similar results on both AMD and Intel.
On host kernel 5.4, it looks like the difference between 4 ms and 30 ms is caused by the jailer.
On host kernel 5.4 the problem seems to be caused by the cgroups: when I start the jailer without applying the cgroups, the overhead disappears.
The entire overhead comes from the KVM_CREATE_VM ioctl:
When running within the jailer cgroup:

```
ioctl(12, KVM_CREATE_VM, 0) = 14 <0.027335>
```

When running without cgroups:

```
ioctl(12, KVM_CREATE_VM, 0) = 14 <0.000352>
```

The difference is about 27 ms.
So far I have traced the overhead to the kvm_arch_post_init_vm() function in the host kernel. I will dig deeper.
Tracing the overhead further down the host kernel call stack:
```
kvm_arch_post_init_vm()
  -> kvm_mmu_post_init_vm()
    -> kvm_vm_create_worker_thread()
      -> kvm_vm_worker_thread()
        -> cgroup_attach_task_all()
          -> cgroup_attach_task()
            -> cgroup_migrate()
              -> cgroup_migrate_execute()
                -> cpuset_can_attach()
                  -> percpu_down_write(&cpuset_rwsem)
```
It looks like the overhead was introduced by this kernel patch.
More specifically, these 2 commits:

- sched/core: Prevent race condition between cpuset and __sched_setscheduler()
- cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem
Actually, to be even more specific, it looks like the overhead was introduced by the following commit alone: cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem
I tried to use cpuset_mutex instead of percpu_rwsem and the overhead disappears.
The issue is reproducible on Intel/AMD.
Just a quick update: I stumbled upon some documentation, and it looks like this is how a percpu_rwsem is supposed to work. Quoting from https://github.com/torvalds/linux/blob/master/Documentation/locking/percpu-rw-semaphore.rst:

> Locking for reading is very fast, it uses RCU and it avoids any atomic instruction in the lock and unlock path. On the other hand, locking for writing is very expensive, it calls synchronize_rcu() that can take hundreds of milliseconds.
I managed to reproduce the issue with this simple Rust executable:

```rust
use kvm_ioctls::Kvm;
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Without this sleep the issue does not reproduce (see below).
    thread::sleep(Duration::from_millis(500));

    let kvm = Kvm::new().unwrap();
    let start = Instant::now();
    kvm.create_vm().unwrap();
    let elapsed = start.elapsed();
    println!("Elapsed: {:.2?}", elapsed);
}
```
and this small script that emulates what the jailer does:

```bash
#!/bin/bash
# Create a cpuset cgroup (v1), give it CPUs and a memory node, then move
# this shell into it and run the reproducer from inside the cgroup.
cgcreate -g cpuset:/firecracker
echo 0-15 >> /sys/fs/cgroup/cpuset/firecracker/cpuset.cpus
echo 0 >> /sys/fs/cgroup/cpuset/firecracker/cpuset.mems
echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test
```
Note that if we don't add the extra thread::sleep(Duration::from_millis(500));, the issue doesn't reproduce.
When a writer requests access to some shared data, the RCU schedules it on a queue and then waits for at least one so-called "grace period" to pass. A grace period is a period in which the RCU waits for all the previously acquired read locks to be released; it ends after all the CPUs have gone through a quiescent state. Grace periods can be quite long.
When we start the process (echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test), a write lock request is performed and a grace period starts. If we don't add the extra thread::sleep(Duration::from_millis(500));, then by the time we get to kvm.create_vm().unwrap(); the previous grace period is still in progress, so AFAIU the RCU applies an optimization where the new lock request is "merged" into the same grace period as the previous one, and we don't have to wait for a new grace period.
If instead we do add the thread::sleep(Duration::from_millis(500));, the grace period generated by the first lock request ends, and kvm.create_vm().unwrap(); has to ask for a new grace period, ending up waiting 20-30 ms.
This is the desired RCU behavior. And it looks like the cgroup_init() code has a related optimization to drive down the boot time. Unfortunately this only works before the first use of the rcu_sync.
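My mental model of the merging behavior above can be sketched as a toy state machine. This is only an illustration of the reasoning, not the kernel's actual rcu_sync implementation; the type names and the 25 ms figure are invented for the example:

```rust
// Toy model: write-side lock requests mapped onto RCU grace periods.
// NOT the kernel's rcu_sync; names and the 25 ms figure are made up.

#[derive(Clone, Copy, PartialEq, Debug)]
enum Gp {
    Idle,
    InProgress,
}

struct RcuSyncModel {
    state: Gp,
    grace_period_ms: u64,
}

impl RcuSyncModel {
    fn new() -> Self {
        Self { state: Gp::Idle, grace_period_ms: 25 }
    }

    /// How long a writer waits. A writer arriving while a grace period is
    /// already in flight is "merged" into it (no extra wait in this model);
    /// a writer arriving when the sync is idle must start a fresh grace
    /// period and wait it out.
    fn writer_wait(&mut self) -> u64 {
        match self.state {
            Gp::InProgress => 0,
            Gp::Idle => {
                self.state = Gp::InProgress;
                self.grace_period_ms
            }
        }
    }

    /// Called once the in-flight grace period completes.
    fn gp_done(&mut self) {
        self.state = Gp::Idle;
    }
}

fn main() {
    let mut m = RcuSyncModel::new();
    // `echo $$ > .../tasks` attaches the task: a write lock starts a grace period.
    let attach = m.writer_wait();
    // KVM_CREATE_VM immediately afterwards (no sleep): merged, no extra wait.
    let no_sleep = m.writer_wait();
    // With the 500 ms sleep, the first grace period finishes in the meantime...
    m.gp_done();
    // ...so KVM_CREATE_VM has to wait out a brand-new grace period.
    let after_sleep = m.writer_wait();
    println!("attach={}ms no_sleep={}ms after_sleep={}ms", attach, no_sleep, after_sleep);
}
```

Running this prints `attach=25ms no_sleep=0ms after_sleep=25ms`, matching the pattern observed with and without the sleep.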
I don't think there's any easy way around this issue.
We have tracked down the root cause of this issue to the cgroups v1 implementation on 5.x kernels with x >= 4.
We were able to replicate the findings from the above investigation and tracked the latency impact down to the cgroup_attach_task_all function. This function is only used by cgroups v1 to attach the current task to all of the parent's cgroups.
This issue does not have a quick solution since it lies in the host kernel design.
And indeed the issue does not replicate on a cgroups-v2-enabled host.

- cgroups v2

```
ioctl(13, KVM_CREATE_VM, 0) = 14 <0.000734>
ioctl(13, KVM_CREATE_VM, 0) = 14 <0.000719>
ioctl(13, KVM_CREATE_VM, 0) = 14 <0.000750>
```

- cgroups v1

```
ioctl(13, KVM_CREATE_VM, 0) = 14 <0.026044>
ioctl(13, KVM_CREATE_VM, 0) = 14 <0.045146>
ioctl(13, KVM_CREATE_VM, 0) = 14 <0.036045>
```
Another downside visible in the v1 measurements is that the results also vary a lot from run to run.
Since this issue originates in the host kernel design, we currently recommend that users of the snapshot functionality on host kernels 5.4 and newer run on cgroups-v2-enabled hosts.
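As a quick way to tell which setup a host is running, one can check for cgroup2's marker file at the standard mount point. This is a sketch, not part of Firecracker: it assumes the usual /sys/fs/cgroup mount and a pure v1 or pure v2 hierarchy (hybrid mounts would be reported as v1):

```rust
use std::path::Path;

/// Returns 2 on a cgroups v2 (unified) host, 1 otherwise.
/// `cgroup.controllers` only exists at the root of a cgroup2 hierarchy;
/// a v1 mount is a tmpfs with per-controller subdirectories instead.
fn cgroup_version() -> u8 {
    if Path::new("/sys/fs/cgroup/cgroup.controllers").exists() {
        2
    } else {
        1
    }
}

fn main() {
    println!("host appears to be running cgroups v{}", cgroup_version());
}
```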
The snapshot resume latency results for 5.10 kernel for the currently supported x86 platforms can be found here: m5d and m6a.