
Investigate snapshot restore performance on host kernel >= 5.4

serban300 opened this issue on Sep 11 '20 · 12 comments

The snapshot restore operation is slower on AMD than on Intel.

We should investigate why and either fix the root cause or change the test_snapshot_resume_latency test to check different values for AMD and Intel.

serban300 avatar Sep 11 '20 10:09 serban300

This is not related to AMD. It seems to be related to the host kernel version: on 5.4 I'm getting restore times of around 30 ms, while on 4.19 they are around 4 ms. I'm getting similar results on both AMD and Intel.

serban300 avatar Sep 11 '20 14:09 serban300

On host kernel 5.4, it looks like the difference between 4 ms and 30 ms is caused by the jailer.

serban300 avatar Sep 11 '20 17:09 serban300

On host kernel 5.4 the problem seems to be caused by cgroups. When I start the jailer without applying cgroups, the overhead disappears.

serban300 avatar Sep 12 '20 13:09 serban300

The entire overhead comes from the KVM_CREATE_VM ioctl:

When running within the jailer cgroup:

ioctl(12, KVM_CREATE_VM, 0)             = 14 <0.027335>

When running without cgroups:

ioctl(12, KVM_CREATE_VM, 0)             = 14 <0.000352>

The difference is about 27 ms.
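For reference, per-syscall timings like the ones above can be collected with strace's -T option (the binary path and API socket below are only placeholders, not the exact invocation used here):

```shell
# -T appends the time spent inside each syscall (the <0.027335> figures),
# -f follows child threads, and the filter keeps only ioctl calls.
# The firecracker path and API socket are placeholders.
strace -f -T -e trace=ioctl ./firecracker --api-sock /tmp/firecracker.sock
```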

serban300 avatar Sep 14 '20 14:09 serban300

For the moment I traced the overhead to the kvm_arch_post_init_vm() function in the host kernel. I will dig deeper.

serban300 avatar Sep 15 '20 11:09 serban300

Tracing the overhead further down the host kernel call stack:

kvm_arch_post_init_vm()
-> kvm_mmu_post_init_vm()
-> kvm_vm_create_worker_thread()
-> kvm_vm_worker_thread()
-> cgroup_attach_task_all()
-> cgroup_attach_task()
-> cgroup_migrate()
-> cgroup_migrate_execute()
-> cpuset_can_attach()
-> percpu_down_write(&cpuset_rwsem)
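A call path like this can be confirmed on the host with ftrace's function_graph tracer (a sketch only; it assumes root, tracefs mounted at /sys/kernel/tracing, and that kvm_mmu_post_init_vm is not inlined away):

```shell
# Graph all functions called beneath kvm_mmu_post_init_vm and show
# per-function durations while a VM is being created.
cd /sys/kernel/tracing
echo kvm_mmu_post_init_vm > set_graph_function
echo function_graph > current_tracer
cat trace_pipe
```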

serban300 avatar Sep 18 '20 15:09 serban300

Looks like the overhead was introduced by this kernel patch.

More specifically these 2 commits:

  • sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  • cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem

serban300 avatar Sep 23 '20 14:09 serban300

Actually, to be even more specific, it looks like the overhead was introduced by a single commit: cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem

I tried using a plain cpuset_mutex instead of the percpu_rwsem and the overhead disappeared.

serban300 avatar Sep 24 '20 10:09 serban300

The issue is reproducible on both Intel and AMD.

sandreim avatar Sep 25 '20 09:09 sandreim

Just a quick update. I stumbled upon some documentation, and it looks like this is how a percpu_rwsem is supposed to work. Quoting from https://github.com/torvalds/linux/blob/master/Documentation/locking/percpu-rw-semaphore.rst:

Locking for reading is very fast, it uses RCU and it avoids any atomic instruction in the lock and unlock path. On the other hand, locking for writing is very expensive, it calls synchronize_rcu() that can take hundreds of milliseconds.

serban300 avatar Sep 28 '20 08:09 serban300

I managed to reproduce the issue with this simple Rust executable:

use kvm_ioctls::Kvm;
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Without this sleep the issue does not reproduce (see the note below).
    thread::sleep(Duration::from_millis(500));

    let kvm = Kvm::new().unwrap();

    // Time only the KVM_CREATE_VM ioctl.
    let start = Instant::now();
    kvm.create_vm().unwrap();
    let elapsed = start.elapsed();
    println!("Elapsed: {:.2?}", elapsed);
}

and this small script that emulates what jailer does:

#!/bin/bash

# Emulate the jailer's cgroup setup: create a cpuset cgroup and
# configure its cpus/mems (adjust the 0-15 range to the host).
cgcreate -g cpuset:/firecracker
echo 0-15 > /sys/fs/cgroup/cpuset/firecracker/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/firecracker/cpuset.mems

# Move the current shell into the cgroup, then run the reproducer.
echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test

Note that if we don't add the extra sleep:

    thread::sleep(Duration::from_millis(500));

the issue doesn't reproduce.

serban300 avatar Sep 29 '20 12:09 serban300

When a writer requests access to some shared data, RCU schedules it on a queue and then waits for at least one so-called "grace period" to pass. A grace period is a period in which RCU waits for all previously acquired read locks to be released; it ends after all CPUs have gone through a quiescent state. Grace periods can be quite long.

When we start the process (echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test), a write lock request is performed and a grace period starts. If we don't add the extra thread::sleep(Duration::from_millis(500));, then by the time we get to kvm.create_vm().unwrap(); the previous grace period is still in progress, so AFAIU RCU applies an optimization where the new lock request is "merged" into the same grace period as the previous one, and we don't have to wait for a new grace period.

Otherwise, if we do thread::sleep(Duration::from_millis(500));, the grace period generated by the first lock request ends, and kvm.create_vm().unwrap(); has to ask for a new grace period, ending up waiting 20-30 ms.

This is the desired RCU behavior. And it looks like the cgroup_init() code has a related optimization to drive down the boot time. Unfortunately this only works before the first use of the rcu_sync.

I don't think there's any easy way around this issue.

serban300 avatar Oct 06 '20 12:10 serban300

We have tracked down the root cause of this issue to the cgroups v1 implementation on 5.x kernels with x >= 4.

We were able to replicate the findings from the above investigation and tracked the latency impact down to the cgroup_attach_task_all function. This function is only used by cgroups v1 to attach the current task to all of its parent's cgroups. This issue has no quick solution, since it lies in the host kernel design. And indeed, the issue does not replicate on a cgroups v2 enabled host.

  • cgroups v2
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.000734>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.000719>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.000750>
  • cgroups v1
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.026044>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.045146>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.036045>

Another problem visible in the v1 measurements is that the results also vary a lot from run to run.
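For completeness, attaching a process to a v2 cgroup before launching the binary can be sketched like this (hypothetical group name; assumes root privileges, the unified hierarchy mounted at /sys/fs/cgroup, and the cpuset controller available):

```shell
# cgroups v2: a single unified hierarchy; tasks join via cgroup.procs.
mkdir -p /sys/fs/cgroup/firecracker
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
echo $$ > /sys/fs/cgroup/firecracker/cgroup.procs && ./test
```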

Since this issue originates in the host kernel design, we currently recommend that users who need the snapshot functionality on kernels newer than 5.4 run on cgroups v2 enabled hosts.
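A quick way to check which hierarchy a host exposes is to look for the v2-only cgroup.controllers file at the cgroup mount root (a minimal sketch; it reports v1 whenever the unified mount is absent):

```shell
#!/bin/bash
# cgroups v2 mounts a unified hierarchy that exposes cgroup.controllers
# at its root; cgroups v1 has no such file.
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    echo "cgroups v2 (unified hierarchy)"
else
    echo "cgroups v1 (legacy hierarchy)"
fi
```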

The snapshot resume latency results for 5.10 kernel for the currently supported x86 platforms can be found here: m5d and m6a.

dianpopa avatar Nov 14 '22 14:11 dianpopa