Performance report and some puzzles
Recently, I ran some CPU/memory-related benchmarks (lmbench) on ARM baremetal, a qemu/kvm VM, and a Firecracker microVM.
A striking difference from the x86 platform is that most of the benchmarks on the microVM outperform baremetal, which is quite surprising!
| bench | baremetal | qemu VM | microVM |
| --- | --- | --- | --- |
| pipe | 12.48 us | 14.52 us | 8.07 us |
| fork+exit | 228 us | 265.4 us | 205.32 us |
| fork+execve | 751.57 us | 823.28 us | 602.77 us |
| mmap | 38.87 us | 37.81 us | 18.93 us |
As I understand it, a VM carries the overhead of emulation and trapping, so I am confused by this result. Can anyone explain it?
For the mmap benchmark, I carefully analysed the running time on baremetal and Firecracker. The mmap() call itself (all it really does is change some kernel data structures, and possibly the page table; it doesn't put anything into physical memory at all) runs slower on Firecracker, but actually accessing the mapped data (which triggers page faults) is faster on Firecracker.
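A rough sketch of how those two parts can be timed separately (this is not lmbench itself; the mapping size and one-touch-per-page loop are arbitrary choices for illustration):

```c
/* Sketch: time the mmap() call separately from the first write to each
 * page, which is what triggers the page faults that actually allocate
 * physical memory. Not lmbench; sizes are arbitrary. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    size_t len = 64 * 1024 * 1024;              /* 64 MiB */
    long page = sysconf(_SC_PAGESIZE);

    double t0 = now_us();
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    double t1 = now_us();
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch one byte per page so every page takes a fault. */
    for (size_t off = 0; off < len; off += (size_t)page)
        p[off] = 1;
    double t2 = now_us();

    printf("mmap() call: %.2f us\n", t1 - t0);
    printf("first touch: %.2f us\n", t2 - t1);
    munmap(p, len);
    return 0;
}
```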
Hi @thunderZH963, thanks for reaching out!
We don't currently use lmbench for testing and we don't compare host and guest performance. I know we had some instances in the past where we've seen an application running inside the microVM run faster than on the host, though most of the time it turned out to be a measuring error or clock drift/skew in the guest. There are other things to consider as well; some that I can think of are:
- is the application using dynamic libraries of different versions on host and guest?
- is the application optimized differently when it's built on the guest vs the host?
- is the application scheduled differently or experiencing more issues from noisy neighbours on the host?
Trapping to the hypervisor/VMM only occurs in certain situations, and when it does happen there is a significant performance impact on the guest.
Do you think it might be the cache? For example, does Firecracker have some special execution cache, or maybe a different cache size? From my observations, these benchmarks are worse than the host when the number of iterations is small, but become better as the number of iterations increases.
- I checked the dynamic libraries with ldd; I am sure they are the same on host and guest.
- Does the optimization you mentioned come from the compiler? To avoid this, I use the same executable on the host and in the guest.
- Do you have any suggestions regarding your third point, or about clock drift/skew? Thank you!
My suggestion would be to disable NTP or any form of time synchronisation if the application is using gettimeofday or similar to measure time. Another idea may be to use a counter to get the time elapsed (e.g. Virtual Count Register), but maybe it's not worth changing the benchmark application for this.
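For reference, a minimal aarch64 sketch of reading the counter directly instead of going through gettimeofday (this assumes the kernel exposes CNTVCT_EL0/CNTFRQ_EL0 to userspace, which Linux does by default on arm64):

```c
/* Sketch (aarch64 only): read the virtual counter (CNTVCT_EL0) and its
 * frequency (CNTFRQ_EL0) directly, so the measurement does not depend on
 * gettimeofday() or on NTP adjustments inside the guest. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t read_cntvct(void)
{
    uint64_t val;
    /* isb prevents the counter read from being speculated early. */
    __asm__ __volatile__("isb; mrs %0, cntvct_el0" : "=r"(val));
    return val;
}

static inline uint64_t read_cntfrq(void)
{
    uint64_t val;
    __asm__ __volatile__("mrs %0, cntfrq_el0" : "=r"(val));
    return val;
}

int main(void)
{
    uint64_t freq  = read_cntfrq();
    uint64_t start = read_cntvct();

    /* ... code under test goes here ... */

    uint64_t end = read_cntvct();
    printf("elapsed: %.2f us\n",
           (double)(end - start) * 1e6 / (double)freq);
    return 0;
}
```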
"rdtsc" has been used and proved to be as same as gettimeofday. I haven't tried NTP yet, but what makes me wonder is whether it will make the internal time of firecracker shorter? At the same time, a new discovery is that when there are multiple vcpus (=4), firetracker is generally weaker than baremetal. The previous one vcpu did perform well. From this point of view, can we say that VM CPU scheduling switching costs more? Of course, the question of why one vcpu is good still exists
An important addition: I have not seen the above behaviour on the x86 platform. The cases where Firecracker has better performance than baremetal are limited to ARM.
Hi @thunderZH963, if you are still experiencing this issue, could you please provide us with a minimal reproducible example? E.g. not just the tool used, but also the exact test setup (what parameters? guest kernel? host kernel? CPU model?) and firecracker version? Ideally a script that we can run to verify these findings. Thanks