Performance report and some puzzles
Recently, I ran some CPU/memory-related benchmarks (lmbench) on ARM baremetal, a qemu/kvm VM, and a Firecracker microVM.
A striking difference from the x86 platform is that most of the benchmarks on the microVM outperform baremetal, which is quite surprising!
| bench | baremetal | qemu VM | microVM |
| --- | --- | --- | --- |
| pipe | 12.48 us | 14.52 us | 8.07 us |
| fork+exit | 228 us | 265.4 us | 205.32 us |
| fork+execve | 751.57 us | 823.28 us | 602.77 us |
| mmap | 38.87 us | 37.81 us | 18.93 us |
As I understand it, a VM carries the overhead of emulation and trapping, so I am confused by this result. Can anyone explain it?
For the mmap benchmark, I carefully analysed the running time on baremetal and Firecracker. The mmap() call itself (all it really does is change some kernel data structures, and possibly the page table; it doesn't put anything into physical memory at all) runs slower on Firecracker, but actually accessing the mapped data (which triggers page faults) is faster on Firecracker.
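A rough sketch of how those two parts can be timed separately (this is not lmbench itself; the mapping size and one-touch-per-page loop are arbitrary choices for illustration):

```c
/* Sketch: time the mmap() call separately from the first write to each
 * page, which is what triggers the page faults that actually allocate
 * physical memory. Not lmbench; sizes are arbitrary. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    size_t len = 64 * 1024 * 1024;              /* 64 MiB */
    long page = sysconf(_SC_PAGESIZE);

    double t0 = now_us();
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    double t1 = now_us();
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch one byte per page so every page takes a fault. */
    for (size_t off = 0; off < len; off += (size_t)page)
        p[off] = 1;
    double t2 = now_us();

    printf("mmap() call: %.2f us\n", t1 - t0);
    printf("first touch: %.2f us\n", t2 - t1);
    munmap(p, len);
    return 0;
}
```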
Hi @thunderZH963, thanks for reaching out!
We don't currently use lmbench for testing and we don't compare host and guest performance. I know we had some instances in the past where we've seen an application running inside the microVM run faster than on the host, though most of the time it turned out to be a measuring error or clock drift/skew in the guest. There are other things to consider as well; some that I can think of are:
- is the application using dynamic libraries of different versions on host and guest?
- is the application optimized differently when it's built on the guest vs the host?
- is the application scheduled differently or experiencing more issues from noisy neighbours on the host?
Trapping to the hypervisor/VMM only occurs in certain situations, and when it does happen there is a significant performance impact on the guest.
Do you think it might be the cache? For example, does Firecracker have some special execution cache, or maybe a different cache size? From my observations, these benchmarks are worse than the host when the number of iterations is small, but become better as the number of iterations increases.
- I checked the dynamic libraries with ldd; I am sure they are the same on host and guest.
- Does the optimization you mentioned come from the compiler? To avoid this, I use the same executable on the host and in the guest.
- Do you have any suggestions regarding your third point, or about clock drift/skew? Thank you!
My suggestion would be to disable NTP or any form of time synchronisation if the application is using gettimeofday or similar to measure time. Another idea may be to use a counter to get the time elapsed (e.g. Virtual Count Register), but maybe it's not worth changing the benchmark application for this.
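For reference, a minimal aarch64 sketch of reading the counter directly instead of going through gettimeofday (this assumes the kernel exposes CNTVCT_EL0/CNTFRQ_EL0 to userspace, which Linux does by default on arm64):

```c
/* Sketch (aarch64 only): read the virtual counter (CNTVCT_EL0) and its
 * frequency (CNTFRQ_EL0) directly, so the measurement does not depend on
 * gettimeofday() or on NTP adjustments inside the guest. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t read_cntvct(void)
{
    uint64_t val;
    /* isb prevents the counter read from being speculated early. */
    __asm__ __volatile__("isb; mrs %0, cntvct_el0" : "=r"(val));
    return val;
}

static inline uint64_t read_cntfrq(void)
{
    uint64_t val;
    __asm__ __volatile__("mrs %0, cntfrq_el0" : "=r"(val));
    return val;
}

int main(void)
{
    uint64_t freq  = read_cntfrq();
    uint64_t start = read_cntvct();

    /* ... code under test goes here ... */

    uint64_t end = read_cntvct();
    printf("elapsed: %.2f us\n",
           (double)(end - start) * 1e6 / (double)freq);
    return 0;
}
```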
"rdtsc" has been used and proved to be as same as gettimeofday. I haven't tried NTP yet, but what makes me wonder is whether it will make the internal time of firecracker shorter? At the same time, a new discovery is that when there are multiple vcpus (=4), firetracker is generally weaker than baremetal. The previous one vcpu did perform well. From this point of view, can we say that VM CPU scheduling switching costs more? Of course, the question of why one vcpu is good still exists
An important addition: I have not seen the above behaviour on the x86 platform. The cases where Firecracker has better performance than baremetal are limited to ARM.
Hi @thunderZH963, if you are still experiencing this issue, could you please provide us with a minimal reproducible example? E.g. not just the tool used, but also the exact test setup (what parameters? guest kernel? host kernel? CPU model?) and firecracker version? Ideally a script that we can run to verify these findings. Thanks