Terraform Runs Slowly Inside Lima VM (`read udp 127.0.0.1:36068->127.0.0.53:53: i/o timeout`)
Description
Host OS: macOS 12.3.1
Host arch: arm64
Lima version: 0.9.2
Lima VM OS: Ubuntu 21.10
Lima VM arch: arm64
Running terraform plan or apply inside a Lima VM is very slow while terraform calculates the state of the environment. The code repository is cloned inside the VM, so the slowdown is not related to file sharing between the host OS and the VM. To narrow down the issue, I changed the nameserver in the VM's /etc/resolv.conf from 127.0.0.53 to my local router, and the issue went away. I'm not sure why queries through systemd-resolved would be so much slower. The difference is very noticeable: terraform plan takes minutes instead of seconds.
Did you build Lima from the HEAD of the master branch, or are you using the latest release?
There is a bug fix in #773 to the host resolver that could cause timeout issues. Could you compile Lima itself from master and see if this change affects the slowdown you are seeing?
I was using the released version installed through Homebrew. I've cloned the repo and ran make && make install for the master branch. limactl now reports: limactl version 0.9.2-20-g8eccbeb. I also restarted the machine to make sure that no background service was still running the old version.
Now I'm getting a different error:
bliles@lima-default:~/src/terraform/base$ terraform plan
╷
│ Error: error loading state: RequestError: send request failed
│ caused by: Get "https://[redacted]/tf-state.json": dial tcp: lookup [redacted] on 127.0.0.53:53: read udp 127.0.0.1:36068->127.0.0.53:53: i/o timeout
If I run a dig command on the host for the same address, the query time is 63 msec.
It's almost as if DNS resolution gets overwhelmed by the way terraform works (multithreaded queries to the infrastructure APIs). While terraform plan is running, I can run dig in a separate shell in the Lima VM and get timeouts. When terraform plan isn't running, DNS resolves normally.
@bliles we are trying to narrow down the cause of these i/o timeouts. What kind of DNS queries is terraform plan making? Is it only A, and maybe AAAA and CNAME? Or are there other queries, e.g. TXT queries, that terraform could be making? Is there any way to find that out? This would help us greatly in debugging this issue.
Without setting up Wireshark or something similar, I don't know of a way to tell you with absolute certainty. However, based on what terraform is doing, I think you can assume that most of the time terraform is just using the standard Go library to make HTTPS requests to the AWS APIs. Terraform isn't directly making DNS queries; it just needs to hit various APIs for AWS services in order to query the state of the infrastructure under management.
I decided to run another test using a personal project that is running in AWS and is managed with terraform. It is a tiny terraform project that just provides a redirect domain. When I run terraform plan, it generates 10 "Refreshing state..." log lines. This is not enough queries to cause the timeouts. The timeouts are only caused when there are around 20 "Refreshing state..." log lines emitted by terraform. You can see things running normally at first in the output until we hit roughly the 20th log line at which point console output slows to a crawl. Also of note is the fact that terraform works multi-threaded so likely the DNS server is bombarded with several queries at the same time.
We are going to try to reproduce the issue by writing a test app that will issue many simultaneous queries.
I have a vague suspicion that the problems might be related to the type of DNS answers provided by AWS though (e.g. longer than usual), so I'm not sure the local repro is going to work.
If it doesn't, would you be able to scale up your personal terraform project to a size that reproduces the problem, and share the config with us, so we could run it ourselves?
I will try to give you a repro that I can actually share with you. I tried just adding resources to my personal project and running terraform plan so as not to actually create the resources, but it didn't result in the same behavior.
It could be related to systemd-resolved.
I am currently facing internal DNS resolution issues!
For example:
- CoreDNS memory limits: I needed to remove that constraint.
- Running Ubuntu 22 caused DNS problems for me because of systemd-resolved.
I am working on a workaround to update kubelet