Terraform Runs Slowly Inside Lima VM (`read udp 127.0.0.1:36068->127.0.0.53:53: i/o timeout`)

Open bliles opened this issue 2 years ago • 9 comments

Description

Host OS: macOS 12.3.1
Host arch: arm64
Lima version: 0.9.2
Lima VM OS: Ubuntu 21.10
Lima VM arch: arm64

Attempting to run terraform plan or apply inside a lima VM is very slow while terraform calculates the state of the environment. The code repository is cloned inside the VM (so the performance is not related to file sharing between the host OS and the VM). Just to see if I could narrow down the issue, I tried overriding the VM DNS settings in /etc/resolv.conf to my local router instead of 127.0.0.53 and the issue was fixed. Not sure why the systemd-resolved queries would be so much slower. The difference is very noticeable, with terraform plan taking minutes instead of seconds.
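To check which resolver a VM is actually using before and after such an override, something like the following could help. It is a minimal sketch, not something from this thread: the helper names `nameservers` and `uses_stub` are made up for illustration, and it only reads `nameserver` lines, ignoring every other resolv.conf option.

```python
from pathlib import Path

# Address of the systemd-resolved local stub listener seen in the error above.
STUB_ADDRESS = "127.0.0.53"


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Treat anything after '#' or ';' as a comment and drop it.
        line = line.split("#", 1)[0].split(";", 1)[0]
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_stub(resolv_conf_text):
    """True if the text points lookups at the systemd-resolved stub."""
    return STUB_ADDRESS in nameservers(resolv_conf_text)


if __name__ == "__main__":
    path = Path("/etc/resolv.conf")
    if path.exists():
        text = path.read_text()
        print("nameservers:", nameservers(text))
        print("uses systemd-resolved stub:", uses_stub(text))
```

Run inside the VM, this only reports where lookups are sent. It does not measure how fast they are answered.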

bliles avatar Apr 11 '22 15:04 bliles

Did you build lima from the HEAD of the master branch, or are you using the latest release?

There is a bug fix in #773 to the host resolver that could cause timeout issues. Could you compile Lima itself from master and see if this change affects the slowdown you are seeing?

jandubois avatar Apr 11 '22 15:04 jandubois

I was using the released version installed through homebrew. I cloned the repo and ran make && make install on the master branch. limactl now reports: limactl version 0.9.2-20-g8eccbeb. I also restarted the machine to make sure that no background service was still running the old version.

Now I'm getting a different error:

bliles@lima-default:~/src/terraform/base$ terraform plan
╷
│ Error: error loading state: RequestError: send request failed
│ caused by: Get "https://[redacted]/tf-state.json": dial tcp: lookup [redacted] on 127.0.0.53:53: read udp 127.0.0.1:36068->127.0.0.53:53: i/o timeout

If I run a dig command on the host for the same address, the query time is 63 msec.

bliles avatar Apr 11 '22 16:04 bliles

It's almost like DNS resolution gets overwhelmed by the way terraform works (multithreaded queries to the infrastructure APIs). While terraform plan is running, I can run dig in a separate shell in the Lima VM and get timeouts. But when terraform plan isn't running, DNS resolves normally.

bliles avatar Apr 11 '22 16:04 bliles

@bliles we are trying to narrow down the cause of these i/o timeouts. What kind of DNS queries is terraform plan making? Is it only A, and maybe AAAA and CNAME? Or are there other queries, e.g. TXT queries, that terraform could be making? Is there any way to find that out? This would greatly help us debug this issue.

Nino-K avatar Apr 11 '22 17:04 Nino-K

Without setting up wireshark or something, I don't know a way I could tell you with absolute certainty. However, based on what terraform is doing, I think you can assume that most of the time terraform is just using the standard go lib to make HTTPS requests to the AWS APIs. Terraform isn't directly making DNS queries, it just needs to hit various APIs for AWS services in order to query the state of the infrastructure under management.
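For reference, the query type asked about above is a 16-bit field that follows the name in the question section of each DNS packet, so it can be read from any captured query payload. The sketch below is illustrative only and not from this thread: `build_query` and `question_of` are hypothetical helper names, only four record types are mapped, and name compression is not handled (queries normally do not use it in the question).

```python
import struct

# Record type numbers from the DNS specification (subset).
QTYPE_NAMES = {1: "A", 5: "CNAME", 16: "TXT", 28: "AAAA"}


def build_query(name, qtype, query_id=0x1234):
    """Build a minimal DNS query packet: one question, recursion desired."""
    # Header: id, flags (RD bit set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class 1 = IN


def question_of(packet):
    """Return (name, type name) for the first question in a DNS packet."""
    offset = 12  # the fixed-size header is 12 bytes
    labels = []
    while packet[offset] != 0:
        length = packet[offset]
        labels.append(packet[offset + 1 : offset + 1 + length].decode("ascii"))
        offset += 1 + length
    qtype, _qclass = struct.unpack_from("!HH", packet, offset + 1)
    return ".".join(labels), QTYPE_NAMES.get(qtype, str(qtype))


if __name__ == "__main__":
    print(question_of(build_query("example.com", 28)))
```

`build_query` is there only to make the example self-contained. In practice the bytes passed to `question_of` would come from a packet capture of traffic to 127.0.0.53:53.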

bliles avatar Apr 11 '22 17:04 bliles

I decided to run another test using a personal project that is running in AWS and is managed with terraform. It is a tiny terraform project that just provides a redirect domain. When I run terraform plan, it generates 10 "Refreshing state..." log lines. This is not enough queries to cause the timeouts. The timeouts are only caused when there are around 20 "Refreshing state..." log lines emitted by terraform. You can see things running normally at first in the output until we hit roughly the 20th log line at which point console output slows to a crawl. Also of note is the fact that terraform works multi-threaded so likely the DNS server is bombarded with several queries at the same time.

bliles avatar Apr 11 '22 18:04 bliles

We are going to try to reproduce the issue by writing a test app that will issue many simultaneous queries.

I have a vague suspicion that the problems might be related to the type of DNS answers provided by AWS though (e.g. longer than usual), so I'm not sure if the local repro is going to work.

If it doesn't, would you be able to scale up your personal terraform project to a size that reproduces the problem, and share the config with us, so we could run it ourselves?
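A test app of the kind described above might look roughly like this. It is a sketch under assumptions, not the tool the maintainers wrote: `timed_lookup` and `burst` are hypothetical names, the default of 20 workers simply mirrors the "around 20" observation earlier in the thread, and lookups go through the system resolver via `socket.getaddrinfo`.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def timed_lookup(host, resolve=socket.getaddrinfo):
    """Resolve one host; return (host, elapsed seconds, error text or None)."""
    start = time.perf_counter()
    try:
        resolve(host, 443)
        error = None
    except OSError as exc:  # socket.gaierror and timeouts are OSError subclasses
        error = str(exc)
    return host, time.perf_counter() - start, error


def burst(hosts, workers=20, resolve=socket.getaddrinfo):
    """Resolve all hosts with up to `workers` lookups in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as `hosts`.
        return list(pool.map(lambda h: timed_lookup(h, resolve), hosts))
```

Calling `burst(hosts)` with a list of the API hostnames a project talks to, then counting the entries whose third element is not `None`, would show whether failures start to appear as `workers` grows. Repeating a single hostname may mostly exercise the resolver's cache, so a varied host list is probably more representative.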

jandubois avatar Apr 11 '22 18:04 jandubois

I will try to give you a repro that I can actually share with you. I tried just adding resources to my personal project and running terraform plan so as not to actually create the resources, but it didn't result in the same behavior.

bliles avatar Apr 11 '22 19:04 bliles

It could be related to systemd-resolved.

I am currently facing internal DNS resolution issues!

For example:

  1. CoreDNS memory limits: I needed to remove that constraint.
  2. Running Ubuntu 22 caused DNS problems for me because of systemd-resolved.

I am working on a workaround to update kubelet.

rdgacarvalho avatar Jun 17 '22 08:06 rdgacarvalho