Dropped DNS/UDP requests on user-v2 network

Open fatanugraha opened this issue 1 year ago • 0 comments

Description

I'm experiencing intermittent DNS resolution failures. The issue becomes more prominent after the VM has been running for a while (or after running network-heavy workloads).

I tracked down the root cause, and it seems to be caused by a bug in gvisor-tap-vsock (PR).

I'm opening this issue here so we can switch to the fixed gvisor-tap-vsock once the bug fix is released.

Setup

limactl version 0.23.2
colima version 0.7.3

lima.yaml (created by colima start --vm-type vz)
vmType: vz
arch: aarch64
images:
    - location: /path/to/image.raw
      arch: aarch64
cpus: 4
memory: 8GiB
disk: 60GiB
mounts:
    - location: "~"
      writable: true
    - location: /tmp/colima
      writable: true
mountType: virtiofs
ssh:
    loadDotSSHPubKeys: false
    forwardAgent: false
containerd:
    system: false
    user: false
dns: []
firmware:
    legacyBIOS: false
hostResolver:
    enabled: true
    hosts:
        host.docker.internal: host.lima.internal
portForwards:
    - guestPortRange:
        - 0
        - 0
      guestSocket: /var/run/docker.sock
      hostPortRange:
        - 0
        - 0
      hostSocket: /Users/fata.nugraha/.colima/default/docker.sock
      proto: tcp
    - guestPortRange:
        - 0
        - 0
      guestSocket: /var/run/docker.sock
      hostPortRange:
        - 0
        - 0
      hostSocket: /Users/fata.nugraha/.colima/docker.sock
      proto: tcp
    - guestIPMustBeZero: true
      guestIP: 0.0.0.0
      guestPortRange:
        - 1
        - 65535
      hostIP: 0.0.0.0
      hostPortRange:
        - 1
        - 65535
      proto: tcp
    - guestIP: 127.0.0.1
      guestPortRange:
        - 1
        - 65535
      hostIP: 127.0.0.1
      hostPortRange:
        - 1
        - 65535
      proto: tcp
networks:
    - lima: user-v2
provision:
    - mode: system
      script: sysctl -w fs.inotify.max_user_watches=1048576
    - mode: dependency
      script: groupadd -f docker && usermod -aG docker {{ .User }}
    - mode: system
      script: hostnamectl set-hostname colima
    - mode: system
      script: mount -a
    - mode: system
      script: readlink /usr/sbin/fstrim || fstrim -a

Reproduction steps

The issue appears when the same source IP:port is reused after at least 90s of inactivity.

Run the code below inside the VM.

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	r := &net.Resolver{
		PreferGo: true, // use Go's built-in resolver so the custom Dial below is used
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			// Pin the source port so every lookup reuses the same source IP:port.
			addr, err := net.ResolveUDPAddr("udp", "0.0.0.0:50406")
			if err != nil {
				panic(err)
			}

			d := net.Dialer{
				Timeout:   10 * time.Second,
				KeepAlive: -1,
				LocalAddr: addr,
			}

			// Ignore the resolver-supplied address and always query 8.8.8.8.
			conn, err := d.DialContext(ctx, network, "8.8.8.8:53")
			if err != nil {
				panic(err)
			}

			fmt.Println("LocalAddr: ", conn.LocalAddr())

			return conn, nil
		},
	}

	lookup := func() {
		fmt.Printf("%s starting LookupIP\n", time.Now())
		_, err := r.LookupIP(context.Background(), "ip4", "www.google.com")
		if err != nil {
			fmt.Println("err", err)
		} else {
			fmt.Println("ok")
		}
	}

	lookup()                     // ok
	time.Sleep(95 * time.Second) // wait for the UDPConnTimeout
	lookup()                     // this will fail after 2 retries
}
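
To check that the failure is tied to the reused source port rather than to Go's resolver, the same experiment can be done with a raw UDP socket. The following is a minimal sketch, not part of the original report: `buildQuery` and `probe` are hypothetical helpers, the 5s read timeout is an arbitrary choice, and the expectation that an ephemeral source port keeps working after the idle period is an assumption based on the description above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
	"time"
)

// buildQuery encodes a minimal DNS query for an A record (RFC 1035 wire format).
// Hypothetical helper for this sketch.
func buildQuery(id uint16, name string) ([]byte, error) {
	// Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
	msg := []byte{byte(id >> 8), byte(id), 0x01, 0x00, 0x00, 0x01, 0, 0, 0, 0, 0, 0}
	for _, label := range strings.Split(strings.TrimSuffix(name, "."), ".") {
		if len(label) == 0 || len(label) > 63 {
			return nil, errors.New("invalid label in " + name)
		}
		msg = append(msg, byte(len(label)))
		msg = append(msg, label...)
	}
	// Root label, QTYPE=A (1), QCLASS=IN (1).
	return append(msg, 0, 0, 1, 0, 1), nil
}

// probe sends one query from the given local UDP port (0 = ephemeral)
// and waits for a reply carrying the same DNS message ID.
func probe(localPort int, server string, timeout time.Duration) error {
	raddr, err := net.ResolveUDPAddr("udp4", server)
	if err != nil {
		return err
	}
	conn, err := net.DialUDP("udp4", &net.UDPAddr{IP: net.IPv4zero, Port: localPort}, raddr)
	if err != nil {
		return err
	}
	defer conn.Close()

	query, err := buildQuery(0x1234, "www.google.com")
	if err != nil {
		return err
	}
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	if _, err := conn.Write(query); err != nil {
		return err
	}
	buf := make([]byte, 512)
	n, err := conn.Read(buf)
	if err != nil {
		return err
	}
	if n < 2 || buf[0] != query[0] || buf[1] != query[1] {
		return errors.New("reply does not match the query ID")
	}
	return nil
}

func main() {
	const server = "8.8.8.8:53"
	const fixedPort = 50406 // same source port as the reproduction above

	if err := probe(fixedPort, server, 5*time.Second); err != nil {
		fmt.Println("baseline query failed, nothing to compare:", err)
		return
	}
	fmt.Println("baseline ok, sleeping 95s")
	time.Sleep(95 * time.Second)

	// Compare the reused fixed port against a fresh ephemeral port.
	fmt.Println("fixed port:    ", probe(fixedPort, server, 5*time.Second))
	fmt.Println("ephemeral port:", probe(0, server, 5*time.Second))
}
```

If the fixed-port probe reports a timeout while the ephemeral-port probe prints `<nil>`, that would point at the stale source IP:port rather than at the resolver or the upstream DNS server.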

fatanugraha, Sep 03 '24 04:09