Network connectivity is quite flaky with some parallel outgoing connections
lima 0.8.1, default networking setup:
When something that makes a bunch of connections per second is running in another window, fetching e.g. www.google.com fails or is quite slow. In practice, lots of apps report network unreachable. It is not a matter of bandwidth (the connections do not do much). Switching DNS to useHostResolver: false did not help.
mstenber@lima-f34 ~>curl https://www.google.com -o ,x
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 15246 0 15246 0 0 737 0 --:--:-- 0:00:20 --:--:-- 3168
mstenber@lima-f34 ~>curl https://www.google.com -o ,x
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 15179 0 15179 0 0 26583 0 --:--:-- --:--:-- --:--:-- 26583
^ note the ~instant fetch when no other connections are going on
mstenber@lima-f34 ~>curl https://www.google.com -o ,x
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 15187 0 15187 0 0 955 0 --:--:-- 0:00:15 --:--:-- 3342
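Any loop that opens many short-lived outgoing connections per second reproduces the background load; a minimal sketch (the target URL and count are arbitrary):

# rough load generator: ~20 short-lived outgoing connections per second
while true; do
  for i in $(seq 1 20); do
    curl -s -o /dev/null https://www.google.com &
  done
  sleep 1
done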
You could try whether using VDE helps:
https://github.com/lima-vm/vde_vmnet
- #537
It seems that slirp is worse on Mac.
I had similar issues (though not quite the same) with podman machine too. I suppose it might even be a qemu issue of some kind. I'll try the vmnet; it looked a bit unnecessary (as all I really need is just container -> outside world connectivity), but if it works better..
Hmm, podman machine uses gvproxy
But maybe it still uses slirp for internet DNS?
gvproxy seems to perform much worse in terms of speed (that's why I switched to lima to start with); the weird part is that it is most likely not (only) DNS, since the result seems to be 'network unreachable' from TCP connect().
Seems to work better with VDE. Initially it didn't, though, as the slirp route was still the preferred default route:
mstenber@lima-f34 ~>ip route
default via 192.168.5.2 dev eth0 proto dhcp metric 100
default via 192.168.105.1 dev lima0 proto dhcp metric 101
192.168.5.0/24 dev eth0 proto kernel scope link src 192.168.5.15 metric 100
192.168.105.0/24 dev lima0 proto kernel scope link src 192.168.105.2 metric 101
Needed to run the following:
mstenber@lima-f34 ~>sudo ip route delete default via 192.168.5.2
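To double-check which interface outgoing traffic actually takes afterwards, plain iproute2 is enough (nothing lima-specific; 1.1.1.1 is just an arbitrary outside address):

# show the route/interface that would be used for a given outside destination
ip route get 1.1.1.1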
The configuration I had was:
networks:
- lima: shared
Maybe VDE should be packaged for Linux as well? https://wiki.alienbase.nl/doku.php?id=slackware:vde
There is the root requirement, but perhaps it could be narrowed down to a couple of sudo or setuid helpers.
Maybe VDE should be packaged for Linux as well?
No, because vde_vmnet doesn't work on Linux.
For Linux we should support TAP with qemu-ifup/qemu-ifdown.
I meant VDE (not vde_vmnet), but I think it's the same TUN/TAP.
But it would be more for performance, not as much for stability.
Much to my surprise VDE is quite slow; in my testing it was 7 times slower than the slirp user mode networking in qemu. And that was after fixing a bug in the vde_bridge code; before it was almost 350 times slower. Timings in https://github.com/virtualsquare/vde-2/pull/35.
So it might be better to do TUN/TAP directly without going through VDE, but that's just a guess until we benchmark...
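A rough way to benchmark the backends, assuming iperf3 is installed on both the host and in the guest (HOST_IP is a placeholder for the host's address on the network under test):

# on the host: start an iperf3 server
iperf3 -s

# in the guest: measure guest -> host throughput, then the reverse direction
iperf3 -c HOST_IP
iperf3 -c HOST_IP -R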
Trying to remember what libvirt uses; I just know it gets two eth interfaces.
Which in turn is mostly legacy from the VirtualBox setup, NAT + Host-Only.
EDIT:
-netdev tap,fd=35,id=hostnet0,vhost=on,vhostfd=36
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:a4:40:74,bus=pci.0,addr=0x2
-netdev tap,fd=37,id=hostnet1,vhost=on,vhostfd=38
-device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:2e:70:73,bus=pci.0,addr=0x3
qemu-syst 379204 libvirt-qemu 35u CHR 10,200 0t120 140 /dev/net/tun
qemu-syst 379204 libvirt-qemu 36u CHR 10,238 0t0 479 /dev/vhost-net
qemu-syst 379204 libvirt-qemu 37u CHR 10,200 0t120 140 /dev/net/tun
qemu-syst 379204 libvirt-qemu 38u CHR 10,238 0t0 479 /dev/vhost-net
Note: libvirt requires root, so it can set up all kinds of virtual bridges on the host.
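For comparison, a minimal manual TAP setup along those lines might look like this (a sketch only; the bridge br0 and the interface names are assumptions, and creating the tap/bridge still needs root):

# create a tap device owned by the current user and attach it to an existing bridge (assumed: br0)
sudo ip tuntap add dev tap0 mode tap user "$USER"
sudo ip link set tap0 master br0
sudo ip link set tap0 up

# hand the tap to qemu; vhost=on uses /dev/vhost-net just like the libvirt example above
qemu-system-aarch64 ... \
  -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=on \
  -device virtio-net-pci,netdev=net0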
Very non-scientific benchmark, but yeah, the VDE perf is pretty bad (most recent master built for the vde* tools):
slirp:
Fedora-Workstation-Live-aarch64-35-1.2.iso 100% 1831MB 343.3MB/s 00:05
vde:
Fedora-Workstation-Live-aarch64-35-1.2.iso 100% 1831MB 31.6MB/s 00:57
(Both of these are to the host machine, so no real network was harmed during the test)
I only get the desired behaviour on Lima with nerdctl running at the user level, even though the host has the same issue.
Every other permutation of the following attempts has also reproduced the behaviour:
- nerdctl at system level
- docker
- host-resolver on/off
- Alpine and Ubuntu as distros
- with/without vde_vmnet
I do notice a difference in nameservers between user- and system-level nerdctl.
$ nerdctl run --rm -it alpine -- cat /etc/resolv.conf
nameserver 10.0.2.3
nameserver 10.0.2.3
# nerdctl run --rm -it alpine -- cat /etc/resolv.conf
nameserver 192.168.105.1
nameserver 192.168.5.3
I don't actually know how containerd configures DNS (inside k3s it uses coredns; not sure how it works for user mode containerd).
The entries for system mode containerd look suspicious: 192.168.105.1 looks like a nameserver on the local network. How did that make it into the VM? Did you add it via dns in lima.yaml, or did you disable the host resolver?
192.168.5.3 is either a qemu-forwarded nameserver from /etc/resolv.conf on the host, or the lima host resolver (we override the qemu one using iptables).
Either way, you (we) should not have both nameservers in there, as they are not equivalent. Multiple nameservers in /etc/resolv.conf are supported for high availability: if one server is unreachable, the resolver will try another one. But they are all supposed to be interchangeable, all resolving all names the same way. There is no fallback from one nameserver to the other: if the first one responds with "domain not found", then that is the final answer.
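One way to see that the two servers are not interchangeable is to query each of them directly and compare the answers (dig is in bind-utils/dnsutils; myhost.home is a hypothetical LAN-only name):

dig @192.168.5.3 myhost.home +short
dig @192.168.105.1 myhost.home +short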
FWIW, I only get the single expected nameserver when using system mode containerd:
jan@lima-default:~$ sudo nerdctl run --rm -it alpine -- cat /etc/resolv.conf
nameserver 192.168.5.3
Sorry for the confusion, that is with the vmnet shared network enabled. Without that, it is the same as yours.
I see, thanks! That will be a potential source of problems:
jan@lima-default:~$ sudo nerdctl run --rm -it alpine -- cat /etc/resolv.conf
search home
nameserver 192.168.5.3
nameserver 192.168.105.1
jan@lima-default:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 2 (eth0)
Current Scopes: DNS
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.5.3
DNS Servers: 192.168.5.3
Link 3 (lima0)
Current Scopes: DNS
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.105.1
DNS Servers: 192.168.105.1
DNS Domain: home
Link 4 (nerdctl0)
Current Scopes: none
Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
It does pick up a search domain from the host, but the simplistic /etc/resolv.conf mechanism doesn't support split-DNS, so all requests should really just go to 127.0.0.53 and the other nameserver should not be listed as a fallback.
The /etc/resolv.conf in the VM is correct though:
jan@lima-default:~$ cat /etc/resolv.conf
[...]
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
[...]
nameserver 127.0.0.53
options edns0 trust-ad
search home
Again, I don't know how nerdctl and/or containerd configure DNS, and why it picks up the incorrect second nameserver.
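As a possible stopgap, the resolver can be pinned per container, assuming nerdctl supports the Docker-style --dns flag (recent versions appear to):

# pin the container resolver explicitly so the stray second nameserver is never consulted
nerdctl run --rm --dns 192.168.5.3 -it alpine cat /etc/resolv.conf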
But you said you get the flaky networking even without the additional vde_vmnet network (and DNS), so it should be immaterial for this issue.
I do not know whether I should add to this issue, but in my case it has nothing to do with the DNS resolver; even with a plain IP:port, the network issue still appears.
In my case, I use lima nerdctl compose up to start 5 services at the same time, and all 5 services depend on an outside registration and discovery service (this is my outgoing traffic, reached via IP:port) in order to start. The weird thing is that when we start the 5 services at the same time, they cannot start because of disconnections or timeouts with the outside registration service, but if we reduce it to 1 or 2 services, they start occasionally after a few retries; even curl responds with a timeout.
After I installed VDE, the timeout issue continues. But when I use lima nerdctl run to start the containers one by one, they start with no problems. It seems the issue only appears when I switch the stack to lima nerdctl compose, which makes all the network requests at once. Maybe I can do some network monitoring to track the packet sizes at startup.
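Something like the following inside the VM would capture the startup traffic toward the registry for inspection (REGISTRY_IP is a placeholder; eth0 is the slirp interface from the earlier route output):

# record TCP traffic to the registry while compose starts, for later analysis
sudo tcpdump -ni eth0 host REGISTRY_IP and tcp -w compose-start.pcap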
I used raw qemu before and tested several options for networking. I found that VDE is really slow compared to user-mode networking. I thought we could attach two interfaces to the VM: the user-mode device for outgoing traffic and a VDE one for connections between VMs (if needed).
I tried out PTP mode and it seems to be more stable for situations like https://github.com/lima-vm/lima/issues/561#issuecomment-1047132046. The download speed is okay and there is also a report of better DNS performance (when used for DNS).
But there are still two main issues:
- The upload speed is 90% slower.
- It appears to be incompatible with VPN connections; there are multiple reports of unsuccessful outgoing requests when a VPN connection is active.
It looks like the safest bet is still your recommendation, i.e. to limit VDE to providing a reachable address for the VM and to retain user-mode networking for normal traffic.
Hi all - is this the root cause for the Slirp issues: https://gitlab.freedesktop.org/slirp/libslirp/-/issues/35 ?
That looks like it.
Seems already fixed in libslirp v4.5.0 (18 May, 2021): https://gitlab.freedesktop.org/slirp/libslirp/-/merge_requests/73
I saw that; however, the issue described there is nearly identical to the current issue as well.
This particular one was fixed, but a follow-up problem surfaced in 0.12 (see https://github.com/lima-vm/lima/issues/1285). Closing this, though.