lima icon indicating copy to clipboard operation
lima copied to clipboard

Guest agent sometimes becomes unavailable from the host

Open roman-kiselenko opened this issue 1 year ago • 4 comments

Description

I've found this abiosoft/colima/issues/994 and it looks exactly like mine.

After sometime the created VM started to loose connection to guestagent due to missing ga.sock.

journalctl -e -u lima-guestagent.service -f
-- Boot 3c14c00ca1b045588ca9d3a840437668 --
Mar 02 15:34:31 lima-cri-tools systemd[1]: Started lima-guestagent.
Mar 02 15:34:33 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:33+03:00" level=info msg="event tick: 3s"
Mar 02 15:34:33 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:33+03:00" level=info msg="serving the guest agent on \"/run/lima-guestagent.sock\""
Mar 02 15:34:36 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:36+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:34:36Z systime=2024-03-02T15:34:36+03:00 delta=-150.01032ms"
Mar 02 15:34:46 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:46+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:34:46Z systime=2024-03-02T15:34:46+03:00 delta=-152.18532ms"
Mar 02 15:34:56 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:56+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:34:56Z systime=2024-03-02T15:34:56+03:00 delta=-155.54332ms"
Mar 02 15:35:06 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:06+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:35:06Z systime=2024-03-02T15:35:06+03:00 delta=-151.786742ms"
Mar 02 15:35:16 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:16+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:35:16Z systime=2024-03-02T15:35:16+03:00 delta=-146.093647ms"
Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true"
Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true"
Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true"
Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true"
Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true"
Mar 02 15:35:26 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:26+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:35:26Z systime=2024-03-02T15:35:26+03:00 delta=-149.117589ms"
Mar 02 15:35:27 lima-cri-tools systemd[1]: Stopping lima-guestagent...
Mar 02 15:35:27 lima-cri-tools systemd[1]: lima-guestagent.service: Deactivated successfully.
Mar 02 15:35:27 lima-cri-tools systemd[1]: Stopped lima-guestagent.
Mar 02 15:35:27 lima-cri-tools systemd[1]: Started lima-guestagent.
Mar 02 15:35:28 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:28+03:00" level=info msg="event tick: 3s"
Mar 02 15:35:28 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:28+03:00" level=info msg="serving the guest agent on \"/run/lima-guestagent.sock\""
Mar 02 15:35:34 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:34+03:00" level=debug msg="LocalPorts(): worthCheckingIPTables=true"
Mar 02 15:35:34 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:34+03:00" level=debug msg="Get info %+v{[{0.0.0.0 22} {127.0.0.1 35967} {127.0.0.53 53} {:: 22}]}"

The logs above shows that the lima-guestagent.service has been restarted (🤔 not sure the reason). This leads to the missing connection from the host:

WARN[0110] [hostagent] connection to the guest agent was closed unexpectedly

Missing connection leads to ga.sock deleting.

I tried to comment ga.sock deleting, and it appears to be a fix.

diff
diff --git a/pkg/hostagent/hostagent.go b/pkg/hostagent/hostagent.go
index 92c5073..ee4c9a3 100644
--- a/pkg/hostagent/hostagent.go
+++ b/pkg/hostagent/hostagent.go
@@ -605,7 +605,7 @@ func (a *HostAgent) watchGuestAgentEvents(ctx context.Context) {
                                }
                        }
                } else {
-                       logrus.WithError(err).Warn("connection to the guest agent was closed unexpectedly", err)
+                       logrus.WithError(err).Warn("connection to the guest agent was closed unexpectedly\n", err)
                }
                select {
                case <-ctx.Done():
@@ -728,13 +728,14 @@ func forwardSSH(ctx context.Context, sshConfig *ssh.SSHConfig, port int, local,
                                }
                        } else {
                                logrus.Infof("Forwarding %q (guest) to %q (host)", remote, local)
-                               if err := os.RemoveAll(local); err != nil {
-                                       logrus.WithError(err).Warnf("Failed to clean up %q (host) before setting up forwarding", local)
-                               }
-                       }
-                       if err := os.MkdirAll(filepath.Dir(local), 0o750); err != nil {
-                               return fmt.Errorf("can't create directory for local socket %q: %w", local, err)
+                               logrus.Infof("Delete files %q (local)", local)
+                               // if err := os.RemoveAll(local); err != nil {
+                               //      logrus.WithError(err).Warnf("Failed to clean up %q (host) before setting up forwarding", local)
+                               // }
                        }

hostagent reconnect and everything is working again.

limactl start crio-tools
.....
INFO[0106] [hostagent] Waiting for the guest agent to be running
INFO[0106] [hostagent] Forwarding "/run/lima-guestagent.sock" (guest) to "/Users/user/.lima/cri-tools/ga.sock" (host)
INFO[0106] [hostagent] Delete files "/Users/user/.lima/cri-tools/ga.sock" (local)
WARN[0106] [hostagent] Failed to set up forward from "/run/lima-guestagent.sock" (guest) to "/Users/user/.lima/cri-tools/ga.sock" (host)
WARN[0106] [hostagent] connection to the guest agent was closed unexpectedly
dial unix /Users/user/.lima/cri-tools/ga.sock: connect: no such file or directory
INFO[0116] [hostagent] Forwarding "/run/lima-guestagent.sock" (guest) to "/Users/user/.lima/cri-tools/ga.sock" (host)
INFO[0116] [hostagent] Delete files "/Users/user/.lima/cri-tools/ga.sock" (local)
INFO[0116] [hostagent] Guest agent is running
INFO[0116] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished"
INFO[0116] [hostagent] Not forwarding TCP [::]:22
INFO[0116] [hostagent] Not forwarding TCP 127.0.0.53:53
INFO[0116] [hostagent] Forwarding TCP from 127.0.0.1:44569 to 127.0.0.1:44569
INFO[0116] [hostagent] Not forwarding TCP 0.0.0.0:22
INFO[0116] [hostagent] The final requirement 1 of 1 is satisfied
INFO[0116] READY. Run `limactl shell cri-tools` to open the shell.

I see two options here (the second one is better, I guess, but I can't find the reason):

  1. In case of missing connection to the guestagent trying to reconnect and don't delete ga.sock
  2. Find a reason why lima-guestagent.service is failing.
lima.yml
vmType: null
os: null
arch: x86_64
images:
  - location: "https://cloud-images.ubuntu.com/releases/22.04/release-20240125/ubuntu-22.04-server-cloudimg-amd64.img"
    arch: "x86_64"
    digest: "sha256:f23e5b91af7f52bdd4680d8e9d091093bfc6c4a46b0da7baa3cb450ee917b66b"
cpus: 6
memory: 8GiB
disk: null
mounts:
  - location: "~/Documents/dev/opensource/cri-tools"
    mountPoint: null
    writable: true
    sshfs:
      cache: null
      followSymlinks: null
      sftpDriver: null
    9p:
      securityModel: null
      protocolVersion: null
      msize: null
      cache: "mmap"
  - location: "/tmp/lima"
    writable: true
mountType: 9p
additionalDisks:
ssh:
  localPort: 0
  loadDotSSHPubKeys: null
  forwardAgent: null
  forwardX11: null
  forwardX11Trusted: null
caCerts:
  removeDefaults: null
  files:
  certs:
upgradePackages: true
containerd:
  system: false
  user: false
cpuType:
  aarch64: null
  armv7l: null
  x86_64: null
rosetta:
  enabled: null
  binfmt: null
timezone: null
firmware:
  legacyBIOS: null
audio:
  device: null
video:
  display: null
  vnc:
    display: null
networks:
propagateProxyEnv: null
hostResolver:
  enabled: null
  ipv6: null
  hosts:
guestInstallPrefix: null
plain: null
> limactl --version
limactl version 0.20.1
> qemu-system-x86_64 --version
QEMU emulator version 8.2.1
Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers

roman-kiselenko avatar Mar 02 '24 12:03 roman-kiselenko

Is this issue specific to colima?

Do you see the issue with a plain Lima instance?

AkihiroSuda avatar Mar 05 '24 01:03 AkihiroSuda

Is this issue specific to colima?

Do you see the issue with a plain Lima instance?

This behaviour is specific to lima.

roman-kiselenko avatar Mar 05 '24 07:03 roman-kiselenko

Another related issue https://github.com/abiosoft/colima/issues/999#issuecomment-1987873799

roman-kiselenko avatar Mar 11 '24 08:03 roman-kiselenko

ref https://github.com/lima-vm/lima/issues/2064

roman-kiselenko avatar Apr 15 '24 07:04 roman-kiselenko