nixos-anywhere

`exec` is called, but then the SSH connection is lost

chessai opened this issue 2 years ago • 11 comments

Describe the bug

I'm trying to use nixos-anywhere to install NixOS onto a machine on my local network. The machine is a brand-new System76 Thelio Mega running Ubuntu 22.04. When running nixos-anywhere, everything seems to be going fine, until...

+ echo machine will boot into nixos in 6s...
+ test -e /dev/kmsg
+ exec
Warning: Permanently added '10.144.139.249' (ED25519) to the list of known hosts.
Warning: Permanently added '10.144.139.249' (ED25519) to the list of known hosts.
ssh: connect to host 10.144.139.249 port 22: Connection timed out
ssh: connect to host 10.144.139.249 port 22: Connection timed out
ssh: connect to host 10.144.139.249 port 22: No route to host
ssh: connect to host 10.144.139.249 port 22: No route to host
ssh: connect to host 10.144.139.249 port 22: No route to host
ssh: connect to host 10.144.139.249 port 22: No route to host
ssh: connect to host 10.144.139.249 port 22: No route to host
ssh: connect to host 10.144.139.249 port 22: No route to host
ssh: connect to host 10.144.139.249 port 22: No route to host
^C

...I lost the SSH connection to the target, and it was never regained.

Here is the full output.

The target box freezes, and I have to hard reboot. Upon rebooting, nothing seems to have changed.

To Reproduce

Steps to reproduce the behavior:

  1. Have a machine on your local network with Ubuntu 22.04 installed (not sure whether it matters if it's in the cloud...)
  2. sudo apt install net-tools (for ifconfig)
  3. sudo apt-get install openssh-server, sudo systemctl enable ssh --now
  4. sudo nano /etc/ssh/sshd_config and add the lines Port 22 and PermitRootLogin yes
  5. sudo systemctl restart ssh (so that the edited sshd_config takes effect; ssh is already running after step 3)
  6. nix run github:numtide/nixos-anywhere -- --flake github:chessai/thelio-mega#thelio-mega root@<ip of local target>
  7. Wait for it to call exec, lose the connection, and stay disconnected forever.

Expected behavior

Either a successful install, or a clear failure and exit.

System information

Source machine:

❯ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.1.25, NixOS, 23.05 (Stoat), 23.05.20230421.2362848`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.13.3`
 - channels(root): `"nixos-22.05pre343321.78cd22c1b86"`
 - channels(chessai): `""`
 - nixpkgs: `/nix/store/22z4n4mxs2vz3l3lg41dz3mgnq1d4wxs-source`

Target machine:

chessai@system76-pc:~$ uname -a
6.2.6-76060206-generic #202303130630~1679424972~22.04~4a8cde1~dev-Ubuntu SMP PREEMPT_DY x86_64 x86_64 x86_64 GNU/Linux

chessai@system76-pc:~$ cat /boot/config-6.2.6-76060206-generic | grep KEXEC
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
CONFIG_ARCH_HAS_KEXEC_PURGATORY=y
CONFIG_KEXEC_SIG=y
# CONFIG_KEXEC_SIG_FORCE is not set
CONFIG_KEXEC_BZIMAGE_VERIFY_SIG=y
CONFIG_KEXEC_JUMP=y
CONFIG_KEXEC_CORE=y
CONFIG_HAVE_IMA_KEXEC=y
CONFIG_IMA_KEXEC=y

Additional context

I stopped X on the machine and re-ran; this time it hit "ssh: connect to host 10.144.139.249 port 22: Connection timed out" and exited immediately.

chessai, May 16 '23

Can you run kexec directly on the machine, preferably from a TTY, and see what the output is? Run this as root:

curl -L https://github.com/nix-community/nixos-images/releases/download/nixos-unstable/nixos-kexec-installer-noninteractive-x86_64-linux.tar.gz | tar -xzf- -C /root
/root/kexec/run

I assume your machine has enough RAM, and the kernel also seems reasonably new.

Mic92, May 16 '23

An alternative is to boot NixOS from a USB stick and then run nixos-anywhere against its SSH server:

https://github.com/numtide/nixos-anywhere/blob/main/docs/howtos.md#installing-on-a-machine-with-no-operating-system

Mic92, May 16 '23

@chessai do you have a DHCP server running on your local network? I think I had the same issue because the installer got assigned a different IP address. Restarting nixos-anywhere with the IP assigned to the installer worked, although this is not really ideal.

rorosen, May 22 '23

The problem was that nixos-anywhere pushes its own kexec image, which did not have network access, and the machine was not plugged into the router via Ethernet. That makes sense; I just didn't think about it. For a consistent IP, I ended up setting up a DHCP server on the source machine (the one where I invoke nixos-anywhere) and linking the source and target with Ethernet.

chessai, May 30 '23

@chessai So the machine did not have any network connection? Is that what you are saying? Because

> For a consistent IP I ended up setting up a DHCP server on the source

...you shouldn't need DHCP in theory. nixos-anywhere dumps static IP addresses and routes before kexec to restore them afterwards. https://github.com/nix-community/nixos-images/blob/main/nix/kexec-installer/restore_routes.py
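To check ahead of time whether the address you SSH to would be among the restored ones, a minimal sketch like the following lists the target's addresses that the kernel does not flag as `dynamic`, i.e. roughly the "static" ones the comment above refers to. It assumes iproute2's one-line output format (`ip -o addr show`) and that DHCP leases carry the `dynamic` flag; `static_addrs` is a hypothetical helper, not part of nixos-anywhere or of restore_routes.py.

```shell
# Hypothetical helper: print "<interface> <address/prefix>" for every address
# that is NOT flagged "dynamic" (DHCP and SLAAC leases carry that flag).
# Reads the one-line format produced by `ip -o addr show` on stdin.
static_addrs() {
  awk '
    # $2 = interface name, $3 = address family keyword, $4 = address/prefix
    $2 != "lo" && ($3 == "inet" || $3 == "inet6") && $0 !~ / dynamic / {
      print $2, $4
    }
  '
}
```

Usage, on the target before running nixos-anywhere: `ip -o addr show | static_addrs`. If the address you pass to nixos-anywhere is missing from the output, it is presumably a DHCP lease and may change after kexec. IPv6 link-local (`fe80::`) addresses are also listed, since the kernel treats them as permanent.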

phaer, May 30 '23

It sounds like the original machine was connected via wifi and lost that connection after kexec, which is expected since we don't support wifi (yet?).

Lassulus, Jun 13 '23

Wifi is definitely out of scope for now; there are too many ways it can be configured.

Mic92, Jun 13 '23

Not sure if this is the same issue @chessai had, but I ran into a similar problem:

nixos-anywhere may fail to connect to the target after kexec if the target's network configuration initially used DHCP. The problem is that if the target's IP address changes after the kexec switch, the SSH connection to the kexec image fails. nixos-anywhere restores the original network configuration after the kexec switch, but it skips DHCP addresses, which can cause problems in some cases, such as when installing on a local (wired) network that uses dynamic addresses.

One way to fix this issue is to manually configure a static address for the target host before running nixos-anywhere.
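A minimal sketch of a temporary way to do that, assuming the restore step recognizes DHCP addresses by the kernel's `dynamic` flag (the thread only says it "skips DHCP addresses") and that the DHCP client on the target does not reset the lifetime before nixos-anywhere runs. `pin_dynamic_v4` is a hypothetical helper; it only prints `ip addr change` commands that would give each leased IPv4 address an infinite lifetime. Configuring the address in the distribution's own network configuration (e.g. netplan on Ubuntu) is the more durable option.

```shell
# Hypothetical helper: for each IPv4 address flagged "dynamic" in
# `ip -o -4 addr show` output (stdin), print an `ip addr change` command that
# would give it an infinite lifetime, so the kernel no longer marks it dynamic.
# It only prints the commands; review them before running them as root.
pin_dynamic_v4() {
  awk '
    # $2 = interface name, $4 = address/prefix
    $3 == "inet" && $0 ~ / dynamic / {
      printf "ip addr change %s dev %s valid_lft forever preferred_lft forever\n", $4, $2
    }
  '
}
```

Usage: `ip -o -4 addr show | pin_dynamic_v4`, then run the printed lines with sudo shortly before invoking nixos-anywhere.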

henrirosten, Oct 27 '23

DHCP addresses are usually re-requested, since the installer image has DHCP enabled. There is also always the option to build a custom kexec image with a standard network configuration built in (including wifi).

Mic92, Oct 28 '23

I have the same issue on both VPS providers I tested. Basically, all you need to do is connect via VNC and set the default gateway: sudo ip route add 10.0.0.1 dev ens3. But it's kinda annoying.
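To have the right values at hand when working from the VNC console, one option is to record the gateway before running nixos-anywhere. The sketch below parses `ip -4 route show default` output and prints an on-link host route to the gateway (as in the command above) followed by a default route via it; whether the second command is needed presumably depends on what survives kexec on a given provider. `gateway_cmds` is a hypothetical helper, and it assumes the interface keeps the same name inside the kexec image.

```shell
# Hypothetical helper: read `ip -4 route show default` output on stdin and
# print commands that re-create the default route from a console session:
# first an on-link host route to the gateway (needed when the gateway lies
# outside the interface's subnet, as on many VPSes), then the default route.
gateway_cmds() {
  awk '
    $1 == "default" {
      gw = ""; dev = ""
      # Scan "key value" pairs; stop at NF - 1 so $(i + 1) always exists.
      for (i = 2; i < NF; i++) {
        if ($i == "via") gw = $(i + 1)
        if ($i == "dev") dev = $(i + 1)
      }
      if (gw != "" && dev != "") {
        printf "ip route add %s dev %s\n", gw, dev
        printf "ip route add default via %s dev %s\n", gw, dev
      }
    }
  '
}
```

Usage: run `ip -4 route show default | gateway_cmds` on the target beforehand and keep the output. If a default route is still present after kexec, the second command fails with "File exists" and can be skipped.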

MrFoxPro, Jan 02 '24

I experienced the same issue on a CloudCone machine with Ubuntu 22.04 and 1 GB of RAM.

snylonue, Jul 12 '24

Is there a fix available for the IP change?

NikOverflow, Dec 23 '24

There are mitigations, but no silver bullet. There is always the option to boot our NixOS installer instead: https://github.com/nix-community/nixos-images?tab=readme-ov-file#iso-installer-images

Mic92, Dec 23 '24

Same problem here using a GreenCloud VPS.

LarsOlt, Jan 27 '25

Closed because this has become a collection issue for people who post "Same here" without knowing what the root cause of the original poster's problem is. This is not very useful or actionable. If some hardware doesn't support kexec, we cannot use nixos-anywhere on it with kexec. In that case, people have to boot a NixOS installer somehow so that nixos-anywhere doesn't have to run kexec. We have an ISO for this (https://github.com/nix-community/nixos-images?tab=readme-ov-file#iso-installer-images). If it is about implementing better DHCP support after kexec, that should go into a new issue.

Mic92, Jan 27 '25

> Closed because this has become a collection issue for people who post "Same here" without knowing what the root cause of the original poster's problem is. This is not very useful or actionable. If some hardware doesn't support kexec, we cannot use nixos-anywhere on it with kexec. In that case, people have to boot a NixOS installer somehow so that nixos-anywhere doesn't have to run kexec. We have an ISO for this (https://github.com/nix-community/nixos-images?tab=readme-ov-file#iso-installer-images). If it is about implementing better DHCP support after kexec, that should go into a new issue.

I don't understand this reasoning. My hardware did indeed support kexec, and yet nixos-anywhere failed. That other people are still mentioning it, nearly 2 years later, means that the issue persists. At the very least the documentation could be more clear.

chessai, Jan 27 '25

We already have an issue with a better description for when IP addresses change due to DHCP after kexec: https://github.com/nix-community/nixos-anywhere/issues/415

Mic92, Jan 27 '25

> At the very least the documentation could be more clear.

Please open a pull request for that.

Mic92, Jan 27 '25

> We already have an issue with a better description for when IP addresses change due to DHCP after kexec: https://github.com/nix-community/nixos-anywhere/issues/415

Okay, thanks for pointing me to that issue. It makes sense to close this one, then.

chessai, Jan 27 '25