terraform-provider-nix

Deploying nix_nixos can fail silently

Open epigramengineer opened this issue 4 years ago • 9 comments

Adding a non-trivial number of dependencies to the example NixOS configuration makes the deploy fail silently: it only partially applies. I first noticed this when only some packages were available after ssh-ing to the host, only some of the docker containers specified with virtualisation.oci-containers.containers had been pulled, and none had been started.

So far I've tried setting virtualisation.googleComputeImage.diskSize = 15000; to increase the disk size to 15 GB, but the problem persists. No error messages are propagated.

I don't know how to diagnose this problem further.

epigramengineer avatar Jan 29 '21 05:01 epigramengineer

Even with a minimal application deployed, updates such as adding an environment.systemPackages entry for a new package that wasn't included in the base image also fail silently.

epigramengineer avatar Jan 29 '21 05:01 epigramengineer

I am confused as to how this can happen. NixOS deployments are atomic, so I don't see how it could install only some packages.

Insufficient disk space does seem like a likely culprit. I am curious how nixos-rebuild is handling the out-of-space error.

andrewchambers avatar Jan 29 '21 06:01 andrewchambers

Hmm, perhaps the partial application only affected the docker images, which would be somewhat explained by an out-of-space problem. However, I am definitely still seeing a problem with the following steps:

  1. terraform plan -var google_cloud_project= -out plan && terraform apply plan
  2. ssh to the host and run commands
  3. Update configuration to include a new systemPackage (such as pkgs.fd)
  4. terraform plan -var google_cloud_project= -out plan && terraform apply plan
  5. ssh to the host, run fd which fails to find the command
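
Step 5 can be turned into a scripted check so that a silent failure at least shows up as a non-zero exit code. This is a minimal sketch, not part of the provider: `remote_has_cmd`, `verify_package`, and the `SSH` override variable are hypothetical names, and the target is assumed to be reachable over ssh as in the steps above.

```shell
#!/bin/sh
# SSH is a variable only so the check can be dry-run with a stand-in command.
SSH="${SSH:-ssh}"

# Hypothetical helper: succeed if the command is on PATH on the deployed host.
# $1 = ssh target (e.g. root@<ip_addr>), $2 = command name (e.g. fd)
remote_has_cmd() {
    $SSH "$1" -- command -v "$2" >/dev/null 2>&1
}

# Hypothetical helper: report the result and return non-zero when the
# package is missing, so a wrapper script can stop instead of carrying on.
verify_package() {
    if remote_has_cmd "$1" "$2"; then
        echo "ok: $2 is available on $1"
    else
        echo "FAIL: $2 is missing on $1 although apply reported success" >&2
        return 1
    fi
}
```

Running `verify_package root@<ip_addr> fd` after the second apply would exit non-zero in the failure case described here.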

I also checked df -h before and after, and neither was anywhere near the limit after I raised it to 15G with virtualisation.googleComputeImage.diskSize = 15000.

Are you able to verify with the example and a gcp f1-micro instance?

epigramengineer avatar Feb 01 '21 02:02 epigramengineer

Sorry, it's going to take me at least a few weeks to get back to this; I have a lot of work on my plate at the moment.

andrewchambers avatar Feb 01 '21 02:02 andrewchambers

Thanks for letting me know; I'll update if I find anything useful in the meantime.

epigramengineer avatar Feb 01 '21 02:02 epigramengineer

I figured out why this is happening by running the nixos-rebuild switch --target-host command myself and following the output. The GCE server is failing to resolve hostnames such as tarball.nixos.org. After ssh-ing into the host, it appears that no hostname resolves at all; for example, host www.google.com fails too.

Any ideas why DNS wouldn't be working? Networking is defined in google-compute-image.nix, and non-DNS network operations such as ping 8.8.8.8 succeed.
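
To separate a resolver problem from a routing problem in one pass, the two manual checks can be combined. A rough sketch, assuming `getent` and a Linux `ping` (with `-c` and `-W`) exist on the host; all function names are made up for illustration.

```shell
#!/bin/sh
# Run a command quietly and print "ok" or "fail" depending on its exit status.
probe() {
    if "$@" >/dev/null 2>&1; then echo ok; else echo fail; fi
}

# Turn the two probe results into a short diagnosis.
# $1 = name resolution result, $2 = raw-IP reachability result
classify_network() {
    case "$1:$2" in
        ok:ok)   echo "DNS and routing both work" ;;
        fail:ok) echo "DNS broken, routing fine: check the resolver configuration" ;;
        ok:fail) echo "DNS works but the probe IP is unreachable" ;;
        *)       echo "no connectivity at all" ;;
    esac
}

# $1 = hostname to resolve (e.g. tarball.nixos.org), $2 = IP to ping (e.g. 8.8.8.8)
diagnose_network() {
    classify_network "$(probe getent hosts "$1")" "$(probe ping -c 1 -W 2 "$2")"
}
```

With the symptoms described here, `diagnose_network tarball.nixos.org 8.8.8.8` would be expected to land in the `fail:ok` branch.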

epigramengineer avatar Feb 01 '21 05:02 epigramengineer

Pinging the host from the google-compute-image failed and the systemd google-network-daemon also has a No route to host error.

Looks like that might be out of date. I'll search around for an answer.

epigramengineer avatar Feb 01 '21 05:02 epigramengineer

Looks like that might have been an ephemeral problem. My latest hunch comes from the fact that this works:

NIXOS_CONFIG="$(pwd)/gce-deploy.nix" nixos-rebuild switch --target-host root@<ip_addr>

but this does not:

terraform plan -var google_cloud_project=<project> -out plan && terraform apply plan

so the failure may be due to a local environment change that is not picked up in the nix.go code.

I ran into a problem earlier where I needed to explicitly set export TMPDIR="/tmp" to have enough space to build the GCE image, so I'm not sure why that would only manifest on the nix_nixos resource and not on nix_build (the original image works fine).

Just some thoughts. I think I'm going to set this aside for a while, but if you have time to look into this more or figure out how to better propagate the logging from the commands, let me know.
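
On propagating the logging: the general idea is to capture the stderr and exit status of each command and surface both when it fails. The provider itself is written in Go, so this shell wrapper only illustrates the pattern; `run_logged` is a hypothetical helper, not something that exists in nix.go.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, and if it fails, print its exit status
# and captured stderr instead of dropping them.
run_logged() {
    log=$(mktemp) || return 1
    # Capture the status without tripping `set -e` in a calling script.
    "$@" 2>"$log" && status=0 || status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed (exit $status): $*" >&2
        sed 's/^/  stderr: /' "$log" >&2
    fi
    rm -f "$log"
    return "$status"
}
```

Wrapping the manual command, for example `run_logged nixos-rebuild switch --target-host root@<ip_addr>`, would make a DNS or disk-space failure visible along with its exit code.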

epigramengineer avatar Feb 02 '21 02:02 epigramengineer

Found this with TF_LOG=debug turned on:

2021-02-01T21:40:56.875-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:40:56 running [nixos-rebuild build --build-host localhost] in env [...]
2021-02-01T21:40:56.875-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: echo "This derivation is not meant to be built, aborting";
...
2021-02-01T21:40:56.890-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:40:56 [INFO] stderr: building Nix...
2021-02-01T21:40:57.872-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:40:57 [INFO] stderr: building the system configuration...
...
2021-02-01T21:41:02.227-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:02 running [sh -c exec timeout 10s ssh -o StrictHostKeyChecking=accept-new -o BatchMode=yes [email protected] -- true] in env []
2021-02-01T21:41:03.123-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:03 running [sh -c exec ssh -o StrictHostKeyChecking=accept-new -o BatchMode=yes [email protected] -- nix-collect-garbage -d] in env []
2021-02-01T21:41:05.155-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:05 [INFO] stderr: removing old generations of profile /nix/var/nix/profiles/system
2021-02-01T21:41:05.157-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:05 [INFO] stderr: removing generation 1
2021-02-01T21:41:05.157-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:05 [INFO] stderr: removing old generations of profile /nix/var/nix/profiles/per-user/root/channels
2021-02-01T21:41:05.292-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:05 [INFO] stderr: finding garbage collector roots...
2021-02-01T21:41:05.445-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:05 [INFO] stderr: deleting garbage...
2021-02-01T21:41:05.510-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:05 [INFO] stderr: deleting '/nix/store/trash'
2021-02-01T21:41:05.510-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:05 [INFO] stderr: deleting unused links...
2021-02-01T21:41:05.524-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:05 [INFO] stderr: note: currently hard linking saves -0.00 MiB
2021-02-01T21:41:05.528-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:05 [INFO] stdout: 0 store paths deleted, 0.00 MiB freed
2021-02-01T21:41:05.632-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:05 running [sh -c exec timeout 10s ssh -o StrictHostKeyChecking=accept-new -o BatchMode=yes [email protected] -- true] in env []

2021-02-01T21:41:07.317-0800 [DEBUG] plugin.terraform-provider-nix_v0.2.1: 2021/02/01 21:41:07 [INFO] stdout: /nix/store/rsxsjdz8slxmkk5jh3l24gzy2bphgh07-nixos-system-unnamed-20.09pre-git

It appears that the DoBuild function is called but does nothing, since it doesn't take in the nixos_config_path. I'm not sure if that is by design. Otherwise the logs seem to imply that the update is applied correctly, but the resulting nix store path is the same as after the first apply, even though I've changed the contents of the file referenced by nixos_config_path=./gce-deploy.nix.
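
One way to confirm this from the debug log alone is to pull out the nixos-system store path that each apply reports and compare the two. A small sketch, assuming the TF_LOG output of each apply was saved to its own file; the function names and log file names are hypothetical.

```shell
#!/bin/sh
# Print every nixos-system store path found on stdin, one per line.
extract_system_paths() {
    grep -o '/nix/store/[a-z0-9]*-nixos-system-[^ ]*'
}

# Print the last system path reported in a saved log file.
last_system_path() {
    extract_system_paths < "$1" | tail -n 1
}

# Compare two saved logs; return non-zero if the system path did not change,
# which suggests the edited configuration was never rebuilt.
# $1 = log from the first apply, $2 = log from the second apply
compare_applies() {
    before=$(last_system_path "$1")
    after=$(last_system_path "$2")
    if [ "$before" = "$after" ]; then
        echo "unchanged: $after"
        return 1
    fi
    echo "changed: $before -> $after"
}
```

For instance, `compare_applies first-apply.log second-apply.log` would print `unchanged: ...` for the behaviour described above.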

epigramengineer avatar Feb 02 '21 06:02 epigramengineer