Nix fails if my local cache (substituter) is offline, even when everything is available on the next one: cache.nixos.org

Open PaulGrandperrin opened this issue 3 years ago • 9 comments

I have many machines on my local network that run NixOS, and they used to pull all of their dependencies from the internet (cache.nixos.org).

Since they all use almost the same configuration, I set up my NAS to act as a local cache:

  nix.settings = {
    substituters = [
      "http://192.168.1.1:5000"
      "https://cache.nixos.org"
    ];
    trusted-public-keys = [
      "192.168.1.1:QwhwNrClkzxCvdA0z3idUyl76Lmho6JTJLWplKtC2ig="
    ];
  };

It works great, saves a lot of time, bandwidth, and resources on cache.nixos.org. I just need to update the NAS first.

My problem is that when the NAS is unavailable, nix stops working on all my machines. Same issue when I use my laptop outside of my local network.

For example:

$ nix shell nixos#konsole 
warning: error: unable to download 'http://192.168.1.1:5000/kbixjq5b2ddnv1vzj01knvrc5j0cbkyv.narinfo': Couldn't connect to server (7); retrying in 307 ms
warning: error: unable to download 'http://192.168.1.1:5000/kbixjq5b2ddnv1vzj01knvrc5j0cbkyv.narinfo': Couldn't connect to server (7); retrying in 520 ms
warning: error: unable to download 'http://192.168.1.1:5000/kbixjq5b2ddnv1vzj01knvrc5j0cbkyv.narinfo': Couldn't connect to server (7); retrying in 1195 ms
warning: error: unable to download 'http://192.168.1.1:5000/kbixjq5b2ddnv1vzj01knvrc5j0cbkyv.narinfo': Couldn't connect to server (7); retrying in 2116 ms
error: unable to download 'http://192.168.1.1:5000/kbixjq5b2ddnv1vzj01knvrc5j0cbkyv.narinfo': Couldn't connect to server (7)

and the command fails without installing konsole.

Then if I make the NAS available again, nix will successfully see that konsole was not present on the NAS and use cache.nixos.org instead.

This means that the logic to "try the next substituter" is already there, but it only works when the error from the first one is a 404, not when it's a failed connection.

I'd be happy to make a PR if someone could give me a pointer or two about where the offending code is.

PaulGrandperrin avatar Aug 12 '22 09:08 PaulGrandperrin

This would definitely be useful. It's also possible to share a store between two computers, but because of this issue both have to be running to prevent failures, which defeats the purpose of that setup.

rapenne-s avatar Aug 25 '22 09:08 rapenne-s

It would be useful to ping the stores at the beginning of a build if downloads are needed, and only request NARs from the stores that respond.

bew avatar Aug 25 '22 09:08 bew

I think https://github.com/NixOS/nix/blob/master/src/libstore/build/substitution-goal.cc#L62 could be the right place to try the substituters before using them

rapenne-s avatar Aug 25 '22 09:08 rapenne-s

Can't that check be done per NAR? I mean, the current logic already works at this level. It would also be more robust if a substituter goes down in the middle of a build.

PaulGrandperrin avatar Aug 25 '22 10:08 PaulGrandperrin

Maybe I wasn't clear, so let me rephrase.

Right now we have:

For each NAR, if we get a 404, we try the next substituter.

I would like to change that to:

For each NAR, if we get a 404 or any kind of network error, we try the next substituter.

PaulGrandperrin avatar Aug 25 '22 10:08 PaulGrandperrin
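
The difference between the two rules above can be sketched as a toy model in plain Nix. This is only an illustration, not Nix's actual implementation: the `lookup` function, the `skipUnreachable` flag, and the `outcome` strings are all hypothetical names invented for the sketch.

```nix
# Toy model only; every name here is hypothetical.
# Each substituter carries the outcome of a narinfo lookup:
#   "hit"         -> the path is available
#   "miss"        -> HTTP 404
#   "unreachable" -> connection failure
let
  lookup = { skipUnreachable }: substituters:
    let
      # Walk the list in order and stop at the first decisive outcome.
      step = acc: s:
        if acc.done then acc
        else if s.outcome == "hit" then { done = true; result = s.url; }
        # A 404 always moves on; a connection failure only does so
        # under the proposed rule.
        else if s.outcome == "miss" || skipUnreachable then acc
        else { done = true; result = "error: unable to download from ${s.url}"; };
      start = { done = false; result = "error: no substituter has the path"; };
    in (builtins.foldl' step start substituters).result;

  # The situation from the report: NAS offline, cache.nixos.org has the path.
  nasOffline = [
    { url = "http://192.168.1.1:5000"; outcome = "unreachable"; }
    { url = "https://cache.nixos.org"; outcome = "hit"; }
  ];
in {
  current  = lookup { skipUnreachable = false; } nasOffline;
  proposed = lookup { skipUnreachable = true; } nasOffline;
}
```

Saved to a file and evaluated with `nix-instantiate --eval --strict`, `current` stops at the unreachable NAS with an error string, while `proposed` moves on and yields `https://cache.nixos.org`.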

This seems to be the most active recent issue, but there are many, many complaints about this. I'd expect that almost everyone trying to run their own local nix-serve runs into this problem eventually. I probably haven't found them all, but the ones I did find are linked below:

  • https://github.com/NixOS/nix/issues/7127
  • https://github.com/NixOS/nix/issues/2661
    • closed due to inactivity
  • https://github.com/NixOS/nix/issues/4383
  • https://github.com/NixOS/nix/issues/3796

I didn't see a PR, so I may try to start tackling this issue myself this week. I see this as the main blocker to using my own devices as network-local caches.

A solution here does need to be more complicated than proposed above though. For example what happens if cachix goes down, as it has done on occasion? Should everybody start rebuilding the world on their own?

So I think we either need to add a separate list of "optional" substituters, or a flag that we can set to allow them to be unreachable. Personally I think adding an optional substituters list is the best approach, but I'm happy to be persuaded to a different solution.

arcuru avatar Oct 16 '22 19:10 arcuru

> A solution here does need to be more complicated than proposed above though. For example what happens if cachix goes down, as it has done on occasion? Should everybody start rebuilding the world on their own?

I suggested such a simple solution only so that I could maybe try to implement it myself. I don't know the codebase at all and I'm very rusty with C++.

> So I think we either need to add a separate list of "optional" substituters, or a flag that we can set to allow them to be unreachable. Personally I think adding an optional substituters list is the best approach, but I'm happy to be persuaded to a different solution.

Yes, being able to specify which substituter is required and which one is just a cache seems like a good idea.

How would that interact with https://nixos.org/manual/nix/stable/command-ref/conf-file.html#conf-fallback ?

PaulGrandperrin avatar Oct 16 '22 20:10 PaulGrandperrin

Urgh, I was overthinking this. Thanks for the pointer to the fallback option, as that sort of does this already. I've explained in the linked PR.

I don't think there's a reason to have "required" and "optional" substituters. We should just check every substituter for a substitute, and fall back to building from source only if fallback = true.

arcuru avatar Oct 18 '22 20:10 arcuru
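
For reference, the configuration from the original report with the `fallback` setting enabled might look like the sketch below. `fallback` and `connect-timeout` are existing nix.conf settings, but the timeout value is an assumption chosen for illustration, and nothing in this thread confirms that this combination avoids the failure when the NAS is offline.

```nix
{
  nix.settings = {
    substituters = [
      "http://192.168.1.1:5000"
      "https://cache.nixos.org"
    ];
    trusted-public-keys = [
      "192.168.1.1:QwhwNrClkzxCvdA0z3idUyl76Lmho6JTJLWplKtC2ig="
    ];
    # Build from source when a binary substitute cannot be obtained.
    fallback = true;
    # Assumed value: seconds to wait when connecting to a substituter.
    connect-timeout = 5;
  };
}
```

Note that with `fallback = true` a substitution failure can lead to a local build from source, which may be slow for large packages.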

I think this is entirely a duplicate of #3514, and we should probably only keep one of the two issues open?

me-and avatar Aug 15 '24 10:08 me-and