Nix fails if my local cache (substituter) is offline, even when everything is available from the next one: cache.nixos.org
I have many machines on my local network that are using NixOS, and they used to pull all of their dependencies from the internet (cache.nixos.org).
Since they are all using almost the same configuration, I set up my NAS to act as a local cache:
```nix
nix.settings = {
  substituters = [
    "http://192.168.1.1:5000"
    "https://cache.nixos.org"
  ];
  trusted-public-keys = [
    "192.168.1.1:QwhwNrClkzxCvdA0z3idUyl76Lmho6JTJLWplKtC2ig="
  ];
};
```
It works great, saves a lot of time, bandwidth, and resources on cache.nixos.org. I just need to update the NAS first.
My problem is that when the NAS is unavailable, nix stops working on all my machines. Same issue when I use my laptop outside of my local network.
For example:
```
$ nix shell nixos#konsole
warning: error: unable to download 'http://192.168.1.1:5000/kbixjq5b2ddnv1vzj01knvrc5j0cbkyv.narinfo': Couldn't connect to server (7); retrying in 307 ms
warning: error: unable to download 'http://192.168.1.1:5000/kbixjq5b2ddnv1vzj01knvrc5j0cbkyv.narinfo': Couldn't connect to server (7); retrying in 520 ms
warning: error: unable to download 'http://192.168.1.1:5000/kbixjq5b2ddnv1vzj01knvrc5j0cbkyv.narinfo': Couldn't connect to server (7); retrying in 1195 ms
warning: error: unable to download 'http://192.168.1.1:5000/kbixjq5b2ddnv1vzj01knvrc5j0cbkyv.narinfo': Couldn't connect to server (7); retrying in 2116 ms
error: unable to download 'http://192.168.1.1:5000/kbixjq5b2ddnv1vzj01knvrc5j0cbkyv.narinfo': Couldn't connect to server (7)
```
and the command fails without installing konsole.
Then if I make the NAS available again, nix will successfully see that konsole was not present on the NAS and use cache.nixos.org instead.
This means that the logic to "try the next substituter" is already there, but it only works when the error from the first one is a 404, not when the connection fails.
I'd be happy to make a PR if someone could give me a pointer or two about where the offending code is.
This would definitely be useful. It's also possible to share a store between two computers, but because of this issue both have to be running to prevent failures, which defeats the purpose of the setup.
It would be useful to have an initial ping to the stores at the beginning of a build if downloads are needed, and to only request NARs from stores that are responding.
I think https://github.com/NixOS/nix/blob/master/src/libstore/build/substitution-goal.cc#L62 could be the right place to try the substituters before using them
Can't that check be done per NAR? I mean, the current logic already works at this level. It would also be more robust if a substituter goes down in the middle of a build.
Maybe I wasn't clear, so let me rephrase.
Right now we have:
For each NAR, if we get a 404, we try the next substituter.
I would like to change that to:
For each NAR, if we get a 404 or any kind of network error, we try the next substituter.
This seems to be the most active recent issue, but there are many, many complaints about this. I'd expect that almost everyone trying to run their own local nix-serve runs into this problem eventually. I probably haven't found them all, but the ones I found are linked below:
- https://github.com/NixOS/nix/issues/7127
- https://github.com/NixOS/nix/issues/2661
  - closed due to inactivity
- https://github.com/NixOS/nix/issues/4383
- https://github.com/NixOS/nix/issues/3796
I didn't see a PR so I may try to start tackling this issue myself this week. I see this as the main blocker behind being able to use my own devices as network local caches.
A solution here does need to be more complicated than proposed above though. For example what happens if cachix goes down, as it has done on occasion? Should everybody start rebuilding the world on their own?
So I think we either need to add a separate list of "optional" substituters, or a flag that we can set to allow them to be unreachable. Personally I think adding an optional substituters list is the best approach, but I'm happy to be persuaded to a different solution.
> A solution here does need to be more complicated than proposed above though. For example what happens if cachix goes down, as it has done on occasion? Should everybody start rebuilding the world on their own?
I was suggesting such a simple solution just to maybe try to implement it myself. I don't know the codebase at all and I'm very rusty with C++.
> So I think we either need to add a separate list of "optional" substituters, or a flag that we can set to allow them to be unreachable. Personally I think adding an optional substituters list is the best approach, but I'm happy to be persuaded to a different solution.
Yes, being able to specify which substituter is required and which one is just a cache seems like a good idea.
How would that interact with https://nixos.org/manual/nix/stable/command-ref/conf-file.html#conf-fallback ?
Urgh, I was overthinking this. Thanks for the pointer to the fallback option, as that sort of does this already. I've explained in the linked PR.
I don't think there's a reason to have "required" and "optional" substituters. We should just be checking every substituter for a substitute, and fall back to building from source only if fallback = true.
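For reference, here is a minimal sketch of how the `fallback` setting could be enabled in the NixOS configuration from the original post. The `fallback` and `connect-timeout` settings are documented nix.conf options. The timeout value of 5 seconds is only an illustrative assumption, and whether this combination actually avoids the failure on an unreachable substituter depends on the Nix version and has not been verified here.

```nix
{
  nix.settings = {
    substituters = [
      "http://192.168.1.1:5000" # local NAS cache, listed first
      "https://cache.nixos.org"
    ];
    # Build from source if a binary substitute fails.
    fallback = true;
    # Assumption: give up on an unreachable cache after a few seconds
    # instead of waiting for the default connection timeout.
    connect-timeout = 5;
  };
}
```

Note that `fallback` only controls building from source. It does not by itself make Nix move on to the next substituter after a connection error, which is the behavior this issue asks for.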
I think this is entirely a duplicate of #3514, and we should probably only keep one of the two issues open?