Bug: The build fails if a build machine/cache is offline
Describe the bug
I set up my desktop computer as a build machine and binary cache for my laptop. When I turn off my desktop, every build on my laptop fails.
I can't tell if it is because the desktop is a build machine, or because it's a binary cache, but in both cases this should not be happening.
Steps To Reproduce
- Set up machine B as a build machine and binary cache for machine A
- Turn off machine B
- Run nix-build on machine A
Expected behavior
A builds everything itself.
nix-env --version output
$ nix-env --version
nix-env (Nix) 2.3.6
@edolstra How would I go about pushing this forward?
A good first step is to follow the issue template: https://github.com/NixOS/nix/issues/new?assignees=&labels=bug&template=bug_report.md&title=
@zimbatm Is this better?
I marked this as stale due to inactivity. → More info
Still relevant.
@rickynils might be interested in pursuing this since he is working on nixbuild.net
Still relevant.
In NixOS, nix.binaryCaches is a list, so hard-failing when the first item is offline is a bug and completely counter-intuitive. Also, the error message should print a suggestion to use --option substituters (or whichever workaround is currently accepted).
@ZoomRmc can you expand on how to use --option substituters? I don't see anything about it in nixos-rebuild --help.
edit: I'll just add it myself since I found it elsewhere
--option substitute false
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/ignore-offline-substituters/15450/4
I can't tell if it is because the desktop is a build machine, or because it's a binary cache, but in both cases this should not be happening.
For me, offline remote builders cause no issue, just a quick:
cannot build on 'ssh://[email protected]': error: cannot connect to '[email protected]': ssh: Could not resolve hostname superfastmachine.local: Name or service not known
And then it continues to build on the local machine. Previously I used IP addresses instead of hostnames, and then it hung a lot longer before it continued building on the local machine.
But when using binary cache like this:
substituters = http://superfastmachine.local:5000/ https://cache.nixos.org/
in nix.conf, and my pc wants to download something from there, I get:
warning: error: unable to download 'http://superfastmachine.local:5000/wbjfdccsii8wcnawlgg1a72i2vazfg4b.narinfo': Couldn't resolve host name (6); retrying in 336 ms
disabling binary cache 'http://superfastmachine.local:5000' for 60 seconds
error: unable to download 'http://superfastmachine.local:5000/0wxn3wnk6qiv5kzl0w8abv9jzh8szgqz.narinfo': Couldn't resolve host name (6)
error: unexpected end-of-file
I then need to pass --option substituters https://cache.nixos.org to exclude the unavailable binary cache before the build can finish.
I thought fallback = true in nix.conf would help, but it did not.
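The manual override above could be automated along these lines. This is only a sketch under a few assumptions: the caches are HTTP(S) binary caches that serve a nix-cache-info file, curl is installed, and the helper names probe and reachable_substituters are made up for illustration (they are not part of Nix).

```shell
# probe URL: succeeds if the cache answers for its nix-cache-info file within 3 seconds.
# (Assumption: HTTP(S) caches only; ssh-ng:// stores would need a different check.)
probe() {
  curl --silent --fail --max-time 3 --output /dev/null "$1/nix-cache-info"
}

# reachable_substituters URL...: prints the URLs whose probe succeeded, space-separated.
reachable_substituters() {
  local url out=""
  for url in "$@"; do
    url="${url%/}"                 # drop one trailing slash so the probe URL is well-formed
    if probe "$url"; then
      out="${out:+$out }$url"      # append, separated by a single space
    fi
  done
  printf '%s\n' "$out"
}

# Example invocation (left commented out so that sourcing this file has no side effects):
#   nix-build --option substituters \
#     "$(reachable_substituters http://superfastmachine.local:5000/ https://cache.nixos.org/)"
```

If no cache answers, the list is empty and Nix is given no substituters at all, so everything would be built locally.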
I marked this as stale due to inactivity. → More info
unstale bot
Still important.
Related: https://github.com/NixOS/nix/issues/3796, https://github.com/NixOS/nix/issues/6901
I'm pretty sure fallback = true is supposed to fix this; I use this exact setup locally. I'm not sure why that didn't work for @afreakk, maybe it was a bug that has since been fixed. You'll also need to set connect-timeout = 5 (or some other low value), otherwise the build will hang for minutes. I talked about this in more detail here.
Also related is #7188, which should fix this without needing to set fallback = true.
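For trying the two settings without editing nix.conf, they can also be passed per invocation with --option. A minimal sketch follows; the wrapper name nix_offline_tolerant is hypothetical, and whether fallback actually covers an unreachable substituter depends on the Nix version, as the reports in this thread show.

```shell
# nix_offline_tolerant CMD [ARGS...]: run a Nix command with fallback enabled
# and a short connection timeout, leaving nix.conf untouched.
nix_offline_tolerant() {
  local cmd="$1"
  shift
  "$cmd" --option fallback true --option connect-timeout 5 "$@"
}

# Example invocation (commented out so that sourcing this file runs nothing):
#   nix_offline_tolerant nix-build
```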
Setting fallback = true does indeed allow me to build. However, it triggers a stream of "error: opening a connection to remote store 'ssh-ng://missing-server' previously failed" messages. It'd be nice to provide a way to mark the server as truly optional so that these messages can be avoided.
I set fallback = true like this:
# Setup the SSH keys for the machines we want to build against.
programs.ssh = {
  extraConfig = ''
    Host missing-server
      # <snip>
      # Use an aggressive timeout because we're not always on
      # the LAN
      ConnectTimeout 3
  '';
};
nix = {
  settings = {
    # <snip>
    # Private binary cache
    substituters = [ "ssh-ng://missing-server" ];
  };
  extraOptions = ''
    # Ensure we can still build when missing-server is not accessible
    fallback = true
  '';
};
If this isn't a footnuke I don't know what is
A similarly annoying behavior: when you have a private cache that requires an authentication token and that token has expired, builds fail.
Why can't nix just skip the substituter it can't access?
I'm not sure what the current status is of the two other PRs relating to this, https://github.com/NixOS/nix/pull/7188 and https://github.com/NixOS/nix/pull/8983, because both of them seem quiet.
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/what-is-the-best-practice-to-use-binary-cache-for-this-situation/66212/2
Hopefully fixed by #13301