nix icon indicating copy to clipboard operation
nix copied to clipboard

'Gracefully' fallback from failing substituters

Open philipwilk opened this issue 6 months ago • 3 comments

Motivation

Nix currently hard fails if a substituter is inaccessible, even when they are other substituters available, unless fallback = true. This breaks nix build, run, shell et al entirely. This would modify the default behaviour so that nix would actually use the other available substituters and not hard error.

Here is an example before vs after when using dotenv where I have manually stopped my own cache to trigger this issue, before and after the patch. The initial error is really frustrating because there is other caches available. image image

Context

https://github.com/NixOS/nix/issues/3514#issuecomment-2905056198 is the earliest issue I could find, but there are many duplicates.

There is an initial PR at https://github.com/NixOS/nix/pull/7188, but this appears to have been abandoned - over 2 years with no activity, then a no comment review in jan. There was a subsequent PR at https://github.com/NixOS/nix/pull/8983 but this was closed without merge - over a year without activity.

I have visualised the current and proposed flows. I believe my logic flows line up with what is suggested in https://github.com/NixOS/nix/pull/7188#issuecomment-1375652870 but correct me if I am wrong. Current behaviour: current Proposed behaviour: proposed

Charts in lucid

Possible issues to think about:

  • I could not figure out where the curl error is created... I can't figure out how to swallow it and turn it into a warn or better yet, a debug log.
  • Unfortunately, in contrast with the previous point, I'm not sure how verbose we want to warns/traces to be - personally I think that the warn that a substituter has been disabled (when it happens) is sufficient, and that the next one is being used, but this is personal preference.

Add :+1: to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

philipwilk avatar May 30 '25 21:05 philipwilk

Further consideration: Potentially a configuration option to mark a substituter as silently fail-able/optional - this would solve the issue of whether we want to log fails (some if not all) or not - such as in the use case of a LAN only cache... would still need to figure out how to silence the curl fails though.

philipwilk avatar May 31 '25 00:05 philipwilk

Further consideration: Potentially a configuration option to mark a substituter as silently fail-able/optional - this would solve the issue of whether we want to log fails (some if not all) or not - such as in the use case of a LAN only cache... would still need to figure out how to silence the curl fails though.

Without a json config #8373, I think per-store configuration is nasty to implement (builders/machines config is just list in list) if not impossible, which would also make this currently infeasible...

philipwilk avatar Jun 01 '25 19:06 philipwilk

I think there's a flake (literally) edge case where it still hard errors out in a flake - not sure what is causing this, as I only just tested it working fine... and will need to taken another look at it.

philipwilk avatar Jun 02 '25 02:06 philipwilk

Is this PR still in the works? :) Would love to see this 🧡

Jonas-Sander avatar Jul 09 '25 22:07 Jonas-Sander

currently in the process of moving house so dont have the pc to dev on, but will look back into it asap when available.

philipwilk avatar Jul 10 '25 10:07 philipwilk

~Took another look just now and i think switching the for loops around in Store::querySubstitutablePathInfos is enough to fix it, it seemed to work when i pulled in packages by running nix shell nixpkgs#(name) in the installPhase environment, but I'm struggling a bit to figure out how to patch nix post 2.26 so I can test it on my whole system in different situations.~

(final: prev: {
  nixVersions = prev.nixVersions.extend (
    finalNixV: prevNixV: {
      nixComponents_2_30 =
        (prevNixV.nixComponents_2_30.overrideScope (
          finalScope: prevScope: {
            aws-sdk-cpp = null;
            withAWS = false;
          }
        )).appendPatches
          [ ./ffb2ee8eefc9946a8ebb406cb1fe02734549406d.patch ];
    }
  );
})
EDIT: patch was being flaky, latest version is stable

philipwilk avatar Aug 31 '25 22:08 philipwilk

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/announcing-ncps-a-nix-cache-proxy-server-for-faster-builds/58166/18

nixos-discourse avatar Sep 03 '25 01:09 nixos-discourse

I am reasonably sure that this is the shortest solution.

There is one small edge whereby it can interpret nginx (or whatever your proxy you use in front of your unreliable cache) taking a while to respond on the first request (ex, the moment your cache goes down and nginx is taking exceptionally long trying to connect to it) as a timeout. This is not critical as timeouts are regarded as transient and it tries again momentarily, after which it'll get the 502/504 proxy error and will disable that cache and continue with the next one. This could be solved by increasing the default timeout time for substituters (that way the 502/504 would actually get received the first time around instead of dropped). The main impact this has for users is that a timeout is logged as an error whereas disablement due to a cache being gone is not.

philipwilk avatar Sep 06 '25 03:09 philipwilk

also quick CTA out to interested people to try break this patch rn while nixcon is going on and i have time to look at it (message me on matrix and/or find me on saturday or sunday idk)

philipwilk avatar Sep 06 '25 03:09 philipwilk

@philipwilk nice work!

One small thing, instead of checking whether we're at the end of iterators, I rather you just save the last exception. (You an do this with std::move on a nix::Error, for example, or std exception pointer). Then the logic becomes things like:

  • if beginning loop with a saved exception, log it
  • if left loop with a saved exception because no substituters left, throw.

Also, it might be nice to make the 5XX error handling a separate commit?

Ericson2314 avatar Sep 06 '25 03:09 Ericson2314

@Ericson2314 like that?

philipwilk avatar Sep 06 '25 12:09 philipwilk

Sorry, but are you vibe coding this? I bunch of other stuff changed in this revision that seems orthogonal to what I requested.

Ericson2314 avatar Sep 12 '25 12:09 Ericson2314

no, which part of do you think is vibed? i amended it to be two commits, the first one lets it continue to the next substituter if it fails, but that isnt very useful without the second commit that makes it disabled broken servers (that have 500'd) instead of hard throwing

philipwilk avatar Sep 12 '25 16:09 philipwilk

OK thanks. So I would prefer to do all HTTP status logic in a separate PR, even if the state machine logic is not so useful on its own. Is that OK?

Ericson2314 avatar Sep 12 '25 17:09 Ericson2314

Ah now that I finished revieweing, I feel better too. Sorry about that --- I think we just had a misunderstanding about what exactly I meant with these loops, and so I was like "huh? what is all this?". Hopefully the comments I just left now make sense!

Ericson2314 avatar Sep 12 '25 17:09 Ericson2314

OK thanks. So I would prefer to do all HTTP status logic in a separate PR, even if the state machine logic is not so useful on its own. Is that OK?

put that in https://github.com/NixOS/nix/pull/13971

philipwilk avatar Sep 12 '25 19:09 philipwilk

@philipwilk Thank so much! Last thing is can you put more information in the commit message? (We want the history to be meaningful without GitHub access, though that said, you don't need to inline the contents of the issue. Referring to it by number is fine.)

Ericson2314 avatar Sep 12 '25 21:09 Ericson2314

Actually, maybe I can fix up myself with a squash merge.

Ericson2314 avatar Sep 12 '25 21:09 Ericson2314

Sorry to bump this, but does this fix the linked issues? If so, why are they still open? If not, what is still needed to close them? Thanks!

jakobkukla avatar Sep 25 '25 12:09 jakobkukla

https://github.com/NixOS/nix/pull/7188, https://github.com/NixOS/nix/issues/3514 probably

philipwilk avatar Sep 26 '25 13:09 philipwilk

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/life-of-a-pull-request-against-nixos-nixpkgs/21691/15

nixos-discourse avatar Oct 01 '25 17:10 nixos-discourse