Builtin fetching should be representable by derivations
Is your feature request related to a problem? Please describe.
Built-in fetchers block the evaluator. It's not possible to instantiate a derivation until the hash of the fetched source is computed.
Instead of fetching a source (inputSrc) we could represent built-in fetchers using a derivation, similar to builder = "builtins:fetchurl";.
This has the advantage that fetching the input can be
- delayed
- parallelized
- not be done at all, when dependents are substitutable
Describe the solution you'd like
Represent built-in fetching using derivations.
Fetch these outputs using built-in implementations that are not sandboxed, but trusted by virtue of being in Nix (as we do with fetchTree etc)
When scheduling the derivations, request for the Nix client to perform addToStore instead of scheduling an actual derivation build.
Describe alternatives you've considered
A representation of builtin fetchers that doesn't piggyback on derivations. We'd be reinventing the wheel, and interleaving with drv scheduling would be hard.
Additional context
- #9061
nivis more efficient thanflake = falseflakeinputs- Can integrate with #6530 but non-atomic fetching might be harder than in the current situation
- source filter function needs to keep eval alive (as it currently also does, but it doesn't block)
- or just fetch it all when filtering needs to happen (as it currently is with FODs)
- or just let users use ca-derivations to do the filtering
Priorities
Add :+1: to issues you find important.
See also https://github.com/NixOS/nix/issues/4095. There are a few other things like this we should take care of.
The interaction with lazy trees is very interesting. A FUSE store like Tvix has makes it possible to have demand-driven realization / coterminating derivations in general. You can also imagine an RFC 92-like thing where you produce a directory of symlinks to derivations each of which continues the fetch for that subtree. That works very nicely with fixed output derivations and merkle hashing!
CC @flokli
The FUSE idea is interesting, but I would consider it to be a separate set of features.
Considering a typical build where I locally invoke nix build (let's call that client-side) but the derivations are build via the local nix daemon or remote builders (let's call them builders):
Would the built-in fetchers still be executed "client-side" or by the builders?
I am asking because one of the special powers of built-in derivations is their access to client-side authentication. If the execution of the fetches would move to the builder, this would often break.
Would the built-in fetchers still be executed "client-side" or by the builders?
Yes, I would like the client to "join" the builders, although this is not the only way to do it. This exposes the complexity of the hashing trust model. If I get it right off the top of my head, the client can be allowed to build content-addressed paths such as
- content-addressed output derivations
- fixed-output derivations
- builtin fetcher derivations with a verifiable hash (e.g. git tree hash) that's usable as a path hash,
- perhaps builtin fetchers that are trusted, relying on rewriting to make their use safe (ie input addressed derivations need to produce a different hash depending on the result of the trusted builtin fetcher).
Fwiw, it's equivalent to having a remote builder that's not trusted by the store.
one of the special powers of built-in derivations is their access to client-side authentication
Yes, what I'm suggesting will allow the client to perform auth, and without exposing credentials to the daemon. Just like fetching synchronously during eval, but without the synchronicity.
An alternate way to implement this, is to schedule on the client, so that it can choose execute the builtin fetcher derivations on its own, instead of dispatching to the daemon or elsewhere. What's nice about this is that it doesn't require any additions to the daemon's trust logic. Maybe this is the way to go?
content-addressed output derivations are still a bit problematic because while the store object is trustless, the "derivation, output -> content-address" entries are not.
For pure fetching this is no problem because we are doing fixed output derivations. For impure fetching we'd want to e.g. also return the commit object to e.g. proove the commit hash corresponds to a commit with the given tree hash.
I came here from #9570. I've prototyped the effect reading the flake json and feeding the information into nixpkgs.fetch*; Fixing this is a big performance improvement for my use case. Being able to avoid fetching (for me, large) input sources means reducing the fetch time for a substituter-hit from minutes to seconds, along with avoiding the additional network use and disk space usage, both of which are very noticable in terms of the user experience. It also fixes the --dry-run output in a big way by making the source fetches visible (rather than having to wait minutes before you can see the dry run output).
In terms of implementation, is there a big reason fetchtree couldn't become builder = "builtin:fetchtree" (similar to <nix/fetchurl>), which then uses the existing fetchtree logic? It seems that would mean that the existing credentials logic would work out of the box. I assume there is a reason relating to lazy trees that this isn't the current approach.
https://github.com/NixOS/nix/blob/b1c633c6bb2e2dbc8a2891edf0d0c6f9d31d890b/src/libexpr/fetchurl.nix#L19
It seems to me that if fetchTree evaluated to a derivation (with additional metadata) in call-flake it would move fetching to build time:
https://github.com/NixOS/nix/blob/b1c633c6bb2e2dbc8a2891edf0d0c6f9d31d890b/src/libexpr/flake/call-flake.nix#L15
@Ericson2314 also discusses some interaction with cleanSource in https://github.com/NixOS/nix/issues/4095 that I don't fully understand, so maybe there are good reasons that what I've suggested is not a simple/complete approach.
In terms of implementation, is there a big reason fetchtree couldn't become
builder = "builtin:fetchtree"(similar to<nix/fetchurl>) [...]?
Nope! That is exactly what I envision, at least :)
Is that potentially quite doable? I can't promise anything but I might be able to hack that together over the holidays.
Main question is if there are any big gotchas to take care of with the implementation.
@pwaller I expect there to be a few minor hoop to jump though, but no major gotchas. Off the top of my head:
libfetchersis downstream oflibstore. We should make a little "registration" system (see the others) to allowbuiltin:*derivations to be defined in downstream libraries- Not sure how much of the normal sandboxing is or is not carried over to builtin derivations. It could interfere with using the extra caches that
libfetchersuses.
It would be great to have you attempt this!
OK, no promises. If it's after Jan 5th with no updates, assume I'm not able to.
Update: I'm still interested to look at this but no progress. "If it's after Feb 5th with no updates from me, assume I'm not able to". If anyone wants/needs to work-steal, please do so (and say here so I don't waste time).
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/how-to-disable-automatic-unpacking-of-nix-flake-inputs/36911/5
@Ericson2314 @roberth : I foresee an issue with making this happen always given the existing fetchTree API.
Assumptions:
- Goal is to make
fetchTreeevaluate to a derivation. - Fetching should happen instead as a part of building the derivation (as opposed to during-eval).
- The derivation object should otherwise be API-compatible with
fetchTreeas it is defined today.- i.e. Any attributes available in on
foogivenfoo = fetchTree { ... }should continue to be available.
- i.e. Any attributes available in on
The issue is that fetchTree evaluates to something which depends on the result of fetching:
https://github.com/NixOS/nix/blob/06be819b896789b39036735acb7d9805f891e014/src/libexpr/primops/fetchTree.cc#L391-L401
In particular, for example given a fetchTree over a git repository specifying a rev, there are the narHash, outPath, lastModified and lastModifiedDate attributes, all of which are unknown until the git repository has been fetched. Basically anything done in emitTreeAttrs (fetchtree.cc) in principle.
I suppose a solution here would be that fetchTree could change its behaviour to [fetch-within-derivation] if these attributes are supplied as inputs (minus outPath which I assume is related to narHash). That would presumably at least cover the case where fetchTree is being called as a part of locked flake evaluation.
It seems to me that it would be unacceptable to change the evaluation result such that these attributes are not available, and there isn't a way to compute them lazily (such that immediate-fetch-within-eval would happen if those attributes are requested as a part of evaluation and are otherwise not supplied).
Current proposal: Make fetchTree evaluate to a derivation if narHash is supplied.
But also in the case of git in particular, other attributes would also have to be supplied up front (at least lastModified, revCount -- others such as lastModifiedDate and shortRev could presumably be computed).
Seem reasonable?
I'm a bit skeptical about conditional behavior which makes such a substantial difference in semantics.
Here's another option for the pile: What if we had two evaluation phases?
- Gather
fetchTreeexpressions, compare them to what's in the lock file, and fetch the sources as referenced if the lock file has gaps. - Evaluate the expressions with all store paths in place.
There's one potential advantage: the concern of actually obtaining the data is cleanly separated out.
I've dabbled with something similar to figure out source references in modules, but the general idea seems to be applicable more widely: https://github.com/fricklerhandwerk/module-sources
@fricklerhandwerk that is what happens with flake inputs, they just happen to be collected into a single easy-to-find-and-evaluate location. It is made much easier due to the restrictions of allowed values. Arbitrary collection of fetchTree values will be fraught with issues regarding one of them depending on the result of another, unless we adopt a similar language subset allowing for trivial eval.
An option I've played with is to allow the URL syntax in angle-brackets: import <github:NixOS/nixpkgs?ref=nixos-unstable> {}, which is interesting because it can be syntactically identified/parsed/evaluated without having to perform a Nix evaluation.
Interesting. Source reference evaluation doesn't have to be trivial or cheap in general though – it should only be cheap if all references are pinned. fetchTree would only return a store path once all lock file parameters it needs are in place. It follows that all remote store paths would have the complete lock file as input. This is probably not great, and could be resolved by splitting the lock file up into multiple.
I don't think that mandating "a canonical location for source references" is a desirable property, but merely a cludge to simplify implementation. As noted in the linked experiment, it completely breaks for modules. In my opinion the canonical location should be the lock file.
Secret third option: add a with derivation flag that the flake support code always passes? Then hard fail if the preconditions aren't met? Unsure how compatible this would be with existing versions if used in user code. I hope extraneous attrs are at most warnings but idk.
Secret third option: add a with derivation flag that the flake support code always passes?
I like this option. I'm interested to prototype this but only if there is some consensus this is a likely accept. It seems though that there is a route to improving the behaviour of flakes for the issue I encountered in #9570 and preserving backwards compatibility.
Also worth considering is a primop to explicitly defer the fetching to a derivation. This will be a more natural solution after the following changes:
- Make
fetchTreereturn a path value instead of a store path string inoutPath. This is closer to lazy-trees. - Make the
fetchTreereturn value not strict in metadata. It should return{ outPath = <lazy path value>; meta = <thunk>; }
This way, something like let src = fetchTree x; in builtins.fetchLater src.outPath does not need to fetch anything at evaluation time, unless the expression also needs src.meta.
If the path value is a filterPath source, it'd still fetch eagerly, because we don't have a way to preserve source filters (or any value/closure) beyond the lifetime of the evaluator.
Furthermore, this ought to be transparent wrt output hashing
- https://github.com/NixOS/nix/issues/9259
ie fetchLater would then be a no-op wrt output hashes, although the .drv hash would be different. Generally nobody cares about .drv hashes, because caching is all about the output hashes. FODs show that those two kinds of hashes are different concepts and we should lean into that, to make fetchLater more transparent.
builtins.fetchGit uses current user's git to fetch. That means it has access to current user's SSH keys, and thus it can fetch private sources through SSH. If it were converted in a derivation, a remote builder wouldn't have access to that (I hope!), and then that derivation would fail. Not sure if it affects this issue, but I think it's worth mentioning it.
@yajo Thank you. This was taken into account:
When scheduling the derivations, request for the Nix client to perform
addToStoreinstead of scheduling an actual derivation build.
"Nix client" here means the context where the evaluator and build scheduler run, and where access to user credentials is possible. (As opposed to "Nix daemon" or even remote builders, where this is indeed more complicated to pull off)