git-lfs support
Hi! This PR works, but there's definitely a better way to support LFS. Looking for maintainer feedback.
Motivation
nix fetches git repos using libgit2, which does not run filters by default. This means LFS-enabled repos can be fetched, but LFS pointer files are not smudged.
This change adds a lfs attribute to fetcher URLs. With lfs=1, when fetching LFS-enabled repos, nix will smudge all the files.
Context
See https://github.com/NixOS/nix/issues/10079.
Git Large File Storage lets you track large files directly in git, using git filters. A clean filter runs on your LFS-enrolled files before push, replacing large files with small "pointer files". Upon checkout, a "smudge" filter replaces pointer files with full file contents. When this works correctly, it is not visible to users, which is nice.
Limitations
Right now, for repos with lfs=1, the LFS-tracked files are materialized during nix flake lock - this is bad, I'm looking for a way to avoid this.
Priorities and Process
Add :+1: to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.
Small complication:
it seems that nix flake lock calls fetch which in turn calls Input::fetch -> InputScheme::fetch -> fetchToStore.
In other words, I don't currently see a way to:
- materialize LFS files when fetching
-sourcestore paths, but - don't materialize LFS files during
nix flake lock.
What use case do you have in mind? Isn't LFS typically for large files, that wouldn't usually affect evaluation anyway?
What use case do you have in mind? Isn't LFS typically for large files, that wouldn't usually affect evaluation anyway?
builtins.fetchGit populates the nix store with a <hash>-store path. This path is used as the source when building a derivation. Currently, the builder will see the unsmudged LFS pointer files, but I'd like the builder to optionally see smudged files. I agree that smudged files are usually not needed at eval time, but I don't see a good alternative of making them available at build time, besides a fixed-output derivation (but fixed-output derivations have their own problems)
A FOD seems optimal here, in general you shouldn't use builtins.fetchGit if you're only going to use it at build time.
In general I agree, but (afaik) other fetchers can't use git credentials.
This pull request has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/2024-03-11-nix-team-meeting-132/42960/1
@roberth have you had a chance to take a look at this issue? We have been staying at older versions of Nix as a workaround but newer versions now have fixes for critical issues so sticking to old ones would no longer be optimal.
Hi @b-camacho, thanks for the ping and sorry for the delay. This PR was assigned to me, but I hadn't prioritized it because it was a draft. Wrong assumption on my end, because I do think this is valuable, and I have some things to say :)
when fetching LFS-enabled repos, nix will smudge all the files.
That's a good start, but we need to make sure that the smudging happens in a controlled manner; otherwise we risk adding impurities.
Specifically, we should parse the attribute to check that they're supposed to be unsmudged by lfs; if not, ignore the smudge rule. It seems you were already investigating how this could be implemented.
Furthermore, we should validate the sha256 so that we don't increase the potential for silent errors by a whole external program. The hash should be easy to parse from the pointer file, and while reading other programs' inputs is a little ad hoc, I don't expect any serious issues from this, as we won't cause users to accidentally rely on a bug this way.
the LFS-tracked files are materialized during
nix flake lock- this is bad
This won't happen unnecessarily either of these are implemented
- #6530
- https://github.com/NixOS/nix/pull/10252
If we need to backtrack on the removal of narHashes (#6530), we can also avoid re-locking transitive inputs whose lock has already been computed by the dependency's lock.
- https://github.com/NixOS/nix/issues/7730
So yes, this isn't efficient yet, but it will be.
A FOD seems optimal here, in general you shouldn't use
builtins.fetchGitif you're only going to use it at build time.
A fixed output derivation works best when all you're using it for is as an input to another derivation (and it's publicly available, as mentioned).
However, if the "fetched" source is a flake (e.g. you have a flake.nix in a repo with LFS files), then you also need to evaluate files from the fetched source, which would constitute import from derivation, which is not optimal. Furthermore you'd need to produce fixed-output hashes for your local repo files, which is such horrible UX we don't need to consider it as a solution.
To summarize, this is worth implementing, I see no blocking issues, design or otherwise, and the following needs to be done:
- [x]
lfsattribute with defaultfalse, LGTM - [ ] figure out which files are LFS
- [ ] invoke the Git LFS filter from
$PATH; no need for a rigid dependency ormakeWrapper - [ ] check the sha256
- [ ] add a test, perhaps extending
tests/nixos/fetch-git - [ ] documentation for the
lfsattribute (currently underfetchGit's entry indoc/manual/src/language/builtins.md); mention the runtime dependency on the LFS package.
What's the state on this PR? Seems to unfortunately be a bit stale given the delayed review. This issue has been plaguing us for a while, so I'm willing to pick up the torch here and try to get this out the door (was actually starting to see how to fix this myself back in March when I saw this PR and decided to see what came out of this).
@kip93 I think your question was directed towards @b-camacho, but I'd like to add that we would welcome and support anyone who'd like to work on this.
Feel free to ask questions here or in the meetings if you can make them. We generally have some agenda, but we also like to make time for contributors during or after, when we often hang out while we get some things done. Link to the video conference is in the scratchpad linked there. We also have a matrix room, although personally I'm guilty of neglecting that one sometimes.
Thanks for the thorough writeup @roberth ! I owe you all an update. To avoid shelling out and implement some features we need to merge, I need a subset of a git-lft C/C++ client. I reimplemented one from Python into C++ here https://github.com/b-camacho/git-lfs-fetch-cpp.
Once I add some tests and integrate git-lfs-fetch-cpp here, we should be ready for another review!
I'm still on vacation with not-great internet, but back in 6 days and will update you all on 7/31 regardless.
Thanks for the feedback and sorry for the wait!
Oh, I don't think shelling out was such a big deal because we can verify the correctness of the result, kind of like how fixed output derivations are allowed to do "grossly impure" things because we can verify the output.
I guess a library implementation of it is still nice for a consistent UX with a small closure size though.
Really we need to reimplement the subset of git we use too.
To a very large degree we've done so. Since #9240 we're using libgit2 so that we use significantly fewer git behaviors by default, such as the overly open ended smudge filters and export-* attributes. We do still use the git command for fetching, which is essentially risk-free as far as reproducibility problems go. It doesn't really matter how the blobs and trees and commits arrive because we can check them thanks to them being content addressed.
We however notably only want shallow fetching (not --depth, but --filter=tree:0). Git doesn't do that well.
Hey! It's me again! I just want to ask if there's anything I can help with here. Maybe I can try and doing some testing, or do a smaller version of this that uses the git-lfs CLI tools while the full implementation gets done?
We have a lot of repos with LFS files that would greatly benefit from this, so I'm willing to do whatever work is needed, but also don't want to add extra work for others where it's not wanted.
Thanks for offering to help @kip93, I'll very happily take you up on that! Here's some background on fetchers (sorry if you already know all this). SourceAccessor abstracts over fs-like entities. Only method we care about is 2 versions of readFile (one reads entire file into memory, the other streams it via callbacks). There are 2 SourceAccessor subclasses for git access:
Brief gitlog perusing does not show why GitSourceAccessor is not rolled into GitExportIgnoreSourceAccessor, but currently they are both used - so we need to make our change to both.
For now, I only added it to GitSourceAccessor, adding this patch on top of your nixpkgs then running /where/you/built/nix/bin/nix-build -A pkgs.test-lfs should give you an outpath with a correctly smudged file.
The major issues right now:
-
readFileinGitSourceAccessorpulls the entire file into memory. this will OOM on many many files, so we need to override the streaming version ofreadFileinstead -
GitExportIgnoreSourceAccessorignores LFS for now - we do not know why
GitExportIgnoreSourceAccessorandGitSourceAccessorboth exist, figuring out exactly why they're both needed would be very helpful
In addition per Robert's comment above, we do still need a SHA check, some tests and docs. Tomorrow I'll investigate GitExportIgnoreSourceAccessor more, but I'm not very familiar with how nix handles git, so it's taking a while.
Ok I don't think we need to touch GitExportIgnoreSourceAccessor! In it's constructor, it takes an instance of SourceAccessor but in reality we always pass in GitSourceAccessor there - so the smudging will be applied either way. Aspects of GitEISA remain a mystery to me, like which constructor is being invoked here At first it looks like we're invoking an inherited constructor because of this using declaration but neither CFSA or FSA define any constructors that take a lambda as second arg, friends I show this to say it shouldn't compile. Yet it does.
Side quest mysteries aside, tomorrow I'll change GitSA::readFile to override the streaming version of SA::readFile.
neither CFSA or FSA define any constructors that take a lambda as second arg, friends I show this to say it shouldn't compile
ended up figuring this out, but it was not very interesting after all
also implemented the streaming readFile impl using Sink and cleaned up git-lfs-fetch.hh. now working on the e2e tests
oh not super important but if you have a sec @roberth - is this the right way to run the e2e nixos tests: nix build --impure '.#hydraJobs.tests.fetch-git' --show-trace ? I keep seeing failures in tests/functional/tarball.sh but AFAIK those should not even be running for the e2e nixos-based tests
right way to run the e2e nixos tests:
nix build --impure '.#hydraJobs.tests.fetch-git' --show-trace?
Yes.
I keep seeing failures in
tests/functional/tarball.sh
The VM tests depend on the nix package, which runs the functional tests, or at least it did until yesterday, so if you rebase, it will skip tarball.sh as a dependency of the VM tests. (nix-cli is from the newly componentized packaging)
You caught me in a really busy week, but hopefully next week I should be able to work with this at my day job, so I should have the time to throw in some real work. Though it seems to me like you've advanced a lot in the last 4 days, so let me know if there's still something I can help with next week (:
Except a URL parsing quirk I think it's ready! Added an e2e test and fixed some parsing, will be back tomorrow to touch it up. Currently the filter attribute matching still uses my homebrew pathspec impl, I also need to replace that with https://libgit2.org/libgit2/#HEAD/group/pathspec/git_pathspec_matches_path
Seems like when you went and asked for help I could not do so and now almost everything seems to be done. It looks to me the least I can do is throw this at our repos with a variety of lfs files to see if at least I can help test that it works as expected. Looking at the nixos tests I can think of an edge case or 2 that usually git just throws warnings about, and I'm not sure how this implementation would deal with those.
Seems like when you went and asked for help I could not do so and now almost everything seems to be done
that's on me, I should have been more responsive when you first offered!
least I can do is throw this at our repos with a variety of lfs files
that would be wonderful, but there is another mystery I would really like your help with @kip93! it seems something is wrong with path caching, but only in nixos tests.
when nix decides whether to fetch a store path or use a cached one, it uses a cache key that looks like this
cacheKey = fetchers::Cache::Key{"fetchToStore", {
{"name", std::string{name}},
{"fingerprint", *path.accessor->fingerprint},
{"method", std::string{method.render()}},
{"path", path.path.abs()}
}};
that fingerprint is what interests us - for git that's the repo rev + attributes like "useSubmodules" or "export-ignore" (and now also lfs). to get the new fingerprint, we simply optionally append ";l" to it when "lfs = true". this part works fine on my machine and in nixos tests.
however, in nixos tests we compute the same store path whether the ";l" suffix is there or not. You can verify this on your machine, pull latest of this branch and run nix build --impure '.#hydraJobs.tests.fetch-git' - I left in some debug prints proving the fingerprints are different. conversely, if you'd like to verify that the paths are different outside of the test, you can run
./outputs/out/bin/nix --extra-experimental-features nix-command eval --impure --raw --expr '(
builtins.fetchGit {
url = "[email protected]:b-camacho/test-lfs.git";
rev = "5536a6ce88a520ab059a04c8249cbe1f9295f7cc";
ref = "main";
lfs = true;
}
).outPath'
./outputs/out/bin/nix --extra-experimental-features nix-command eval --impure --raw --expr '(
builtins.fetchGit {
url = "[email protected]:b-camacho/test-lfs.git";
rev = "5536a6ce88a520ab059a04c8249cbe1f9295f7cc";
ref = "main";
lfs = false;
}
).outPath'
and observe the paths are different.
I don't know why two different fingerprints seem to produce the same cache key, especially why it only happens in the test VM, I'd love your help investigating further (I would probably start by printing the cache key?). the relevant code uses ample inheritance of implementation so it's a little hard to understand for me. Ofc if you're too busy (or would rather not look into this for any reason) lmk and I'll keep trying.
Interesting, I'll certainly have a look and let you know what I find out
Ok, just as an update of my progress (or lack thereof) so far:
- Took a while to parse through the code and try to understand the flow.
- Tried to get the cache key using
std::cerr << "nix::fetchToStore cacheKey:" << fetchers::attrsToJSON(cacheKey.value().second) << std::endl;but it looks ok ({"fingerprint":"8602f7662e242b18d27fdc830150e51d4fb5051b;e","method":"nar","name":"source","path":"/"}vs{"fingerprint":"8602f7662e242b18d27fdc830150e51d4fb5051b;e;l","method":"nar","name":"source","path":"/"}). - Disabled the cache code entirely, still gets the same path in both cases.
- Been going down the rabbit hole ever since, coming up with dead ends. :/
- My only suspect right now is the
treeHashToNarHashfunction ingit-utils, which assumes a NAR hash out of just a rev, but can't see where it's used and that would also fail outside of the VM? All around confusion.
Anyways, not don't have any more time this week to look at it, will try again next week
I got hopeful and went to check treeHashToNarHash but it doesn't get called when running in the VM, I'm equally stumped
Figured it out!
The storepath (ie the nix sha256) is not calculated from the fingerprint, but from contents. With --debug it actually says the cache missed
nix eval --debug --impure --raw --expr '(
builtins.fetchGit {
url = "http://gitea:3000/test/lfs";
rev = "755514e46a902f778e4ea9a087e3cafb970c1810";
ref = "main";
lfs = true;
}
).outPath'
evaluating file '<nix/derivation-internal.nix>'
locking path '/root/.cache/nix/gitv3/0iq989hd50hql2bp8bdpn06lddbhpd2fjl21xw9qw0drrr4l5yrr'
lock acquired on '/root/.cache/nix/gitv3/0iq989hd50hql2bp8bdpn06lddbhpd2fjl21xw9qw0drrr4l5yrr.lock'
lock released on '/root/.cache/nix/gitv3/0iq989hd50hql2bp8bdpn06lddbhpd2fjl21xw9qw0drrr4l5yrr.lock'
using cache entry 'gitLastModified:{"rev":"755514e46a902f778e4ea9a087e3cafb970c1810"}' -> '{"lastModified":1731733520}'
using cache entry 'gitRevCount:{"rev":"755514e46a902f778e4ea9a087e3cafb970c1810"}' -> '{"revCount":1}'
using revision 755514e46a902f778e4ea9a087e3cafb970c1810 of repo 'http://gitea:3000/test/lfs'
smudgeLfs: 1
getFingerprint: 755514e46a902f778e4ea9a087e3cafb970c1810;e;l
did not find cache entry for 'fetchToStore:{"fingerprint":"755514e46a902f778e4ea9a087e3cafb970c1810;e;l","method":"nar","name":"source","path":"/","store":"/nix/store"}'
copying '«git+http://gitea:3000/test/lfs?exportIgnore=1&ref=main&rev=755514e46a902f778e4ea9a087e3cafb970c1810»«unknown»/' to the store...
acquiring write lock on '/nix/var/nix/temproots/1317'
/nix/store/c5a3w5jsfsrh812gdvzljgg21ccnin3a-source
So why was it not smudging the file? Before smudging a pointer file we check if it has a filter=lfs attribute in .gitattributes. .gitattributes contains lines that look like this: *.big_ext filter=lfs..., that first part is the "pathspec" (=~glob pattern in libgit2 lingo). When I was checking if the pathspec matches the file path, I was using an absolute, instead of relative path and the leading / broke the pathspec matcher.
easy fix in retrospect! Thanks so much for your help running this down @kip93!
Separately, I cleaned up the patchspec matching code and added more tests.
I think we're ready for a review, lmk what else we need @roberth!
Nice! I did not look into the smudging logic itself, so figures I did not see that u.u
Anyway, then I'll proceed to do some manual testing (I think I can also add a new test or 2).
Also, while waiting for compilation here and there, I wrote some small docs that are probably necessary to add. Not sure if I can make a PR into a PR, but here's the commit in my own repo: b48dacd5