Pruning automatically
While there is existing discussion on trying to prevent LFS from keeping two copies of files (https://github.com/git-lfs/git-lfs/issues/1969 and https://github.com/git-lfs/git-lfs/issues/2147), running git lfs prune automatically could mitigate the issue for most users.
One way to do this right now is to add git lfs prune to the pre-push hook, but that's inconvenient. Adding an option to do this automatically would be more user-friendly. Thoughts?
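For example, something like this in .git/hooks/pre-push (a rough sketch; git lfs install generates everything up to the pre-push line, and the prune at the end is the manual addition):

```sh
#!/bin/sh
# Sketch only: 'git lfs install' generates the hook up to the pre-push line;
# the prune at the end is the manual addition being discussed here.
command -v git-lfs >/dev/null 2>&1 || { echo >&2 "git-lfs was not found on your path"; exit 2; }
git lfs pre-push "$@" || exit 1
# Remove local copies of old, already-pushed LFS objects.
git lfs prune
```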
Hey, thanks for writing in.
I will say I'm a bit nervous about putting in an automatic prune capability, simply because if something goes wrong, it's really easy to lose data. It's already not uncommon for people to make mistakes which cause them to lose their LFS objects, and making data loss easier for people is not something I want to do.
Also, it isn't actually possible to remove all of the objects at the pre-push stage anyway because some of the Git objects haven't been pushed yet, and we don't allow pruning LFS objects that are referenced by unpushed Git objects (because the server might not have them yet). There isn't a good hook point for us to automatically remove such objects, although adding it to the pre-push hook would prune most of the objects. I do, however, think that adding it to the hook is probably the right way forward for people who are very certain that they know what they're doing and want to minimize the number of LFS objects they have on disk.
The problem I'm facing is that we have artists who commit huge files frequently, which fills up their hard drives rapidly. The "solution" is to have them run git lfs prune every so often, but the reality is that they don't understand the implications of running it (and, to be fair, neither do I: I understand what it does, but I don't quite understand the downsides).
Wouldn't running git lfs prune --verify-remote be 100% safe?
Yes, that would be completely safe. It's also not especially performant because it traverses the entire history. git lfs prune can be slow in some cases, but it's much slower with --verify-remote.
While I'm not opposed to someone adding an option for that as part of the pre-push hook, it's going to be pretty slow on large repositories, so pushing may take a while. That may be an acceptable tradeoff, depending on the situation. I'm going to mark this as an enhancement so we can keep better track of it.
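For reference, the invocation and the matching config option look like this:

```sh
# Check each deletion candidate against the remote before removing it.
git lfs prune --verify-remote

# Or make every prune verify by default (documented in git-lfs-prune(1)).
git config lfs.pruneverifyremotealways true
```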
My use-case is for archiving and syncing photos: Hundreds of GB of never-changing raw camera images in Git LFS next to small xmp sidecar files containing editing metadata. While I do want to have one copy of the files locally on hard drives (manual fetch), I definitely would prefer a slow push over duplication in the cache directory.
I have another use-case: A build server which only runs git pull regularly and never pushes anything. Without a regular git lfs prune, the repository size quickly skyrockets after a few weeks. We have now added git lfs prune to our build pipeline, but I would prefer a git setting which runs prune regularly in the background (like git gc).
For the build server case, especially when lfs.storage is shared between unrelated repositories, it might be useful to delete the least recently used files until the total size of the remaining files drops below a configurable threshold; or perhaps the threshold should apply to free disk space instead. If the file system does not keep track of file access times, then file creation times might work nearly as well.
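Roughly what I have in mind, as a standalone sketch (hypothetical: the path and threshold are only examples, GNU find is assumed, and deleting objects behind git-lfs's back is unsupported and can force re-downloads or lose data the remote doesn't have):

```sh
#!/bin/sh
# Hypothetical LRU cleanup of a shared lfs.storage directory -- NOT a
# supported git-lfs operation; objects deleted here may be re-downloaded
# later, or lost entirely if no remote has them.
STORAGE="$HOME/.cache/lfs-storage"   # assumed shared lfs.storage path
LIMIT_KB=$((50 * 1024 * 1024))       # assumed threshold: 50 GB

while [ "$(du -sk "$STORAGE" | cut -f1)" -gt "$LIMIT_KB" ]; do
  # Pick the least recently accessed object file (GNU find syntax).
  oldest=$(find "$STORAGE" -type f -printf '%A@ %p\n' | sort -n | head -n 1 | cut -d' ' -f2-)
  [ -n "$oldest" ] || break
  rm -f -- "$oldest"
done
```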
By the way, can someone clarify what the current behaviour of git lfs prune and/or git lfs prune --verify-remote is when multiple LFS repositories share a common lfs.storage directory?
Would running git lfs prune in one of the repositories prune all the LFS files, or only the ones that belong to the current repository? If the answer is "it will prune all the files", does the same apply to --verify-remote? Which remote is checked by --verify-remote: the remote of the current repository, or the remote that the file belongs to?
I am using lfs.storage because I want my LFS files to be stored in a centralized location (for deduplication and storage), but it seems to me that the git lfs prune operation isn't currently implemented with this use case in mind.
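For reference, the setup I'm describing is nothing more than the global option pointed at one shared path (the path here is just an example):

```sh
# One shared object store for every repository on this machine
# (path is only an example).
git config --global lfs.storage /srv/lfs-objects
```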
The documentation for lfs.storage says this:
Note: you should not run git lfs prune if you have different repositories sharing the same storage directory.
I expect it will probably cause data loss, although I haven't looked at the code to verify that. I don't think --verify-remote will help here because if repositories A and B share an object O which is on the remote for A and not B, --verify-remote in A could delete the object, and that would cause data loss for B.
I expect it will probably cause data loss, although I haven't looked at the code to verify that. I don't think --verify-remote will help here because if repositories A and B share an object O which is on the remote for A and not B, --verify-remote in A could delete the object, and that would cause data loss for B.
I am not really concerned with the case where an object is shared between different remotes. Such a case should be really rare, and even if it did happen, it's not really data loss if the data still exists at least somewhere.
What I am concerned about is whether --verify-remote will delete blobs which don't belong to the current repository at all.
Say I have 2 repositories local/A and local/B, which share their LFS storage and have files a and b respectively. I would expect/prefer that running git lfs prune --verify-remote inside repo A would not delete the blob for file b (unless it knows that the correct remote for b is origin/B and can prove that b is present in that remote). Is that the case?
If not, I think that this might be a serious bug with lfs.storage/lfs prune.
You cannot safely run git lfs prune at all when you share storage between repositories. The documentation is very clear about this. Adding additional options doesn't make it safe, and if you do so anyway, you will probably experience data loss. It isn't a bug that this happens because it's clearly documented that it's unsafe to do so.
The --verify-remote option only checks reachable objects, not unreachable objects, so it doesn't prevent pruning in a shared storage situation. You really cannot safely do this.
It isn't a bug that this happens because it's clearly documented that it's unsafe to do so.
I'll have to disagree on that one. Documented problems with the software are still problems with the software.
- If prune operations don't work correctly when lfs.storage is specified, you should definitely refuse to execute them, so that the user doesn't lose the data accidentally. If it "should never be done", why allow unsuspecting users to run it? You can easily detect whether lfs.storage is set (see the sketch after this list).
- The documentation of verify-remote states that it "ensures that any LFS files to be deleted have copies on the remote before actually deleting them". This statement is factually wrong in this case. A small "Note:" at the beginning of the man page is not sufficient IMO. Also, that "Note:" directs the user to the git lfs config man page for "more details", but git lfs config has basically the same "Note:" and no more info.
- Actually, technically the verify-remote description might be wrong even without lfs.storage. If my understanding is correct, git lfs prune --verify-remote will also prune dangling/reflogged blobs that are not present on the remote. Although this behavior is probably correct in this context, it is technically NOT what the documentation says.
- I think that we should focus more on what ought to be, rather than how it is currently. What is the intended use case for lfs.storage? It is a global configuration value that accepts absolute paths. My impression/expectation was that it is thus intended to be shared across multiple repositories. If that is the case, how are we supposed to clean the lfs.storage when it grows over time? It's kind of silly to suggest that the current behavior is not a bug. You can't expect users to allocate infinite disk space for the lfs.storage, so it has to be cleaned somehow, right?
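Something like this guard is all I'm asking for (sketch only):

```sh
# Sketch of the guard from the first point above: refuse to prune when
# lfs.storage is set, since the store may be shared with other repos.
if git config --get lfs.storage >/dev/null 2>&1; then
  echo "lfs.storage is set; refusing to prune a possibly shared store." >&2
  exit 1
fi
git lfs prune
```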
lfs.storage is a way you can place the storage for your LFS data for a single repository on a different disk or location, since the data can be large. It wasn't originally designed as a way to share data across repositories. We also cannot know whether such data is shared without crawling every directory on the system. In that light, it makes perfect sense to prune data that is in such a location and to document that such a behavior is dangerous if using it in a way that wasn't intended.
Since this issue is about pruning data automatically and not about pruning data manually with lfs.storage, I'm going to ask you to open a new issue if you'd like to see different behaviors so that we can track this independently and avoid this issue getting off topic. If you just want to ask a question or discuss the behavior, the Discussions area is ideal for that. We don't want to hide unrelated topics in issues because it makes them hard to find for users and hard to act on for contributors.
lfs.storage is a way you can place the storage for your LFS data for a single repository on a different disk or location, since the data can be large. It wasn't originally designed as a way to share data across repositories.
Yes; however, I don't see a good way to accomplish this using lfs.storage without sharing it between different repositories. Since it's a git config option, the lfs.storage path will be the same for all repositories (unless you configure each one by hand, and even then, this wouldn't work when cloning new repositories).
We also cannot know whether such data is shared without crawling every directory on the system. In that light, it makes perfect sense to prune data that is in such a location and to document that such a behavior is dangerous if using it in a way that wasn't intended.
I understand that this is how it works at the moment. The question is whether it should work that way. I can understand if you think that this is outside the scope of git lfs and don't want this feature, but let's not pretend it's impossible to implement lfs.storage in a repository-aware way. I can easily come up with a few trivial solutions to this problem that don't require crawling every directory on the system. And if you don't want to support multiple repositories sharing an lfs.storage location, then that should be made clearer in the documentation.
Since this issue is about pruning data automatically and not about pruning data manually with lfs.storage, I'm going to ask you to open a new issue if you'd like to see different behaviors so that we can track this independently and avoid this issue getting off topic. If you just want to ask a question or discuss the behavior, the Discussions area is ideal for that. We don't want to hide unrelated topics in issues because it makes them hard to find for users and hard to act on for contributors.
My bad, I originally wrote in this issue because I was concerned that automatic pruning would lead to breakage when used together with lfs.storage (which apparently it would), and got a little carried away. I will create a separate issue for this.
Maybe hook into git's automatic garbage collection (the pre-auto-gc hook) to run git lfs prune, provided lfs.storage is not set up to be shared across multiple repos.
Git occasionally does garbage collection as part of its normal operation, by invoking git gc --auto. The pre-auto-gc hook is invoked just before the garbage collection takes place, and can be used to notify you that this is happening, or to abort the collection if now isn’t a good time.
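A sketch of such a hook (hypothetical; as discussed above, this is only safe when lfs.storage is not shared between repositories):

```sh
#!/bin/sh
# .git/hooks/pre-auto-gc -- sketch of the idea above. Runs just before
# 'git gc --auto'; exiting non-zero would abort the gc, so always exit 0
# and simply piggyback a prune on git's own gc schedule.
# Skip when lfs.storage is set, because that store may be shared.
if ! git config --get lfs.storage >/dev/null 2>&1; then
  git lfs prune
fi
exit 0
```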