
Pruning automatically

Nepoxx opened this issue on May 06 '19 · 14 comments

While there is existing discussion on trying to prevent LFS from keeping two copies of files (https://github.com/git-lfs/git-lfs/issues/1969 and https://github.com/git-lfs/git-lfs/issues/2147), running git lfs prune automatically could mitigate the issue for most users.

One way to do this right now is to add git lfs prune in the pre-push hook, but that's inconvenient. Adding an option to automatically do this would be more user-friendly. Thoughts?
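
For example, something like this appended to the hook that Git LFS installs (a sketch; the stock hook contents may differ slightly between versions):

```sh
#!/bin/sh
# .git/hooks/pre-push -- the stock Git LFS hook, with a prune added at the end.
command -v git-lfs >/dev/null 2>&1 || { echo >&2 "git-lfs was not found on your path"; exit 2; }
git lfs pre-push "$@" || exit $?

# Added line: delete local copies of old LFS objects that are safe to remove.
git lfs prune
```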

Nepoxx avatar May 06 '19 17:05 Nepoxx

Hey, thanks for writing in.

I will say I'm a bit nervous about putting in an automatic prune capability, simply because if something goes wrong, it's really easy to lose data. It's already not uncommon for people to make mistakes which cause them to lose their LFS objects, and making data loss easier for people is not something I want to do.

Also, it isn't actually possible to remove all of the objects at the pre-push stage anyway because some of the Git objects haven't been pushed yet, and we don't allow pruning LFS objects that are referenced by unpushed Git objects (because the server might not have them yet). There isn't a good hook point for us to automatically remove such objects, although adding it to the pre-push hook would prune most of the objects. I do, however, think that adding it to the hook is probably the right way forward for people who are very certain that they know what they're doing and want to minimize the number of LFS objects they have on disk.

bk2204 avatar May 06 '19 19:05 bk2204

The problem I'm facing is that we have artists who commit huge files frequently, and it fills up their hard drives rapidly. The "solution" is to have them run git lfs prune every so often, but the reality is that they don't understand the implications of running that (and to be fair, neither do I: I understand what it does, but I don't quite understand the downsides).

Wouldn't running git lfs prune --verify-remote be 100% safe?

Nepoxx avatar May 06 '19 19:05 Nepoxx

Yes, that would be completely safe. It's also not especially performant because it traverses the entire history. git lfs prune can be slow in some cases, but it's much slower with --verify-remote.

While I'm not opposed to someone adding an option for that as part of the pre-push hook, it's going to be pretty slow on large repositories, so pushing may take a while. That may be an acceptable tradeoff, depending on the situation. I'm going to mark this as an enhancement so we can keep better track of it.
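
If you want to gauge that cost before wiring anything into a hook, a dry run reports what would be pruned without deleting anything, for example:

```sh
# Report (but don't delete) prunable objects, checking them against the remote.
time git lfs prune --dry-run --verify-remote --verbose
```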

bk2204 avatar May 06 '19 20:05 bk2204

My use-case is for archiving and syncing photos: Hundreds of GB of never-changing raw camera images in Git LFS next to small xmp sidecar files containing editing metadata. While I do want to have one copy of the files locally on hard-drives (manual fetch), I definitely would prefer a slow push over duplication in the cache directory.

burnpanck avatar May 29 '19 19:05 burnpanck

I have another use-case: a build server which only runs git pull regularly and never pushes anything. Without a regular git lfs prune, the repository size quickly skyrockets after a few weeks. We have now added git lfs prune to our build pipeline, but I would prefer a git setting that runs prune regularly in the background (like git gc).
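
For reference, the step we added is roughly this (the remote and branch names are placeholders for whatever the build actually checks out):

```sh
# Update the working copy, then reclaim LFS objects the server already has.
git pull origin main
git lfs prune --verify-remote
```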

falco467 avatar Dec 22 '20 12:12 falco467

For the build server case, especially when lfs.storage is shared between unrelated repositories, it might be useful to delete the least recently used files until the total size of the remaining files falls below a configurable threshold; or perhaps the threshold should apply to free disk space instead. If the file system does not keep track of file access times, then file creation times might work nearly as well.
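
Roughly the policy I have in mind, as a purely illustrative shell sketch (this is not a git-lfs feature; it assumes GNU find, the path and budget are made up, and it only prints candidates instead of deleting them):

```sh
#!/bin/sh
# Walk a shared lfs.storage object directory, least recently used first,
# and list the files that would have to go to get under a size budget.
STORE=/srv/shared-lfs/objects        # example path
BUDGET_KB=$((100 * 1024 * 1024))     # example budget: 100 GiB, in KiB

used_kb=$(du -sk "$STORE" | cut -f1)

# GNU find: print "atime-in-seconds size-in-KiB path", oldest access first.
find "$STORE" -type f -printf '%A@ %k %p\n' | sort -n |
while read -r _ size_kb path; do
  [ "$used_kb" -le "$BUDGET_KB" ] && break
  echo "would delete: $path"
  used_kb=$((used_kb - size_kb))
done
```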

KalleOlaviNiemitalo avatar Dec 23 '20 13:12 KalleOlaviNiemitalo

By the way, can someone clarify what the current behaviour of git lfs prune and/or git lfs prune --verify-remote is when there are multiple LFS repositories sharing a common lfs.storage directory?

Would running git lfs prune in one of the repositories prune all the LFS files or only the ones that belong to the current repository? If the answer is "it will prune all the files", does the same apply to --verify-remote? Which remote is getting checked in --verify-remote? The remote of the current repository or the remote that the file belongs to?

I am using lfs.storage because I want my LFS files to be stored in a centralized location (for deduplication and storage), but it seems to me that the lfs prune operation isn't currently implemented with this use case in mind.
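
For context, my setup is just the global config key pointing every repository at one directory (the path here is only an example):

```sh
git config --global lfs.storage /srv/shared-lfs
```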

ruro avatar Jun 14 '21 22:06 ruro

The documentation for lfs.storage says this:

Note: you should not run git lfs prune if you have different repositories sharing the same storage directory.

I expect it will probably cause data loss, although I haven't looked at the code to verify that. I don't think --verify-remote will help here because if repositories A and B share an object O which is on the remote for A and not B, --verify-remote in A could delete the object, and that would cause data loss for B.

bk2204 avatar Jun 15 '21 17:06 bk2204

I expect it will probably cause data loss, although I haven't looked at the code to verify that. I don't think --verify-remote will help here because if repositories A and B share an object O which is on the remote for A and not B, --verify-remote in A could delete the object, and that would cause data loss for B.

I am not really concerned with the case where an object is shared between different remotes. Such a case should be really rare, and even if it did happen, it's not really data loss if the data still exists at least somewhere.

What I am concerned about is whether --verify-remote will delete blobs that don't belong to the current repository at all.

Say I have 2 repositories local/A and local/B, which share their LFS storage and contain files a and b respectively. I would expect/prefer that running git lfs prune --verify-remote inside repo A would not delete the blob for file b (unless it knows that the correct remote for b is origin/B and can prove that b is present on that remote). Is that the case?

If not, I think that this might be a serious bug with lfs.storage/lfs prune.

ruro avatar Jun 15 '21 19:06 ruro

You cannot safely run git lfs prune at all when you share storage between repositories. The documentation is very clear about this. Adding additional options doesn't make it safe, and if you do so anyway, you will probably experience data loss. It isn't a bug that this happens because it's clearly documented that it's unsafe to do so.

The --verify-remote option only checks reachable objects, not unreachable objects, so it doesn't prevent pruning in a shared storage situation. You really cannot safely do this.

bk2204 avatar Jun 15 '21 19:06 bk2204

It isn't a bug that this happens because it's clearly documented that it's unsafe to do so.

I'll have to disagree on that one. Documented problems with the software are still problems with the software.

  1. If prune operations don't work correctly when lfs.storage is specified, you should definitely refuse to execute them, so that the user doesn't lose data accidentally (see the sketch after this list). If it "should never be done", why allow unsuspecting users to run it? You can easily detect whether lfs.storage is set.
  2. The documentation of verify-remote states that it "ensures that any LFS files to be deleted have copies on the remote before actually deleting them". This statement is factually wrong in this case. A small "Note:" at the beginning of the man page is not sufficient IMO. Also, that "Note:" directs the user to the git lfs config man page for "more details", but git lfs config has basically the same "Note:" and no more info.
  3. Actually, technically the verify-remote description might be wrong even without lfs.storage. If my understanding is correct, git lfs prune --verify-remote will also prune dangling/reflogged blobs that are not present on the remote. Although this behavior is probably correct in this context, it is technically NOT what the documentation says.
  4. I think that we should focus more on what ought to be, rather than how it is currently. What is the intended use case for lfs.storage? It is a global configuration value that accepts absolute paths. My impression/expectation was that it is thus intended to be shared across multiple repositories. If that is the case, how are we supposed to clean up the lfs.storage directory when it grows over time? It's kind of silly to suggest that the current behavior is not a bug. You can't expect users to allocate infinite disk space for lfs.storage, so it has to be cleaned somehow, right?
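
A sketch of the guard I mean in point 1 (purely hypothetical; this is not how git lfs prune currently behaves):

```sh
# Hypothetical check at the start of a prune: bail out if a custom
# (and therefore possibly shared) storage directory is configured.
if storage=$(git config --get lfs.storage); then
  echo "error: lfs.storage is set to '$storage';" >&2
  echo "refusing to prune a possibly shared storage directory" >&2
  exit 1
fi
```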

ruro avatar Jun 15 '21 21:06 ruro

lfs.storage is a way you can place the storage for your LFS data for a single repository on a different disk or in a different location, since the data can be large. It wasn't originally designed as a way to share data across repositories. We also cannot know whether such data is shared without crawling every directory on the system. In that light, it makes perfect sense to prune data in such a location, and to document that doing so is dangerous when the option is used in a way that wasn't intended.
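
For example, the intended sort of usage is a per-repository setting that simply points at a bigger disk (the path here is only an example):

```sh
# Inside one particular repository:
git config lfs.storage /mnt/bigdisk/lfs/my-repo
```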

Since this issue is about pruning data automatically and not about pruning data manually with lfs.storage, I'm going to ask you to open a new issue if you'd like to see different behaviors so that we can track this independently and avoid this issue getting off topic. If you just want to ask a question or discuss the behavior, the Discussions area is ideal for that. We don't want to hide unrelated topics in issues because it makes them hard to find for users and hard to act on for contributors.

bk2204 avatar Jun 16 '21 13:06 bk2204

lfs.storage is a way you can place the storage for your LFS data for a single repository on a different disk or location, since the data can be large. It wasn't originally designed as a way to share data across repositories.

Yes, however, I don't see a good way to accomplish this using lfs.storage without sharing it between different repositories. Since it's a git config option, the lfs.storage path will be the same for all repositories (unless you configure each one by hand and even then, this wouldn't work when cloning new repositories).

We also cannot know whether such data is shared without crawling every directory on the system. In that light, it makes perfect sense to prune data that is in such a location and to document that such a behavior is dangerous if using it in a way that wasn't intended.

I understand that this is how it works at the moment. The question is whether it should work that way. I can understand it if you think that this is outside the scope of git lfs and don't want this feature, but let's not pretend that it's impossible to implement lfs.storage in a repository-aware way. I can easily come up with a few trivial solutions to this problem without crawling every directory on the system. And if you don't want to support multiple repositories sharing an lfs.storage location, then that should be made clearer in the documentation.

Since this issue is about pruning data automatically and not about pruning data manually with lfs.storage, I'm going to ask you to open a new issue if you'd like to see different behaviors so that we can track this independently and avoid this issue getting off topic. If you just want to ask a question or discuss the behavior, the Discussions area is ideal for that. We don't want to hide unrelated topics in issues because it makes them hard to find for users and hard to act on for contributors.

My bad, I originally wrote in this issue because I was concerned that automatic pruning would lead to breakage when used together with lfs.storage (which apparently it would), and got a little carried away. I will create a separate issue for this.

ruro avatar Jun 16 '21 13:06 ruro

Maybe hook into git's automatic garbage collection (the pre-auto-gc hook) and run git lfs prune there if lfs.storage is not set up to be shared between multiple repos.

Git occasionally does garbage collection as part of its normal operation, by invoking git gc --auto. The pre-auto-gc hook is invoked just before the garbage collection takes place, and can be used to notify you that this is happening, or to abort the collection if now isn’t a good time.
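
A sketch of what that hook could look like (not a built-in git-lfs feature; the lfs.storage check is just a crude way to skip repositories that might share storage):

```sh
#!/bin/sh
# .git/hooks/pre-auto-gc -- run alongside git's automatic gc.
# Only prune when this repository has no custom lfs.storage configured,
# since a custom path might be shared with other repositories.
if ! git config --get lfs.storage >/dev/null; then
  git lfs prune --verify-remote
fi
exit 0   # never abort the gc itself
```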

vbjay avatar Mar 22 '23 02:03 vbjay