Files from protected commits lose their history, show up as if in last commit only

Open vorburger opened this issue 9 years ago • 5 comments

Hello @rtyley , first of all, once again thanks for this amazing tool. Here's feedback on something I'm struggling with - unless I misunderstand, files from protected commits lose their history and show up as if in the last commit only? Apologies if this terminology isn't 100% accurate, here's what I mean:

The use case is purging old unused "big" (mostly binary) files from an originally big (4 GB-ish) repo resulting from a git svn clone import from Subversion. So I do something like: java -jar ../bin/bfg*.jar --private -b 512K . - works great, super fast.

As there are some files >512k on HEAD, and because "BFG assumes that your latest commit is a good one, with none of the dirty files you want removing from your history still in it." (great, tx), I obviously get some:

Scanning packfile for large blobs: 387045
Scanning packfile for large blobs completed in 2,230 ms.
Found 1089 blob ids for large blobs - biggest=653983912 smallest=262726
Total size (unpacked)=5219004150
Found 24785 objects to protect
Found 3 commit-pointing refs : HEAD, refs/heads/master, refs/remotes/git-svn

Protected commits

These are your protected commits, and so their contents will NOT be altered:

  • commit 41c3b0f9 (protected by 'HEAD') - contains 116 dirty files :
    • badaboum (479.7 KB)

What's... "sub-optimal" is that e.g. the badaboum file in the repo now appears to have been (deleted first and then) created in the last commit - its history appears to have been lost! :( I'm sure this is for a good technical reason of the current implementation - but is there any way to fix / improve this, or any advice/trick/workaround you may have? To illustrate:

git show | grep folder/badaboum
diff --git a/folder/badaboum b/folder/badaboum
+++ b/folder/badaboum
diff --git a/folder/badaboum.REMOVED.git-id b/folder/badaboum.REMOVED.git-id
--- a/folder/badaboum.REMOVED.git-id
diff --git a/folder/badaboum_template b/folder/badaboum_template
+++ b/folder/badaboum_template
diff --git a/folder/badaboum_template.REMOVED.git-id b/folder/badaboum_template.REMOVED.git-id
--- a/folder/badaboum_template.REMOVED.git-id

Ideally, I would have hoped that files like badaboum just... stay wherever they are in the history. Possible?

vorburger, Jul 24 '14 16:07

So, to summarise your issue:

  • You're running the BFG to remove big files from your repository.
  • There are some big files in your HEAD commit (ie 'protected') and so the BFG is not removing those files from that commit.
  • However, those files are also in some previous commits. The BFG is removing the files from those older commits, and you'd prefer for it to not do that - you'd like the history of those files to remain intact.

This kind of question has come up before - eg in https://github.com/rtyley/bfg-repo-cleaner/issues/49#issuecomment-47591961 and the answer is a little subtle:

  • Git really doesn't track files, it tracks content. So when the BFG 'protects a file' in your HEAD commit, it's definitely not protecting all versions of that file - to make that happen would be difficult and CPU-intensive, because Git does not model a direct link between the different versions of that file.
  • Beyond that, even if the content of the file never changes, the BFG may remove it from older commits. This is because the BFG actually performs memoization at the level of trees and commits, but not at the level of blobs - and the protection operates at that level too. So you're not protecting files (like y.txt) when you protect a commit - you're protecting folders. If a folder changes in any way (ie a different file in it changes), that is enough to remove the protection from earlier versions of that folder.

I hope that explanation makes sense. It's slightly more nuanced than I wanted to put onto the main documentation page.
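The tree-level point above can be checked in a throwaway repository. This is only a sketch using plain git commands, not BFG itself; the file names folder/big.bin and folder/notes.txt are hypothetical. It shows that an unchanged file keeps the same blob ID across commits, while the folder containing it gets a new tree ID as soon as a sibling file changes, so anything keyed on the tree ID of the protected commit no longer matches the older version of the folder.

```shell
#!/bin/sh
set -eu

# Throwaway repository; folder/big.bin and folder/notes.txt are hypothetical names.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir folder
echo "large binary stand-in" > folder/big.bin
echo "v1" > folder/notes.txt
git add .
git commit -q -m "first"

# Only the sibling file changes in the second commit.
echo "v2" > folder/notes.txt
git add .
git commit -q -m "second"

# Blob ID of big.bin in each commit: the content never changed, so these match.
blob_old=$(git rev-parse HEAD~1:folder/big.bin)
blob_new=$(git rev-parse HEAD:folder/big.bin)

# Tree ID of the containing folder in each commit: these differ, because a sibling changed.
tree_old=$(git rev-parse HEAD~1:folder)
tree_new=$(git rev-parse HEAD:folder)

echo "blob: $blob_old -> $blob_new"
echo "tree: $tree_old -> $tree_new"
```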

rtyley, Jul 24 '14 22:07

I had the same problem as @vorburger and the solution I came up with was that I produced a list of blob ids I wanted to remove (about 10,000 of them in the end) and asked BFG to remove said blobs. This approach worked but I would not actually recommend it as it requires a respectable amount of scripting and manual labour. I have discussed this approach previously on #51.

suniala, Jul 25 '14 05:07

@rtyley tx for your answer, I think I (kind of) "get it" now. @suuntala tx for chiming in, very useful & good to know I'm clearly not the only one hitting this Q; we may consider the option of using --strip-blobs-with-ids instead of --strip-blobs-bigger-than (depending on the effort it would be for us to create the "magic shell/git scripts" to produce such a list CORRECTLY.. hm) - or we'll just accept and live with this during our SVN to Git migration.

vorburger, Jul 25 '14 11:07

I too was misled by the documentation:

"If something questionable - like a 10MB file, when you're telling The BFG to strip out everything over 5MB - is in a protected commit, it won't be removed, and because it's still there, there's no point deleting it from earlier commits either. If you want the BFG to delete something you need to make sure your current commits are clean."

I misread "there's no point" as "there's no point and so it won't do it".

I understand the implementation details may preclude this behavior, but I would have expected a file from the protected tree to be kept in earlier commits.

I understand that can be a little fuzzy. In other words, git log --follow my-file would have the same history after running BFG (except for changed SHA-1s).
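One way to check that expectation is to compare the per-file history before and after a rewrite while ignoring commit IDs. The following is a minimal sketch under stated assumptions: the helper name history_fingerprint is made up for illustration, the fingerprint (author timestamp plus subject) is just one reasonable choice, and a plain clone stands in for the cleaned repository so the script runs without BFG.

```shell
#!/bin/sh
set -eu

# One line per commit that touched the path: author timestamp and subject.
# Commit IDs are left out on purpose, since a rewrite changes them.
history_fingerprint() {
    git -C "$1" log --follow --format='%at %s' -- "$2"
}

work=$(mktemp -d)
git init -q "$work/before"
git -C "$work/before" config user.email "demo@example.com"
git -C "$work/before" config user.name "Demo"
echo v1 > "$work/before/my-file"
git -C "$work/before" add my-file
git -C "$work/before" commit -q -m "add my-file"
echo v2 > "$work/before/my-file"
git -C "$work/before" commit -q -a -m "update my-file"

# A plain clone stands in for the cleaned repository in this demo.
git clone -q "$work/before" "$work/after"

before=$(history_fingerprint "$work/before" my-file)
after=$(history_fingerprint "$work/after" my-file)

if [ "$before" = "$after" ]; then
    echo "history of my-file is unchanged"
else
    echo "history of my-file differs"
fi
```

In real use, the "after" path would point at the repository that the cleaning tool rewrote, and a differing fingerprint would indicate that the file was dropped from some earlier commits.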

pauldraper, Jan 25 '15 05:01

@rtyley, this doesn't exactly match my earlier suggestion, but this is close.

This determines the IDs of large blobs, except for blobs present in HEAD:

(This uses bash and unix utilities. The max size is specified by 1024 * 1024.)

comm -23 \
    <(git rev-list --objects --all | git cat-file  --batch-check="%(objecttype) %(objectname) %(objectsize) %(rest)" | grep ^blob | awk '$3 > 1024 * 1024 { print $2 }' | sort) \
    <(git ls-tree -r HEAD | cut -f 1 | cut -d ' ' -f 3 | sort) \
    > /tmp/large-blobs.list
java -jar bfg-1.12.0.jar -bi /tmp/large-blobs.list

I list all blobs, filter to those more than 1MB, subtract the blobs on HEAD, and output the ids to large-blobs.list. Then I use BFG to remove those blobs.
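To see the list-building steps in action without touching a real repository, here is a scaled-down sketch. Everything specific in it is an assumption for the demo: the file names kept.bin and old.bin, the 1000-byte threshold, and the use of temporary files instead of bash process substitution (so it also runs under a plain POSIX shell). It stops before the BFG step and only shows which blob IDs would end up in the list.

```shell
#!/bin/sh
set -eu
LC_ALL=C; export LC_ALL   # sort and comm must agree on collation

repo=$(mktemp -d)
out=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Two stand-in "large" files; the threshold is scaled down to 1000 bytes for the demo.
head -c 2000 /dev/zero > kept.bin
head -c 3000 /dev/zero > old.bin
git add .
git commit -q -m "add both files"
git rm -q old.bin
git commit -q -m "drop old.bin"

# All blobs anywhere in history that exceed the threshold.
git rev-list --objects --all \
    | git cat-file --batch-check="%(objecttype) %(objectname) %(objectsize) %(rest)" \
    | awk '$1 == "blob" && $3 > 1000 { print $2 }' | sort > "$out/all-large.txt"

# Every blob reachable from HEAD.
git ls-tree -r HEAD | cut -f 1 | cut -d ' ' -f 3 | sort > "$out/head-blobs.txt"

# Large blobs that are not in HEAD.
comm -23 "$out/all-large.txt" "$out/head-blobs.txt" > "$out/large-blobs.list"

old_id=$(git rev-parse HEAD~1:old.bin)
kept_id=$(git rev-parse HEAD:kept.bin)
echo "listed for removal: $(cat "$out/large-blobs.list")"
```

Only the blob of old.bin should be listed; kept.bin is over the threshold too, but it is still in HEAD, so its blob ID is subtracted and its earlier history stays intact.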

pauldraper, Feb 19 '15 06:02