git-lfs icon indicating copy to clipboard operation
git-lfs copied to clipboard

Bogus "Files... that should have been pointers" after migration

Open Code7R opened this issue 3 years ago • 5 comments

Repro:

use the latest git-bash (2.36.1.windows.1) which includes git-lfs/3.1.4.

Migrate an existing huge repository (hundreds of branches) following the guide from https://docs.gitlab.com/ee/topics/git/lfs/migrate_to_git_lfs.html with the only difference of not using BFG but "git lfs migrate import --everything --above=1Mb".

Result: it seems to be okay on the migrating machine, and when pushed, the LFS object took long time, as expected.

But: tried to clone from another machine seems to be okay, and LFS is populating the .git/lfs contents with files needed for master branch... apparently. But then it suddenly reports "Encountered ... files that should have been pointers". But why? I am not aware of any wrong doing and there were no warnings in the import step.

I switched between different branches and almost always got this. Interestingly, when inspecting that files with gitk, sometimes I see real contents and sometimes there are only pointer contents.

I am pretty sure the same operation worked just fine a few weeks ago with some previous version of git-bash. However, I had to upgrade because it has shown a really dreadful performance when importing a large repo (it was the version git-lvs v3.0.2 inside, the performance bug I have seen in some other ticket here). I cannot see a clear pattern.

I have a vague impression that this comes when switching from different branches with different versions of the .gitattributes files (since some of the size changes triggered the conversion and some did not). However, the basic expectation is that it's some kind of git integration bug so that it doesn't cleanup the file state when switching between branches. And it also doesn't explain why even the very first checkout while cloning has shown this symptoms.

Code7R avatar May 23 '22 10:05 Code7R

Hey,

My guess as to what's going on here is that on some branches, you have some files that are in LFS, and on other branches, they're not. That's because on one side, they were larger than 1 MB, and on another, they weren't. Then, when something got merged, some files that were LFS files on one side didn't get run through the filter as part of the merge, and they remained as non-LFS files.

We'd need to see the .gitattributes and whether each of those files is an LFS file (e.g., the output of git lfs ls-files) in order to know for certain. The easiest way to fix this is to do git add --renormalize . and then commit in your main branch on an otherwise clean working tree, and then proceed to merge the main branch into each other branch (or rebase the others on top of it).

That's one of the reasons why --above=1Mb isn't a great idea: it's easy for files to be just above or below that. Ideally, you pick a threshold where files don't cross that line at all in your repository, avoiding this problem altogether.

bk2204 avatar May 24 '22 17:05 bk2204

Hi,

well, from naive user's perspective, nothing is simple here. Imagine, "git reset --hard" is not working. "git checkout --force ..." is not working. "git stash -u" is not working. None of the usual tools to recover can help you here, all you get is this "Encountered... should have been pointers" message and nothing giving any idea WHY. And doing "git add --renormalize . && git commit" is nothing you would normally consider because $USER knows that it has all been commited before.

Until I realized what is going on. Basically, what git-lfs-migrate is doing seems to be wrong. For files which sometimes get below and sometimes above the threshold, the file gets tracked and untracked but that change is apparently NOT reflected in .gitattributes. So, if I understand the git-lfs mechanics correctly, then git is rightfully getting confused and the migration has done wrong things. At least this is how it looked to us!

So, after discussing with developers, the obvious solution would be creating a .gitattributes file with wildcards covering all our BLOB locations (so this back-and-forth tracking/untracking does not happen). And sinking this version into the begin of rewritten history. It also seemed like something similar could be done with the --fixup option of git-lfs-migrate. The way it's documentation reads, git-lfs-migrate would consider "any existing .gitattributes file(s)" which probably means creating a superset of all versions found in all branches and use this "maximum version" from the start. This way or another, both variants (manual wildcards or superset) should do the job.

But not so easy. I tried another migration run and the end result is: NO. It does NOT take contents from all versions. At the first-glance, the only improvement difference is that entries are never removed and the .gitattributes file keeps getting extended, okay. But if you still have concurrent changes in different branches, they are still not synchronized.

Now I am looking how we could incorporate this .gitattributes file using git-filter-repo, it seems to have something similar in the feature set. However it's not clear whether it runs git-lfs hooks during processing, yet.

Code7R avatar May 30 '22 15:05 Code7R

Yes, it's known that the normal Git commands can't recover from that; that's a limitation in Git that we can't avoid. It is generally better to match general patterns whenever possible because that leads to more deterministic behaviour, as you've noticed.

If you can provide a script that reproduces this problem, we're happy to take a look at what we can do to fix it or improve the situation.

bk2204 avatar May 30 '22 15:05 bk2204

I cannot provide much more because of confidentiality reasons. I can only comment on what was found as feasible workaround: a) using git-filter repo to wipe .gitattributes from the history, b) then replace it at the root commit (there is a script called insert-beginning.py comit with git-filter-repo), c) during that last git-filter-repo run there are some git-lfs warnings shown, similar to the one above, but those seem to be harmless, d) run "git-lfs-migrate import --fixup --everything" on the repo.

On top, extra scriptology is needed to handle replace refs, so that the mappings from each git-filter-repo and git-lfs-migrate runs are collected and then a set of replace-refs is calculated which points from original commits to the final ones. This worked quite good in the end, just PITA to implement.

So it eventually works, just the user experience is not that great.

Code7R avatar Jul 12 '22 10:07 Code7R

Basically, what git-lfs-migrate is doing seems to be wrong. For files which sometimes get below and sometimes above the threshold, the file gets tracked and untracked but that change is apparently NOT reflected in .gitattributes.

The question of files "bouncing" above and below the size limit given with the --above option but not being untracked in .gitattributes when they drop below the limit came up in #5134 and so I thought I would copy the reproduction case I created in answer to that question here as well, for reference:

$ git init above
$ cd above
$ echo 1234567890 > foo.bin
$ git add foo.bin
$ git commit -m ten
$ echo 12345678901234567890 > foo.bin 
$ git add foo.bin
$ git commit -m twenty
$ git show HEAD^:foo.bin > foo.bin
$ git add foo.bin
$ git commit -m 'ten again'

$ git lfs migrate import --everything --above=15b
$ cat .gitattributes
/foo.bin filter=lfs diff=lfs merge=lfs -text
$ git lfs fsck
pointer: unexpectedGitObject: "foo.bin" (treeish 1f01798220ed10885e5b6622f2f6384607ea79bd) should have been a pointer but was not

$ git blame .gitattributes
ce1543be (Chris Darroch 2022-10-04 20:23:37 -0700 1) /foo.bin filter=lfs diff=lfs merge=lfs -text

$ git show --oneline ce1543be 
ce1543b twenty
diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..daca9d7
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1 @@
+/foo.bin filter=lfs diff=lfs merge=lfs -text
diff --git a/foo.bin b/foo.bin
index a32a434..677de18 100644
--- a/foo.bin
+++ b/foo.bin
@@ -1 +1,3 @@
-1234567890
+version https://git-lfs.github.com/spec/v1
+oid sha256:8a6a84214f1518875819761ffb53d33ec19e66a089213b906720f6825715e8b3
+size 21

$ git show --oneline HEAD
1f01798 (HEAD -> main) ten again
diff --git a/foo.bin b/foo.bin
index 677de18..a32a434 100644
--- a/foo.bin
+++ b/foo.bin
@@ -1,3 +1 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:8a6a84214f1518875819761ffb53d33ec19e66a089213b906720f6825715e8b3
-size 21
+1234567890

chrisd8088 avatar Oct 05 '22 03:10 chrisd8088