(Force-)pushing rewritten commits to GitHub has no effect

Open kriegaex opened this issue 4 years ago • 11 comments

Hi!

This is maybe more a GitHub than a git-filter-repo question, but maybe you have experience with it. I rewrote the history of some of my already pushed commits in a GitHub fork of another upstream project: I added a forgotten "Signed-off-by: ..." to each of my own commits, because the upstream project does not accept PRs without it. Whatever I try, I cannot get the rewritten commits pushed to my GitHub fork.

I tried to delete the remote branch containing my commits on GitHub. Then I can push the branch again with the very same commits I have modified - I even see the new commit messages when reviewing the commits in my IDE before pushing. But then after pushing, the old commits are still there, probably because GitHub has not pruned the repository and remembers them.

Next, I deleted my fork on GitHub and forked the upstream repository again. But to my surprise, again after I push, there are the commits with the old commit messages. This feels as if GitHub just keeps one big repository per project and adds forks as branches only. If a fresh clone of the upstream project had been created, my pushed commits would have been seen for the first time, but obviously that is not the case.

Actually, other users should have had the same issue before, so I would like to know if you have any valuable insights here. How do I get my commits updated on GitHub in a real fork (not in a copy of the original repository, because I do want to have that convenient link to the original repo, so it is easy to create PRs).

kriegaex avatar Mar 22 '21 06:03 kriegaex

Thanks for no feedback. 😕

After weeks of discussion with GitHub support, trying all sorts of things, I got this reply, finally explaining the problem:

I have partnered with the engineers and gotten some more information. It looks like the functionality being used here is replace refs; that is, the functionality of git replace. At this time, GitHub doesn't support that. Even if Git does support them, we use libgit2 in some places, and it does not.

The revision that is seen on the aj_19-java16 branch is 3d5f1fac, and we're seeing 51e3b353. It's likely that that is the underlying commit that is being replaced, and when they're pushing, they're pushing that commit. In order to have replace refs work at all, you also have to push the replace refs under refs/replace, but these haven't been pushed, and even if they had, that still wouldn't result in the desired behavior since we don't support them.

Moreover, since those aren't typically cloned by default, even if we did support them, most users wouldn't have them downloaded and they therefore wouldn't be used. This is one reason why replace refs are infrequently used in general.

Would you please document this as a caveat? Your own project is hosted here on GitHub, so I am quite surprised that you never tried (and failed) by yourself, investigated the problem and documented it.

kriegaex avatar Apr 06 '21 05:04 kriegaex

I think you're looking for --replace-refs delete-no-add

micimize avatar Apr 06 '21 17:04 micimize

Hi,

Sorry for not getting back to you earlier.

You may want to read https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#DISCUSSION, among the things I cover there:

I tried to point out there a few ways that the fact that you have rewritten your copy doesn't mean you've rewritten everyone else's. If someone else has a clone, or a fork, or a clone of a fork, or... then they've still got a copy of the old history. You have to rewrite all the copies identically, or push the rewritten history somewhere and force everyone to reclone from scratch. I highly recommended the latter, to avoid problems with people combining both old and new history making it look like you had two copies of every commit building on a rewritten commit (and making future cleanups even harder).

I also noted there that many git servers would not allow you to overwrite some of the refs (e.g. refs/changes/, refs/pull/, refs/merge-requests/) which would likely leave you with the old commits still lying around and accessible. It is possible for these hosting sites to provide alternative mechanisms, outside of pushing, to rewrite history. See for example GitLab's documentation on this over here: https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html

The combination of the above, plus GitHub's usage of alternates for forks makes it likely that even if you didn't have any existing PRs or other special refs that GitHub doesn't allow you to overwrite, there's still a chance that the linkage between that repo and all the forks via the alternates means you still might be able to see old commits. Which again highlights the fact that rewriting history is easy (or is nowadays), but synchronizing with everyone else is hard. I can't speak much more to the usage of alternates, though, since it's an implementation detail on GitHub that I don't think is thoroughly documented and is thus more at their control.

Your response from GitHub suggests that you asked them a different question than you asked here, or that they interpreted it much differently somehow. Anyway, yes, it's well known that GitHub and Gerrit and likely other repository managers don't support replace refs, as I mentioned in the docs: "Sadly, some existing git servers (e.g. Gerrit, GitHub) do not yet understand replace refs, and thus one can’t use old commit hashes within their UI; this may change in the future. But replace refs at least help users locally within the git CLI." The other things they mentioned, that replace refs are not pushed and pulled by default, were also documented. ("If you want to use these replace refs, push them to the relevant clone URL and tell users to adjust their fetch refspec (e.g. git config --add remote.origin.fetch "+refs/replace/*:refs/replace/*")")

I'm sorry that synchronizing is such a pain point, and that there are so many facets to it, but that's just one of the factors that makes the manual so big.

newren avatar Apr 06 '21 19:04 newren

@newren I think their response was in answer to "why are my old commits still around even though I overwrote history." As far as I can tell, to completely rewrite history on github:

  1. run with --replace-refs delete-no-add to remove refs instead of turn them into replace refs.
  2. Git's cache will still surface commits. You need to file a ticket with them to clear their cache and backups, etc. if it is sensitive info.
  3. PR refs live in a protected namespace, so GitHub needs to delete them on their side.
  4. finally, consider enforcing the commit sha exclusion.

In situations where a fresh repo is not desirable for whatever reason, the commit exclusion is a good way of forcing people to re-clone or equivalent.

Does that sound like a fairly complete rundown? I did feel a bit uncertain about the delete-no-add behavior resulting in a clean history.

In @kriegaex's case it doesn't seem like 2 and 3 are particularly important.

Related: #235

micimize avatar Apr 07 '21 15:04 micimize

As far as I can tell, to completely rewrite history on github:

  1. run with --replace-refs delete-no-add to remove refs instead of turn them into replace refs.

I'm not sure why you're suggesting this, is it just to make it easier to check locally whether you still have references to old commits? Otherwise, I don't see how it helps or hurts in rewriting history.

  2. Git's cache will still surface commits. You need to file a ticket with them to clear their cache and backups, etc. if it is sensitive info.

Do you mean GitHub's cache here?

  3. PR refs live in a protected namespace, so GitHub needs to delete them on their side.

I decided to link to the GitHub docs for 2 & 3, despite them being somewhat dangerously out-of-date (they recommend using git-filter-branch against filter-branch's own recommendation!). So, I changed this:

"Finally, you’ll need to consult any documentation from your hosting provider about how to remove any server-side references to the old commits (example: GitLab’s docs on reducing repository size)." (which links to https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html)

to this:

"Finally, you'll need to consult any documentation from your hosting provider about how to remove any server-side references to the old commits (example: https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html[GitLab's excellent docs on reducing repository size], or just the warning box that references "GitHub support" from https://docs.github.com/en/github/authenticating-to-github/removing-sensitive-data-from-a-repository[GitHub's otherwise dangerously out-of-date docs on removing sensitive data])."

  4. finally, consider enforcing the commit sha exclusion.

I'm very much in favor of using hooks that ban old commits; I mentioned this in the docs. I'd consider adding a link to this particular project in this part of the docs... """ (Optional)...If you have a central repo, you may want to prevent people from pushing old commit IDs, in order to avoid mixing old and new history. Every repository manager does this differently, some provide specialized commands (e.g. https://gerrit-review.googlesource.com/Documentation/cmd-ban-commit.html), others require you to write hooks. """ ...EXCEPT that this particular hook seems bad to me. It abbreviates the blacklist to 7 characters long, then uses log --pretty=format:%h, which auto-abbreviates to an "appropriate" length. Thus, if there are enough hashes in the repository that 8 characters are necessary for the "appropriate" length, then the hook fails to ban commits anymore. Hashes for a tool like this should NOT be abbreviated. Who knows what other bugs are in this hook; if I found that in a few minute overview, I don't trust this one.

Also, while it's been years since I've used pre-receive hooks with GitHub Enterprise and my knowledge is a bit fuzzy, I think this particular hook cannot be used on GitHub. This hook presumes access to the .git/hooks directory on the server, which GitHub does not provide. While GHE (GitHub Enterprise) does have mechanisms for providing pre-receive hooks, I believe this hook would need to be rewritten to read the banned commits from the global-hooks repository rather than from a sibling file in the .git/hooks directory.

It'd be nicer if GitHub and GitLab provided a built-in way to ban specific commits, much like Gerrit does.
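
To make the point about abbreviation concrete, here is a minimal sketch of a pre-receive hook that compares only full-length hashes. It is not the hook discussed above, and it rests on assumptions: the repository uses SHA-1 object names, the hook runs somewhere you control the server-side hooks, and the ban list lives in a hypothetical file called `banned-commits.txt` with one full hash per line. All function names are invented for illustration.

```python
#!/usr/bin/env python3
"""Sketch: reject pushes that contain commits from a ban list of full hashes."""
import re
import subprocess
import sys

FULL_SHA1 = re.compile(r"^[0-9a-f]{40}$")   # assumes a SHA-1 repository
ZERO = "0" * 40                             # git's "no object" placeholder
BANNED_FILE = "banned-commits.txt"          # hypothetical location of the ban list


def load_banned(lines):
    """Return the set of banned hashes, refusing abbreviated entries."""
    banned = set()
    for line in lines:
        entry = line.strip().lower()
        if not entry or entry.startswith("#"):
            continue
        if not FULL_SHA1.match(entry):
            raise ValueError(f"ban list entry is not a full SHA-1: {entry!r}")
        banned.add(entry)
    return banned


def find_banned(pushed, banned):
    """Return the pushed hashes that appear in the ban list (exact match only)."""
    return [sha for sha in pushed if sha in banned]


def pushed_commits(old, new):
    """List the full hashes a ref update would introduce."""
    if new == ZERO:                      # ref deletion: nothing is added
        return []
    if old == ZERO:                      # new ref: everything not already reachable
        args = [new, "--not", "--all"]
    else:
        args = [f"{old}..{new}"]
    out = subprocess.run(["git", "rev-list", *args],
                         check=True, capture_output=True, text=True).stdout
    return out.split()                   # rev-list prints unabbreviated hashes


def main(stdin=sys.stdin):
    """Read '<old> <new> <ref>' lines as git passes them to pre-receive."""
    with open(BANNED_FILE) as fh:
        banned = load_banned(fh)
    for line in stdin:
        old, new, ref = line.split()
        hits = find_banned(pushed_commits(old, new), banned)
        if hits:
            print(f"rejected {ref}: banned commit {hits[0]}", file=sys.stderr)
            return 1
    return 0
```

A real hook would end with `sys.exit(main())`. Because both the ban list and the `git rev-list` output are unabbreviated, the check does not depend on whatever abbreviation length git considers "appropriate" for the repository.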

In situations where a fresh repo is not desirable for whatever reason, the commit exclusion is a good way of forcing people to re-clone or equivalent.

Does that sound like a fairly complete rundown?

It's a good rundown, but no, it's not complete. You also need to have people who have cloned the repository rewrite their copies identically (a very risky proposition), or (preferably) toss their existing clones and clone afresh. If you have any forks of the repository and information will flow to or from those forks, you also need to rewrite all those forks and any clones of those forks. This is why I harp on the "RECOVERING FROM UPSTREAM REBASE" docs so much from the git-filter-repo docs.

I'm also worried that steps 2 & 3 might be currently accurate, but subject to change over time as GitHub adds additional features, caches, etc. So, it's nicer to link to their docs on the topic in case those change. It does give me a bit of pause that they haven't bothered to update those docs in a few years, but that's still better than me needing to provide instructions for their site and then attempt to keep them up-to-date.

I did feel a bit uncertain about the delete-no-add behavior resulting in a clean history.

Ah, gotcha. delete-no-add would provide one particular way to verify that old hashes are gone from your local rewrite, though I'd prefer just setting the GIT_NO_REPLACE_OBJECTS environment variable and then running whatever git command you want to verify that old commit IDs result in "fatal: Not a valid object name" messages. If you don't want the mapping of old commit IDs to new ones then the delete-no-add suggestion makes sense, but people can rewrite history and get rid of the old commits without nuking the replace refs.
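
The check described above could be scripted roughly as follows. This is only a sketch: the helper names are made up, `git cat-file -e` is just one of many commands that would serve, and it assumes `git` is on the PATH and that you have a list of old commit IDs to test.

```python
import os
import subprocess


def env_ignoring_replace_refs(base=None):
    """Copy an environment and tell git to ignore refs/replace/*."""
    env = dict(os.environ if base is None else base)
    env["GIT_NO_REPLACE_OBJECTS"] = "1"
    return env


def commit_still_present(repo, old_sha):
    """True if old_sha still names a commit once replace refs are ignored."""
    result = subprocess.run(
        ["git", "-C", repo, "cat-file", "-e", f"{old_sha}^{{commit}}"],
        env=env_ignoring_replace_refs(),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,   # hides "fatal: Not a valid object name"
    )
    return result.returncode == 0
```

Calling `commit_still_present(".", old_sha)` for each pre-rewrite hash and expecting `False` every time is one way to confirm the local rewrite no longer contains the old commits, without deleting the replace refs.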

newren avatar Apr 07 '21 18:04 newren

Sorry for responding so late, I was quite consumed with other topics since I have given up using this tool and reverted to the previous backup of my local repository and force-pushed it back to GitHub, so at least the situation is not worse than before.

Firstly, thanks for updating the documentation. 😊

Secondly, I want to admit that I did not spend much time on understanding all the fine details of your answer. The gist of it for me is that as a GitHub user, I should stay away from git-filter-repo for now.

Actually, if there was a way to automate rebasing all my commits since I started working on this project - maybe a few dozen ones - and force-pushing same when done, that would be okay for me. Nobody else has been working on my PR branch except me. Actually, I do not wish to change the commits with regard to content (object, trees) at all, the only thing I need to change are the commit messages (and, of course, the commit hashes need to get updated, as usual when rebasing). But if I start editing commit messages, I have to re-do merges, resolve the same or very similar conflicts again and again for each rewritten commit and risk altering the results of weeks of work. Is there any on-board means in Git or using your tool, permitting me to do that without git replace, thereby automatically telling Git: "Don't alter anything in those commits other than the commit message"?

kriegaex avatar Apr 09 '21 10:04 kriegaex
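
For readers with the same question: git-filter-repo has a `--message-callback` option whose body is Python code that receives the commit message as bytes and returns the new message, leaving trees and file contents untouched. The function below is a hedged sketch of the kind of logic such a callback might carry. The function name and the sign-off identity are placeholders, and it has not been checked against the situation described in this thread.

```python
def add_signoff(message: bytes, signoff: bytes) -> bytes:
    """Append a Signed-off-by trailer unless the message already has it."""
    lines = message.rstrip(b"\n").split(b"\n")
    if signoff in lines:
        return message                       # already signed off: leave untouched
    # Join an existing trailer block directly; otherwise start a new paragraph.
    gap = b"\n" if b"-by: " in lines[-1] else b"\n\n"
    return b"\n".join(lines) + gap + signoff + b"\n"


# Placeholder identity, for illustration only.
SIGNOFF = b"Signed-off-by: Jane Doe <jane@example.com>"
print(add_signoff(b"Fix null check in parser\n", SIGNOFF).decode())
```

Inside a callback the same logic would operate on the `message` variable that git-filter-repo supplies, and `--refs` can limit the rewrite to a range of commits. Restricting it to one author's commits would need a commit-level callback instead, which this sketch does not cover.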

Having used filter-repo to rewrite github repos, I wonder whether the real root cause observed by @kriegaex is rather that PGP signatures are stripped off, which makes the rewritten commits diverge from github repos whose commits carry implicit PGP signatures from github. See #247 and https://github.com/newren/git-filter-repo/issues/139#issuecomment-803239858

gberche-orange avatar Apr 29 '21 09:04 gberche-orange

@gberche-orange, in my case no signed commits were involved. I have detailed e-mails by GitHub support, explaining that the git replace functionality is simply unsupported in a Git library GH is using. Even if I did push all the refs, it would not work as expected according to their development department.

kriegaex avatar Apr 29 '21 11:04 kriegaex

Thanks for your response, @kriegaex.

Rereading the original issue description (reproduced below), I may have misunderstood your problem: I believed your updated PR with the rewritten commits did not include the required "Signed-off-by: ..." header.

Next, I deleted my fork on GitHub and forked the upstream repository again. But to my surprise, again after I push, there are the commits with the old commit messages. This feels as if GitHub just keeps one big repository per project and adds forks as branches only.

I have experienced a similar issue after rewriting my github repo. My problem was linked to the "Allowing changes to a pull request branch created from a fork" feature, which fetches the fork commits onto the target repo in the refs/pull/ namespace.

If your PRs happened to have this option, then when you clone the original project and fetch these references, your local clone also contains original github commits (before being rewritten by git filter-repo).

If your PRs are still accessible, would you mind sharing their URLs to allow for such verification?

gberche-orange avatar Apr 29 '21 11:04 gberche-orange

Firstly, thanks for your interest. You seem to know a lot about this tool. :-) I wish the active support would have come at the time I was feeling blocked both here and by GitHub support, anxiously waiting for my issue to be resolved. This burnt a lot of my time and now I am kinda swamped. I have moved on from that situation, having force-pushed my pre-rewrite commits again and gotten Eclipse to accept them without the stupid, useless "Signed-off-by", which is really nothing but an additional line in the commit comment, no digital signature thing, and as such does not prove anything, only repeating the committer information. By the end of last month, Eclipse also finally after many years changed their acceptance policy. Committer info matching the person who signed the ECA is now enough. So my technical problem was solved in an organisational way. Sorry to bore you with off-topic details here, I am just explaining which itch I tried to scratch and how I managed to. Maybe in the future I shall give git-filter-repo another try, if there is a comprehensive guide how to do that with GitHub repos and a clear explanation of preconditions and boundaries.

kriegaex avatar Apr 29 '21 11:04 kriegaex

I'm commenting/watching here in case additional action happens on this issue. I'm not a git or github super-user, and I was investigating using git-filter-repo for removing large binaries from our repo history. Sadly, our private repo is hosted at Github.com and even though I read the docs multiple times, it wasn't clear enough that I can't entirely rewrite the repo history, due to essentially Github.com limitations. (If I'm reading this right.) I'd love it, if there was a way to do this more easily (github builds this tool into a protected/non-publicized location online? special tool to upload a whole .git folder?).

AccuPhoenix01 avatar Jan 23 '23 19:01 AccuPhoenix01