git-filter-repo

Add support for history rewriting of submodule

Open tehKaiN opened this issue 4 years ago • 5 comments

Hi,

I have the following repo structure. The problem appears while migrating from Bitbucket (BB) to GitHub (GH), or vice versa:

main_repo/
├─ submodule A/
   ├─ submodule B/

I've done the following:

  • pushed submodule B's repo from BB to GH
  • rewrote the history of submodule A's repo so that .gitmodules points to the GH URL of submodule B, using git filter-repo --replace-text
  • pushed submodule A's repo from BB to GH
  • at this point, submodule A's history has been rewritten, so its commit hashes have changed
  • rewrote the history of the main repo so that .gitmodules points to the GH URL of submodule A, using git filter-repo --replace-text
  • unfortunately, the main repo's commits still reference commit hashes from submodule A's old history

I've dug a bit and stumbled upon this: https://stackoverflow.com/questions/44430124/ . The relevant quote is below:

Well, the mechanics are essentially another filter-branch: read every commit, check its submodule references, check whether they're mapped, if so apply mapping, and write new commit from updated index. Each of these parts is hard on its own but two are solved by the existing filter-branch code: all you need are the three in the middle ("check submodule references, remap them if mapped" if we boil them down to two steps). Just code those up and the problem is solved.
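
The middle steps from that quote can be sketched in plain Python, independent of any tool. This is only an illustration under assumptions: `remap_submodule_refs`, the `(path, commit_hash)` pair layout, and the sample hashes are hypothetical and not part of filter-branch or git-filter-repo.

```python
def remap_submodule_refs(entries, commit_map):
    """Return (remapped entries, hashes that had no mapping).

    entries: list of (path, commit_hash) pairs, one per submodule
             reference in a commit (hypothetical layout).
    commit_map: dict of old submodule commit hash -> rewritten hash.
    """
    remapped, unmapped = [], []
    for path, old_hash in entries:
        new_hash = commit_map.get(old_hash)
        if new_hash is None:
            # Not mapped: keep the reference unchanged and report it.
            unmapped.append(old_hash)
            new_hash = old_hash
        remapped.append((path, new_hash))
    return remapped, unmapped


# Invented sample data: one mapped reference, one unmapped.
entries = [("submodule_A", "a" * 40), ("libs/other", "c" * 40)]
commit_map = {"a" * 40: "b" * 40}
remapped, unmapped = remap_submodule_refs(entries, commit_map)
print(remapped)
print(unmapped)
```

Reading the commits and writing the updated ones back are the parts the quote says the existing tooling already covers, so they are left out here.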

I see that git filter-repo generates ref/commit maps, so now I'm off to read up on how git works and how to make the needed changes. My questions are:

  • if I hack the required functionality into git-filter-repo, would you be interested in a pull request?
  • perhaps you can give me some further hints on how to add said functionality?

tehKaiN, Jun 15 '21 09:06

I've tried doing this with the callback mechanism. I'm able to find all commits that modify the submodule:

git filter-repo --commit-callback "
for change in commit.file_changes:
  if change.filename == b'repoA':
    print('commit with submodule change:')
    for c in commit.file_changes:
      print(c.filename)
" --dry-run

I can also list all the blobs of those changes:

git filter-repo --commit-callback "
for change in commit.file_changes:
  if change.filename == b'repoA':
    print(change.blob_id)
" --dry-run

But when I try to find a blob with that ID in order to rewrite its contents, there's nothing:

git filter-repo --blob-callback "
if blob.original_id == b'6b13e02bd56254afec784dd1fcf5291b632af3de':
  print(blob.data)
" --dry-run

The result is empty. I've read that submodules are stored not as blobs but as references to commit objects, yet I can't really see them in the list generated by the first code snippet. Any ideas on how to walk through them?
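
One way to see why the blob callback finds nothing: in a tree, a submodule is recorded as an entry with mode 160000 and type commit (a gitlink), and the hash it stores names a commit in the submodule's repository, not a blob in this one. The sketch below picks such entries out of `git ls-tree` output. The function name and the sample lines are made up for illustration; only the repoA hash is taken from the snippet above.

```python
def find_gitlinks(ls_tree_output):
    """Return {path: commit_hash} for submodule entries in `git ls-tree` output.

    Each line has the form '<mode> <type> <hash><TAB><path>'.
    """
    gitlinks = {}
    for line in ls_tree_output.splitlines():
        if not line.strip():
            continue
        meta, path = line.split("\t", 1)
        mode, obj_type, obj_hash = meta.split()
        # Mode 160000 with type 'commit' marks a submodule (gitlink) entry.
        if mode == "160000" and obj_type == "commit":
            gitlinks[path] = obj_hash
    return gitlinks


# Sample listing; the blob and tree hashes are invented placeholders.
sample = (
    "100644 blob 1111111111111111111111111111111111111111\t.gitmodules\n"
    "160000 commit 6b13e02bd56254afec784dd1fcf5291b632af3de\trepoA\n"
    "040000 tree 2222222222222222222222222222222222222222\tsrc\n"
)
gitlinks = find_gitlinks(sample)
print(gitlinks)
```

Inside a commit callback, the corresponding check would be on the change's mode (b'160000') rather than on parsed text.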

tehKaiN, Jun 15 '21 11:06

After some digging and a group effort at my workplace, we've come up with a solution:

  • rewrite the submodule's history and keep its ./.git/filter-repo/commit-map file, which contains the old-to-new commit hash pairs generated by the last filter-repo run
  • run the following in the main repo, adjusting the path to a copy of that commit-map file (here ../commit-map-submodule.txt) and the submodule path (here deps/midhal):
git filter-repo --commit-callback "
for change in commit.file_changes:
  # For a submodule entry, blob_id holds the submodule commit hash.
  if change.filename == b'deps/midhal' and change.type == b'M':
    # Load the old -> new hash pairs saved from the submodule rewrite.
    with open('../commit-map-submodule.txt', 'r') as map_file:
      map_lines = map_file.readlines()

    found = False
    for map_entry in map_lines:
      map_entry = map_entry.split()
      if len(map_entry) < 2:
        continue  # skip blank or malformed lines
      old_hash = map_entry[0].encode('ascii')
      new_hash = map_entry[1].encode('ascii')
      if change.blob_id == old_hash:
        change.blob_id = new_hash
        found = True
        print('Replaced hashes: {} -> {}'.format(old_hash, new_hash))
        break
    if not found:
      # Someone force-pushed the submodule after committing to the main repo?
      print('WARN: Can\'t find replacement for hash {} in commit {}'.format(change.blob_id, commit.message))
"

This needs a bit of manual work, and I don't really see how to automate it, so I'll leave it here as is in case anyone else is interested in implementing it directly in the tool. Hopefully this script won't break too soon for all those poor souls who stumble across this problem in the future. ;)
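
One possible refinement is to parse the map once into a dictionary instead of rescanning the file for every matching change. The sketch below assumes each useful line holds two whitespace-separated 40-character hex hashes and that anything else (a header, a blank line) can be skipped. `load_commit_map` and `is_hash` are hypothetical helpers, not part of git-filter-repo.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def is_hash(text):
    """True for a 40-character hexadecimal string (a SHA-1 object name)."""
    return len(text) == 40 and all(ch in HEX_DIGITS for ch in text)


def load_commit_map(lines):
    """Build {old_hash: new_hash} with bytes keys/values, skipping non-hash lines."""
    commit_map = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and is_hash(fields[0]) and is_hash(fields[1]):
            commit_map[fields[0].encode("ascii")] = fields[1].encode("ascii")
    return commit_map


# Invented sample lines: a header, one mapping, and a blank line.
sample_lines = [
    "old new\n",
    "a" * 40 + " " + "b" * 40 + "\n",
    "\n",
]
commit_map = load_commit_map(sample_lines)
print(commit_map.get(b"a" * 40))
```

With a real file, the open file object can be passed directly as `lines`. The hashes are stored as bytes so they can be compared with change.blob_id, which the script above also treats as bytes.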

tehKaiN, Jun 16 '21 08:06

Thank you very much for the solution! :) I've adapted it for our project and it appears to work quite nicely on the local copy of the repo.

However, there is still one thing to take into consideration: a submodule commit can be deleted entirely as a result of filtering. In that case, the commit SHA is replaced with the null SHA (i.e. 0000...000). That's not a problem until you try to push the repo to GitHub. In my case, the push fails with an HTTP error, likely because of the null SHA.
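
A small check can flag these cases before pushing. The sketch below only detects all-zero targets in an already loaded old-to-new map. What to substitute for a deleted submodule commit is a project decision and is not handled here. `find_pruned` and the sample hashes are hypothetical.

```python
NULL_SHA = b"0" * 40


def find_pruned(commit_map):
    """Return the old hashes whose rewritten hash is the null SHA."""
    return sorted(old for old, new in commit_map.items() if new == NULL_SHA)


# Invented sample map: one surviving commit, one removed by the filter.
commit_map = {
    b"a" * 40: b"b" * 40,
    b"c" * 40: NULL_SHA,
}
pruned = find_pruned(commit_map)
print(pruned)
```

Any hash reported here would need a manually chosen replacement before the main repo is rewritten.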

ZEB1CLJ, May 18 '24 11:05