Add support for history rewriting of submodule
Hi,
I have the following repo structure. The problem appears while migrating from Bitbucket to GitHub (or vice versa, whatever):
main_repo/
├─ submodule A/
├─ submodule B/
I've done the following:
- pushed repo of submodule B from BB to GH
- rewritten history of submodule A's repo to have GH link of submodule B in .gitmodules using
  git filter-repo --replace-text
- pushed repo of submodule A from BB to GH
- at this point history of submodule A has been rewritten
- rewritten history of main repo to have GH link of submodule A in .gitmodules using
  git filter-repo --replace-text
- unfortunately, the main repo commits still use hashes from old submodule A history.
I've dug a bit and stumbled upon this: https://stackoverflow.com/questions/44430124/ . The relevant quote is below:
Well, the mechanics are essentially another filter-branch: read every commit, check its submodule references, check whether they're mapped, if so apply mapping, and write new commit from updated index. Each of these parts is hard on its own but two are solved by the existing filter-branch code: all you need are the three in the middle ("check submodule references, remap them if mapped" if we boil them down to two steps). Just code those up and the problem is solved.
I see that git filter-repo generates ref/commit maps, so now I'm off to read up on how git works and how to make the needed changes. My questions are:
- if I hack the required functionality into git-filter-repo, would you be interested in a pull request?
- perhaps you can give me some further hints on how to add said functionality?
I've tried doing the required stuff by using the callback mechanism - I'm able to find all commits which modify the submodule:
git filter-repo --commit-callback "
for change in commit.file_changes:
    if change.filename == b'repoA':
        print('commit with submodule change:')
        for c in commit.file_changes:
            print(c.filename)
" --dry-run
I can also list all the blobs of those changes:
git filter-repo --commit-callback "
for change in commit.file_changes:
    if change.filename == b'repoA':
        print(change.blob_id)
" --dry-run
but when I try to find blobs with such id to rewrite their contents, there's nothing:
$ git filter-repo --blob-callback "
if blob.original_id == b'6b13e02bd56254afec784dd1fcf5291b632af3de':
    print(blob.data)
" --dry-run
The result is empty. I've read that submodules are not stored as blobs but as references to commit objects, yet I can't really see them in the list generated by the first code snippet. Any ideas on how to walk through them?
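For anyone puzzled by the same thing: a submodule appears in a tree as a "gitlink" entry with mode 160000 and type commit, so its hash never passes through the blob callback. Below is a minimal sketch, assuming the input is the text output of `git ls-tree -r <rev>`. The function name `find_gitlinks` is hypothetical, and the sample input is made up for illustration (it reuses the hash and the `repoA` path from the snippets above).

```python
def find_gitlinks(ls_tree_output):
    """Return (path, commit_hash) pairs for submodule (gitlink) entries.

    Expects lines in the `git ls-tree` format:
    <mode> SP <type> SP <object> TAB <path>
    """
    gitlinks = []
    for line in ls_tree_output.splitlines():
        if not line.strip():
            continue
        meta, path = line.split('\t', 1)
        mode, obj_type, obj_hash = meta.split(' ')
        # Mode 160000 marks a gitlink: the tree entry points at a commit
        # in the submodule's repository, not at a blob in this one.
        if mode == '160000' and obj_type == 'commit':
            gitlinks.append((path, obj_hash))
    return gitlinks


# Illustrative input only; not taken from a real repository.
sample = (
    "100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\t.gitmodules\n"
    "160000 commit 6b13e02bd56254afec784dd1fcf5291b632af3de\trepoA\n"
)
print(find_gitlinks(sample))
```

Inside a filter-repo commit callback, the equivalent check should be `change.mode == b'160000'`, which avoids hard-coding the submodule path.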
After some digging and group effort in my workplace, we've come up with the solution:
- do the history rewrite on the submodule and keep ./.git/filter-repo/commit-map, as it contains the old-to-new commit hashes generated by the last filter-repo operation
- execute the following in the main repo, with the proper path to that commit-map file:
git filter-repo --commit-callback "
for change in commit.file_changes:
    if change.filename == b'deps/midhal':
        # Each commit-map line has the form '<old hash> <new hash>'.
        with open('../commit-map-submodule.txt', 'r') as map_file:
            map_lines = map_file.readlines()
        found = False
        for map_entry in map_lines:
            map_entry = map_entry.rstrip().split(' ')
            map_entry[0] = map_entry[0].encode('ascii')
            map_entry[1] = map_entry[1].encode('ascii')
            if change.blob_id == map_entry[0]:
                change.blob_id = map_entry[1]
                found = True
                print('Replaced hashes: {} -> {}'.format(map_entry[0], map_entry[1]))
                break
        if not found:
            # Someone force-pushed a commit after committing the main repo?
            print('WARN: Can\'t find replacement for hash {} in commit {}'.format(change.blob_id, commit.message))
"
This needs a bit of manual work, and I don't really see how to automate it, so I'll leave it here as is in case anyone else is interested in implementing it directly in the tool. Hopefully this script won't break too early for all those poor souls who stumble across this problem in the future. ;)
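The lookup in the script above re-reads the map file for every matching change. A possible variation is to parse the commit-map once into a dictionary and do a direct lookup. This is only a sketch: the helper names `load_commit_map` and `remap_gitlink` are hypothetical, the hashes in the demonstration are made up, and it assumes the map file starts with an "old new" header line followed by "<old hash> <new hash>" pairs.

```python
import os
import tempfile


def load_commit_map(path):
    """Parse a commit-map file into {old_hash: new_hash}, with bytes keys/values."""
    mapping = {}
    with open(path, 'rb') as map_file:
        for line in map_file:
            parts = line.split()
            # Skip blank lines and the assumed "old new" header line.
            if len(parts) != 2 or parts[0] == b'old':
                continue
            mapping[parts[0]] = parts[1]
    return mapping


def remap_gitlink(blob_id, mapping):
    """Return the rewritten hash, or the original one if it is not in the map."""
    return mapping.get(blob_id, blob_id)


# Tiny demonstration with made-up hashes written to a temporary file.
old_hash, new_hash = b'a' * 40, b'b' * 40
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'old' + b' ' * 38 + b'new\n')
    tmp.write(old_hash + b' ' + new_hash + b'\n')
demo_map = load_commit_map(tmp.name)
os.remove(tmp.name)
print(remap_gitlink(old_hash, demo_map) == new_hash)
```

The dictionary keeps hashes as bytes so they can be compared with `change.blob_id` directly, without the per-line `encode` calls.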
Thank you very much for the solution! :) I've adapted it for our project and it appears to work quite nicely on the local copy of the repo.
However, there is still one thing to take into consideration: a submodule commit can be deleted completely as a result of filtering. In that case, the commit SHA will be replaced with the null SHA (i.e. 0000...000). That's no problem until you try to upload the repo to GitHub. In my case, the push fails with an HTTP error, likely because of the null SHA.
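One way to catch this before pushing might be to check which referenced submodule hashes map to the null SHA. The sketch below only detects such cases; what to substitute for a deleted commit is left open. The function name `find_deleted_targets` is hypothetical, and the mapping and hashes are made-up illustration data in the {old: new} shape of the commit-map.

```python
NULL_SHA = b'0' * 40


def find_deleted_targets(referenced_hashes, mapping):
    """Return referenced submodule hashes whose commit was filtered away.

    `mapping` is {old_hash: new_hash}; a deleted commit maps to NULL_SHA.
    Hashes that are absent from the mapping are not reported here.
    """
    return [old for old in referenced_hashes if mapping.get(old) == NULL_SHA]


# Made-up data: one rewritten commit, one deleted commit, one unknown hash.
mapping = {b'a' * 40: b'b' * 40, b'c' * 40: NULL_SHA}
referenced = [b'a' * 40, b'c' * 40, b'd' * 40]
print(find_deleted_targets(referenced, mapping))
```

Running a check like this against the hashes the main repo actually references would at least show how many commits need manual attention before the push.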