Can't rewrite commits with a malformed author line
TL;DR: git-filter-repo does not recover if the author line is malformed (here, ␣<Thomas>; the open box up front is really a space, shown that way for clarity). This is because git-filter-repo unconditionally regex-parses the author line. Full details, including a reproducer, below.
https://github.com/dkogan/emacs-snapshot does not pass git fsck, and I am looking into how to fix it. I discovered this because my gitconfig runs fsck on every transfer by default:
[fetch]
fsckObjects = true
[transfer]
fsckObjects = true
[receive]
fsckObjects = true
resulting in:
$ git clone https://github.com/dkogan/emacs-snapshot
Cloning into 'emacs-snapshot'...
remote: Enumerating objects: 1421351, done.
remote: Counting objects: 100% (223/223), done.
remote: Compressing objects: 100% (109/109), done.
error: object 676f341671bedf6007a029bb2f3c472ebe308603: missingNameBeforeEmail: invalid author/committer line - missing space before email
fatal: fsck error in packed object
fatal: index-pack failed
After cloning it in a sandbox with auto-fsck disabled, I fsck'd the entire repo and got the following result:
$ git fsck
Checking object directories: 100% (256/256), done.
error in commit 676f341671bedf6007a029bb2f3c472ebe308603: missingNameBeforeEmail: invalid author/committer line - missing space before email
error in commit 7fb42a25e934250edf3ed3c0ef16be2a16c3e3a3: missingNameBeforeEmail: invalid author/committer line - missing space before email
error in commit b563928ca802b46a1facff442d91c6e922ee8e1d: missingNameBeforeEmail: invalid author/committer line - missing space before email
error in commit dd17bafcae829d6420d099e169bbbcec2cdd6cf6: missingNameBeforeEmail: invalid author/committer line - missing space before email
Checking objects: 100% (1421351/1421351), done.
Checking connectivity: 1421351, done.
These are all commits by Thomas Bushnell from 1996/1997! The problem was fixed in the upstream repo: the same commits exist there, but with a corrected author line:
$ git log --since "Nov 21 1996" --until "Nov 22 1996"
commit aab72f0761de41b0c851c2fae08a10bcc4dc19e2
Author: Thomas Bushnell, BSG <[email protected]>
Date: Thu Nov 21 22:20:09 1996 +0000
Revert last change.
commit 416f76359368444a1169d9bc66d2c65cbf9034b4
Author: Thomas Bushnell, BSG <[email protected]>
Date: Thu Nov 21 21:51:00 1996 +0000
* config.sub: Recognize gnu-gnu* along with linux-gnu* as a valid
kernel-os combination. Remove `-gnu*' from the portable systems
list. Add `-gnu-gnu*'. Add new rule for `-gnu*' to turn it into
two part name.
commit 31f9169164c9a336520e9b4669cbd6ee17cfdb64
Author: Thomas Bushnell, BSG <[email protected]>
Date: Thu Nov 21 21:43:48 1996 +0000
Thu Nov 21 16:42:41 1996 Thomas Bushnell, n/BSG <[email protected]>
* config.guess [UNAME_SYSTEM == GNU]: Use a four-part
configuration name for gnu so it can be distinguished from
foo-foo-linux-gnu with simple globbing patterns.
However, the bad commit is not an ancestor of HEAD in emacs-snapshot:
emacs-snapshot $ git rev-list HEAD | grep 7fb42a25e934250edf3ed3c0ef16be2a16c3e3a3
... but I could find it thusly:
git log --all 7fb42a25e934250edf3ed3c0ef16be2a16c3e3a3..
... which told me that the two histories are independent, so I went looking for the ref that keeps the bad commit reachable:
git log -1 --oneline --all --ancestry-path 7fb42a25e934250edf3ed3c0ef16be2a16c3e3a3..
and found:
b5e2b5a4d51 (origin/master_before_upstream_git_transition) new snapshot
I tried to use git filter-repo to fix this:
git filter-repo --use-mailmap --force --refs origin/master_before_upstream_git_transition
Parsed 15573 commits
Traceback (most recent call last):
File "/home/lvh/.pyenv/versions/3.8.0/bin/git-filter-repo", line 3956, in <module>
filter.run()
File "/home/lvh/.pyenv/versions/3.8.0/bin/git-filter-repo", line 3892, in run
self._parser.run(self._input, self._output)
File "/home/lvh/.pyenv/versions/3.8.0/bin/git-filter-repo", line 1409, in run
self._parse_commit()
File "/home/lvh/.pyenv/versions/3.8.0/bin/git-filter-repo", line 1194, in _parse_commit
(author_name, author_email, author_date) = self._parse_user(b'author')
File "/home/lvh/.pyenv/versions/3.8.0/bin/git-filter-repo", line 1077, in _parse_user
(name, email, when) = user_regex.match(self._currentline).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
fatal: stream ends early
fast-import: dumping crash report to .git/fast_import_crash_1665513
I tried with explicit Python code but got the same problem:
git filter-repo --commit-callback 'if commit.author_email == "Thomas": print(commit)' --force --refs origin/master_before_upstream_git_transition
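Judging by the traceback, the author line is parsed before any callback gets a chance to run, which would explain why the callback variant fails the same way. Here is a minimal sketch of the failure and of one possible tolerant fallback. The strict pattern is an assumption modelled on the traceback (the real `user_regex` in git-filter-repo may differ), and `LENIENT_USER_RE`, `parse_user`, the example.org address, and the timestamps are hypothetical, for illustration only:

```python
import re

# ASSUMPTION: approximates the pattern behind user_regex in the traceback;
# the actual regex in git-filter-repo may differ.
STRICT_USER_RE = re.compile(rb'author (.*?) <(.*?)> (.*)\n$')

# Hypothetical tolerant variant: the name and the space after it are optional.
LENIENT_USER_RE = re.compile(rb'author (?:(.*?) )?<(.*?)> (.*)\n$')


def parse_user(line):
    """Return (name, email, when), falling back to the lenient pattern."""
    match = STRICT_USER_RE.match(line) or LENIENT_USER_RE.match(line)
    if match is None:
        raise ValueError('unparseable author line: %r' % line)
    name, email, when = match.groups()
    # The lenient pattern leaves the name group unset when no name is present.
    return (name or b'', email, when)


# Illustrative inputs (made-up address and timestamps).
good = b'author Thomas Bushnell <tb@example.org> 848614809 +0000\n'
bad = b'author <Thomas> 848614809 +0000\n'

# The strict pattern does not match the malformed line, so calling
# .groups() on its result would raise the AttributeError shown above.
strict_result = STRICT_USER_RE.match(bad)
lenient_result = parse_user(bad)
```

With a fallback along these lines, the malformed line would come through with an empty name and `Thomas` as the email, which a mailmap or a commit callback could then correct.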
I apologize for the long delay. I suspect you've long since found a workaround, but just in case it helps...
You could create replace objects for the relevant commits using git replace --edit $HASH for each commit in question, and then use git filter-repo --force to make the replacements permanent.
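If editing each commit by hand is inconvenient, a non-interactive variant of the same idea might look like the sketch below. It swaps `git replace --edit` for `git cat-file`, `git hash-object -w`, and a plain `git replace`; the helper names `fix_ident_lines` and `replace_commit` are hypothetical, and this has not been run against the repository in question:

```python
import subprocess


def fix_ident_lines(raw_commit, fallback_name):
    """Insert a name before '<' on author/committer headers that lack one."""
    # Only the header block (everything before the first blank line) is touched.
    header, sep, body = raw_commit.partition(b"\n\n")
    fixed = []
    for line in header.split(b"\n"):
        for field in (b"author ", b"committer "):
            if line.startswith(field + b"<"):
                line = field + fallback_name + b" " + line[len(field):]
        fixed.append(line)
    return b"\n".join(fixed) + sep + body


def replace_commit(commit_hash, fallback_name):
    """Write a corrected copy of the commit and register it as a replacement."""
    raw = subprocess.run(
        ["git", "cat-file", "commit", commit_hash],
        check=True, capture_output=True,
    ).stdout
    new_hash = subprocess.run(
        ["git", "hash-object", "-t", "commit", "-w", "--stdin"],
        input=fix_ident_lines(raw, fallback_name),
        check=True, capture_output=True,
    ).stdout.decode().strip()
    subprocess.run(["git", "replace", commit_hash, new_hash], check=True)
```

Calling `replace_commit(h, b"Thomas Bushnell, BSG")` for each hash reported by git fsck, and then running git filter-repo --force as described above, should make the replacements permanent.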
My only objection is that you felt the need to apologize :sweat_smile: I figured a real world use case for why you might want to avoid unconditionally parsing that entry might be valuable. My actual problem is already resolved. Thank you though!