Can't rewrite commits with a malformed author line

Open lvh opened this issue 4 years ago • 2 comments

TL;DR: git-filter-repo crashes when a commit's author line is malformed (here, ␣<Thomas>; the ␣ stands for a literal space and is shown only for clarity). This happens because git-filter-repo unconditionally regex-parses the author line. Full details, including a reproducer, are below.


https://github.com/dkogan/emacs-snapshot does not pass git fsck, and I am looking into how to fix it. I discovered this because my gitconfig runs fsck on transferred objects by default:

[fetch]
  fsckObjects = true
[transfer]
  fsckObjects = true
[receive]
  fsckObjects = true

resulting in:

$ git clone https://github.com/dkogan/emacs-snapshot
Cloning into 'emacs-snapshot'...
remote: Enumerating objects: 1421351, done.
remote: Counting objects: 100% (223/223), done.
remote: Compressing objects: 100% (109/109), done.
error: object 676f341671bedf6007a029bb2f3c472ebe308603: missingNameBeforeEmail: invalid author/committer line - missing space before email
fatal: fsck error in packed object
fatal: index-pack failed

In a sandbox, I downloaded the repo with automatic fsck disabled, then ran fsck over the entire repo and got the following result:

$ git fsck
Checking object directories: 100% (256/256), done.
error in commit 676f341671bedf6007a029bb2f3c472ebe308603: missingNameBeforeEmail: invalid author/committer line - missing space before email
error in commit 7fb42a25e934250edf3ed3c0ef16be2a16c3e3a3: missingNameBeforeEmail: invalid author/committer line - missing space before email
error in commit b563928ca802b46a1facff442d91c6e922ee8e1d: missingNameBeforeEmail: invalid author/committer line - missing space before email
error in commit dd17bafcae829d6420d099e169bbbcec2cdd6cf6: missingNameBeforeEmail: invalid author/committer line - missing space before email
Checking objects: 100% (1421351/1421351), done.
Checking connectivity: 1421351, done.
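
The affected hashes can be collected from output like the above mechanically, which is handy if each commit has to be handled in turn. A minimal Python sketch, assuming the message format shown here (error in commit <hash>: <message id>: ...); the name bad_commits is hypothetical and not part of git or git-filter-repo:

```python
import re

# Matches lines such as:
#   error in commit 676f...: missingNameBeforeEmail: invalid author/committer line - ...
FSCK_ERROR = re.compile(r"^error in commit ([0-9a-f]+): (\w+):")


def bad_commits(fsck_output, msg_id="missingNameBeforeEmail"):
    """Return the commit hashes that fsck reported with the given message id."""
    hashes = []
    for line in fsck_output.splitlines():
        m = FSCK_ERROR.match(line)
        if m and m.group(2) == msg_id:
            hashes.append(m.group(1))
    return hashes
```

Progress lines such as "Checking objects: ..." do not match the pattern and are skipped.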

These are all commits by Thomas Bushnell in 1996/1997! The problem was fixed in the upstream repo, where the same commits exist with a corrected author line:

$ git log --since "Nov 21 1996" --until "Nov 22 1996"

commit aab72f0761de41b0c851c2fae08a10bcc4dc19e2
Author: Thomas Bushnell, BSG <[email protected]>
Date:   Thu Nov 21 22:20:09 1996 +0000

   Revert last change.

commit 416f76359368444a1169d9bc66d2c65cbf9034b4
Author: Thomas Bushnell, BSG <[email protected]>
Date:   Thu Nov 21 21:51:00 1996 +0000

   * config.sub: Recognize gnu-gnu* along with linux-gnu* as a valid
           kernel-os combination.  Remove `-gnu*' from the portable systems
           list.  Add `-gnu-gnu*'.  Add new rule for `-gnu*' to turn it into
           two part name.

commit 31f9169164c9a336520e9b4669cbd6ee17cfdb64
Author: Thomas Bushnell, BSG <[email protected]>
Date:   Thu Nov 21 21:43:48 1996 +0000

   Thu Nov 21 16:42:41 1996  Thomas Bushnell, n/BSG  <[email protected]>
    
           * config.guess [UNAME_SYSTEM == GNU]: Use a four-part
           configuration name for gnu so it can be distinguished from
           foo-foo-linux-gnu with simple globbing patterns.

However, the bad commit is not among the ancestors of HEAD in emacs-snapshot:

emacs-snapshot $ git rev-list HEAD | grep 7fb42a25e934250edf3ed3c0ef16be2a16c3e3a3

... but I could find it thusly:

git log --all 7fb42a25e934250edf3ed3c0ef16be2a16c3e3a3..

... which told me that they were independent histories, so I went looking for the ref that keeps the commit reachable:

git log -1 --oneline --all --ancestry-path 7fb42a25e934250edf3ed3c0ef16be2a16c3e3a3..

and found:

b5e2b5a4d51 (origin/master_before_upstream_git_transition) new snapshot

I tried to use git filter-repo to fix this:

git filter-repo --use-mailmap --force --refs origin/master_before_upstream_git_transition
Parsed 15573 commitsTraceback (most recent call last):
  File "/home/lvh/.pyenv/versions/3.8.0/bin/git-filter-repo", line 3956, in <module>
    filter.run()
  File "/home/lvh/.pyenv/versions/3.8.0/bin/git-filter-repo", line 3892, in run
    self._parser.run(self._input, self._output)
  File "/home/lvh/.pyenv/versions/3.8.0/bin/git-filter-repo", line 1409, in run
    self._parse_commit()
  File "/home/lvh/.pyenv/versions/3.8.0/bin/git-filter-repo", line 1194, in _parse_commit
    (author_name, author_email, author_date) = self._parse_user(b'author')
  File "/home/lvh/.pyenv/versions/3.8.0/bin/git-filter-repo", line 1077, in _parse_user
    (name, email, when) = user_regex.match(self._currentline).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
fatal: stream ends early
fast-import: dumping crash report to .git/fast_import_crash_1665513
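
The AttributeError above is what Python raises when a regex match() returns None and .groups() is called on the result. A minimal Python sketch of that failure mode follows. The STRICT pattern is an assumption modeled on the shape of a normal author line, not copied from git-filter-repo; TOLERANT and parse_author are hypothetical illustrations of a parser that accepts an empty name; the timestamp and the example.invalid address are placeholders.

```python
import re

# ASSUMPTION: simplified pattern for "author Name <email> when".
# It requires a name followed by a space before the "<".
STRICT = re.compile(rb"^author (.*?) <(.*?)> (.*)$")

# Hypothetical tolerant variant: the name and its trailing space are optional.
TOLERANT = re.compile(rb"^author (?:(.*?) )?<(.*?)> (.*)$")

good = b"author Thomas Bushnell, BSG <user@example.invalid> 0 +0000"
bad = b"author <Thomas> 0 +0000"  # empty name: "<" follows "author " directly


def parse_author(line):
    """Parse an author line, treating a missing name as empty."""
    m = TOLERANT.match(line)
    if m is None:
        raise ValueError("unparseable author line: %r" % line)
    name, email, when = m.groups()
    return (name or b"", email, when)


if __name__ == "__main__":
    # The strict pattern returns None for the bad line, so calling
    # .groups() on the result would raise AttributeError.
    print(STRICT.match(bad))
    print(parse_author(bad))
```

Checking the match result before calling .groups() turns the crash into either a usable (possibly empty) name or an explicit error naming the offending line.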

I also tried an explicit Python commit callback, but hit the same problem:

git filter-repo --commit-callback 'if commit.author_email == "Thomas": print(commit)' --force --refs origin/master_before_upstream_git_transition

lvh, Jun 26 '21 22:06

I apologize for the long delay. I suspect you've long since found a workaround, but just in case it helps...

You could create replace objects for the relevant commits using git replace --edit $HASH for each commit in question, and then use git filter-repo --force to make the replacements permanent.
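
As a rough illustration of what the edit for each commit amounts to, here is a Python sketch that rewrites nameless author/committer header lines in raw commit text (the kind of text shown in the editor by git replace --edit). The function name fix_empty_names is hypothetical, and the email address is a placeholder, since the real address is not visible in this thread.

```python
import re

# Header lines whose identity has an empty name, e.g.
#   b"author <Thomas> 0 +0000"
EMPTY_NAME = re.compile(rb"^(author|committer) <([^<>\n]*)> ", re.MULTILINE)


def fix_empty_names(raw_commit, name, email):
    """Give nameless author/committer header lines a proper "Name <email>"."""
    # Headers end at the first blank line; the commit message is left untouched.
    headers, sep, message = raw_commit.partition(b"\n\n")
    fixed = EMPTY_NAME.sub(
        lambda m: m.group(1) + b" " + name + b" <" + email + b"> ",
        headers,
    )
    return fixed + sep + message


if __name__ == "__main__":
    raw = (
        b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        b"author <Thomas> 0 +0000\n"
        b"committer <Thomas> 0 +0000\n"
        b"\n"
        b"Revert last change.\n"
    )
    # Placeholder address; substitute the real one.
    print(fix_empty_names(raw, b"Thomas Bushnell, BSG", b"thomas@example.invalid").decode())
```

Header lines that already carry a name do not match the pattern and pass through unchanged.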

newren, Nov 09 '21 23:11

My only objection is that you felt the need to apologize :sweat_smile: I figured a real world use case for why you might want to avoid unconditionally parsing that entry might be valuable. My actual problem is already resolved. Thank you though!

lvh, Nov 10 '21 00:11