Feature Idea: Support NFD to NFC conversion in Filenames

Open lambdafu opened this issue 2 years ago • 3 comments

Hi,

first of all: thank you for writing git-filter-repo, it's a huge life-saver. :heart: :hugs:

I am converting several really large SVN repos to git, and I use git-filter-repo to reorganize these repositories. While doing so, I noticed that some Mac users committed files whose names are in Unicode NFD form (where ü is represented as u followed by a combining diaeresis, U+0308, etc.). When checking out these repos on a Mac, git thinks these are modified files (depending on core.precomposeUnicode), which leads to all sorts of issues. The best approach is to rewrite all commits to use NFC consistently, because that is the new default on Mac as well.
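To make the NFD/NFC difference concrete, here is a small sketch using Python's standard `unicodedata` module. The variable names are only illustrative. It shows that the two forms of ü look identical but are different code point sequences (and different bytes), which is why a byte-wise filename comparison treats them as different files.

```python
import unicodedata

# ü as a single precomposed code point (NFC form)
nfc_name = "\u00fc.txt"
# u followed by COMBINING DIAERESIS U+0308 (NFD form)
nfd_name = "u\u0308.txt"

# The strings render the same but do not compare equal.
same_text = nfc_name == nfd_name

# Normalizing converts one form into the other.
recomposed = unicodedata.normalize("NFC", nfd_name)
decomposed = unicodedata.normalize("NFD", nfc_name)

# The UTF-8 encodings differ in length as well as content.
nfc_bytes = nfc_name.encode("utf-8")
nfd_bytes = nfd_name.encode("utf-8")
```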

I found a script by Michael Maier that does the conversion, https://gitlab.com/-/snippets/1976332, which uses git filter-branch. The script is unfortunately very slow. Obviously it would be a huge win to port this to git filter-repo. However, a catch is that the script only works on Mac, because it relies on iconv with the UTF-8-MAC codec, which is a special snowflake and not supported on Linux versions of iconv (I have no idea why, it seems rather silly).

I'd hope that somebody is interested in bringing this script to git filter-repo, hopefully replacing iconv with some Python-based portable solution, so we can all benefit from this.

Anyway, thanks again, and happy hacking!

lambdafu avatar Oct 20 '21 21:10 lambdafu

That script has about 150 lines of boilerplate, but the magic sauce from that snippet appears to be:

function nfd2nfc() {
	echo "$1" | iconv -f utf-8-mac -t utf-8
}

which means you should be able to do this conversion as follows without any feature additions to filter-repo (I haven't tested so I might have a typo or thinko below):

git filter-repo --filename-callback '
    return subprocess.check_output("iconv -f utf-8-mac -t utf-8".split(),
                                   input=filename)
'

which also gets rid of the huge amount of boilerplate.

Do you want to try and confirm whether that works for you?

newren avatar Nov 09 '21 18:11 newren

Awesome! This worked like a charm, and took 20 seconds instead of 20 hours! So this is a speedup by a factor of 3600. I confirmed that this produces identical results (same commit hash at the HEAD of the tree).

lambdafu avatar Nov 10 '21 15:11 lambdafu

Here's to those who are truly lost:

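git filter-repo --filename-callback '
    try:
        # Convert the NFD (UTF-8-MAC) name to NFC via iconv.
        return subprocess.check_output("iconv -f utf-8-mac -t utf-8".split(),
                                       input=filename)
    except (subprocess.CalledProcessError, OSError):
        # Keep the original name if iconv fails or is unavailable.
        return filename
'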
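For systems where iconv lacks the UTF-8-MAC codec (the Linux limitation mentioned at the top of this issue), a Python-only variant might look like the sketch below. It is untested against a real repository and rests on two assumptions: that the filename callback receives and returns `bytes`, as the iconv version above implies, and that standard NFC normalization is an acceptable substitute for UTF-8-MAC. UTF-8-MAC is Apple's own variant rather than strict NFD, so the results may not be byte-identical to the iconv output for every filename. The helper name `nfd_to_nfc` is hypothetical.

```python
import unicodedata


def nfd_to_nfc(filename: bytes) -> bytes:
    """Return the filename with its UTF-8 text normalized to NFC."""
    try:
        text = filename.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the name untouched rather than guess.
        return filename
    return unicodedata.normalize("NFC", text).encode("utf-8")
```

Inside `--filename-callback`, the same logic would presumably be written as the callback body directly: an `import unicodedata` line followed by the decode, normalize, and encode steps applied to `filename`. Comparing the resulting commit hashes against an iconv-based run on a Mac, as was done above, would be a reasonable way to check whether the two approaches agree for a given repository.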

lambdafu avatar Nov 10 '21 21:11 lambdafu