Problems with umlauts on finish?
I'm using Git on Windows
git version 2.13.0.windows.1
With the latest master revision of git-imerge.
$ git config user.name
Andreas Düring
$ git config user.email
Andreas.duering@company
After git imerge finish, the umlaut in the author name is encoded incorrectly. In this case, I attempted a revert. The resulting log looks like this:
commit xxxxx (HEAD -> xxxx) # 2nd revert
Author: Andreas DÃ¼ring <Andreas.duering@company>
Commit: Andreas Düring <Andreas.duering@company>
Revert "xxxxx "
This reverts commit xxxxx .
commit xxxxx #1st revert
Author: Andreas Düring <Andreas.duering@company>
Commit: Andreas Düring <Andreas.duering@company>
Revert "xxxxx "
This reverts commit xxxxx .
commit xxxxx #base commit
Author: Andreas Düring <Andreas.duering@company>
Commit: Andreas Düring <Andreas.duering@company>
xxxxx
This affects only the final commit. The "in-between" branches/merges are fine.
I forgot to mention the Python version:
$ python --version
Python 3.6.0
Maybe this is related to #62?
The mojibake that you pasted, DÃ¼ring, looks like the UTF-8 encoding of ü. Can you check somehow exactly what characters are in the commit? For example, run
git cat-file commit $SHA1
and view the output in a hex editor or something. Try doing the same for one of the other commits, where your name appears correctly.
What is the preferred encoding on your system? On mine it is UTF-8:
$ python -c 'import locale; print(locale.getpreferredencoding())'
UTF-8
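If no hex editor is at hand, the raw bytes of a commit header can also be inspected from Python. The following is a minimal sketch and not part of git-imerge: the helper names `header_bytes`, `as_hex`, and `read_commit` are made up for illustration, and the only Git command used is the `git cat-file commit` call mentioned above.

```python
import subprocess


def header_bytes(raw_commit, field):
    """Return the raw value of one header line (e.g. b"author"), or None."""
    # Headers end at the first blank line; the commit message follows.
    headers = raw_commit.split(b"\n\n", 1)[0]
    for line in headers.split(b"\n"):
        if line.startswith(field + b" "):
            return line[len(field) + 1:]
    return None


def as_hex(data):
    """Render bytes as space-separated hex pairs, like a hex dump."""
    return " ".join("{:02x}".format(b) for b in bytearray(data))


def read_commit(sha1):
    """Fetch the raw commit object as bytes; nothing is decoded here."""
    return subprocess.check_output(["git", "cat-file", "commit", sha1])
```

Calling `as_hex(header_bytes(read_commit(sha1), b"author"))` for a good commit and for the suspicious one should make any difference in the bytes of the name visible without relying on how the terminal renders them.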
The output of cat-file for the commit with the weird encoding is
00000000 74 72 65 65 20 62 34 38 65 63 35 62 62 64 61 35 |tree b48ec5bbda5|
00000010 33 33 63 35 63 66 30 32 36 62 33 64 38 36 33 33 |33c5cf026b3d8633|
00000020 31 33 32 62 33 63 64 38 39 35 39 34 37 0a 70 61 |132b3cd895947.pa|
00000030 72 65 6e 74 20 33 32 62 34 32 66 61 32 39 34 39 |rent 32b42fa2949|
00000040 30 30 36 66 66 38 62 30 32 39 61 30 66 64 63 64 |006ff8b029a0fdcd|
00000050 62 34 66 35 63 31 63 32 61 33 30 62 39 0a 61 75 |b4f5c1c2a30b9.au|
00000060 74 68 6f 72 20 41 6e 64 72 65 61 73 20 44 c3 83 |thor Andreas D..|
00000070 c2 bc 72 69 6e 67 20 3c 41 6e 64 72 65 61 73 2e |..ring <Andreas.|
00000080 64 75 65 72 69 6e 67 40 63 6f 6d 70 61 6e 79 3e |duering@company>|
00000090 20 31 34 39 34 38 34 36 38 34 34 20 2b 30 32 30 | 1494846844 +020|
000000a0 30 0a 63 6f 6d 6d 69 74 74 65 72 20 41 6e 64 72 |0.committer Andr|
000000b0 65 61 73 20 44 c3 bc 72 69 6e 67 20 3c 41 6e 64 |eas D..ring <And|
000000c0 72 65 61 73 2e 64 75 65 72 69 6e 67 40 63 6f 6d |reas.duering@com|
000000d0 70 61 6e 79 3e 20 31 34 39 34 38 36 31 39 38 36 |pany> 1494861986|
000000e0 20 2b 30 32 30 30 0a 0a 6d 65 73 73 61 67 65 0a | +0200..message.|
000000f0
So for the author, ü is encoded as c3 83 c2 bc, for the committer as c3 bc (which is the correct UTF-8 encoding of ü).
And the output for the parent commit:
00000000 74 72 65 65 20 61 34 33 39 65 37 34 36 31 66 61 |tree a439e7461fa|
00000010 64 62 63 64 65 61 30 35 31 34 36 61 61 65 63 64 |dbcdea05146aaecd|
00000020 35 31 33 63 35 39 34 61 32 36 38 62 30 0a 70 61 |513c594a268b0.pa|
00000030 72 65 6e 74 20 33 37 31 64 62 65 36 33 63 65 63 |rent 371dbe63cec|
00000040 36 36 64 63 61 61 33 30 35 38 63 66 35 38 32 35 |66dcaa3058cf5825|
00000050 32 61 64 35 32 33 64 66 39 36 62 65 64 0a 61 75 |2ad523df96bed.au|
00000060 74 68 6f 72 20 41 6e 64 72 65 61 73 20 44 c3 bc |thor Andreas D..|
00000070 72 69 6e 67 20 3c 41 6e 64 72 65 61 73 2e 64 75 |ring <Andreas.du|
00000080 65 72 69 6e 67 40 63 6f 6d 70 61 6e 79 3e 20 31 |ering@company> 1|
00000090 34 39 34 38 34 31 32 31 30 20 2b 30 32 30 30 0a |494841210 +0200.|
000000a0 63 6f 6d 6d 69 74 74 65 72 20 41 6e 64 72 65 61 |committer Andrea|
000000b0 73 20 44 c3 bc 72 69 6e 67 20 3c 41 6e 64 72 65 |s D..ring <Andre|
000000c0 61 73 2e 64 75 65 72 69 6e 67 40 63 6f 6d 70 61 |as.duering@compa|
000000d0 6e 79 3e 20 31 34 39 34 38 34 31 32 31 30 20 2b |ny> 1494841210 +|
000000e0 30 32 30 30 0a 0a 6d 65 73 73 61 67 65 0a |0200..message.|
000000ee
$ echo $LANG # mingw64 environment
de_DE.UTF-8
However, Python reports
$ python -c 'import locale; print(locale.getpreferredencoding())' #python3 syntax
cp1252
ü in cp1252 is 0xfc.
I set up a small demo repo:
https://github.com/dueringa/imerge-encoding-demo
The problem arises only on a rebase, not on a merge.
ü in cp1252 is 0xFC; in Unicode it is the same (U+00FC).
The UTF-8 encoding of U+00FC is 0xC3 0xBC. In cp1252 those bytes represent Ã¼, the characters that you saw.
Ã and ¼, in turn, are U+00C3 and U+00BC in Unicode. Those two characters are encoded in UTF-8 as 0xC3 0x83 0xC2 0xBC, which are the bytes in the faulty commit.
So my guess is that Git is respecting the mingw64 configuration and outputting the commit metadata in UTF-8. But Python is reading the data under the assumption that they are cp1252, converting the two bytes of ü into two Unicode code points (Ã and ¼) internally, and outputting them as four bytes of UTF-8. When you view the commit, Git assumes that the data are already in UTF-8, so it passes them through unchanged.
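The chain described above can be reproduced in a few lines of Python. This only illustrates the suspected mechanism; it is not code taken from git-imerge, and the variable names are arbitrary.

```python
# -*- coding: utf-8 -*-
name = u"\u00fc"  # the character ü

# What Git writes: the UTF-8 encoding of U+00FC, i.e. bytes c3 bc.
git_bytes = name.encode("utf-8")

# What Python sees if it decodes those bytes as cp1252: two characters,
# U+00C3 and U+00BC.
misread = git_bytes.decode("cp1252")

# Writing the misread text back out as UTF-8 yields four bytes,
# c3 83 c2 bc, matching the author field of the faulty commit.
double_encoded = misread.encode("utf-8")

# Undoing the damage reverses the same steps.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")

print(repr(git_bytes))
print([hex(ord(c)) for c in misread])
print(repr(double_encoded))
print(repaired == name)
```

The reverse round trip in `repaired` is shown only to confirm the diagnosis; it is not a suggestion to rewrite existing commits this way.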
I'm not an expert on encoding issues and know even less about encodings on Windows, but it seems to me that you want to tell Python to use UTF-8. One crude way to do this would be to set
PREFERRED_ENCODING = 'utf8'
near the top of the git-imerge script. Does that help?
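For future readers wondering what such an override changes, here is a hedged sketch of the general idea. It assumes, based only on the name of the setting, that the script decodes Git's output with a module-level encoding that otherwise comes from `locale.getpreferredencoding()`. The helper `decode_git_output` is hypothetical and does not reflect the actual git-imerge code.

```python
import locale

# Default: whatever the platform reports (cp1252 in the Windows setup above).
PREFERRED_ENCODING = locale.getpreferredencoding()

# Override: Git emitted UTF-8 in this thread, so decode its output as UTF-8.
PREFERRED_ENCODING = "utf8"


def decode_git_output(raw, encoding=None):
    """Hypothetical helper: turn bytes read from a git subprocess into text."""
    return raw.decode(encoding or PREFERRED_ENCODING)
```

With the override in place, `decode_git_output(b"D\xc3\xbcring")` gives the name with a single ü, whereas passing `"cp1252"` explicitly reproduces the two-character mojibake.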
Yes, your solution works fine. The resulting rebase is encoded correctly.
Possibly related Stack Overflow questions (for future readers):
- http://stackoverflow.com/a/27066059/3872702 - probably Linux-related
- http://stackoverflow.com/a/11516682/3872702 - won't work in mingw shell, no chcp
- http://stackoverflow.com/q/31469707/3872702