git-imerge

Problems with umlauts on finish?

Open dueringa opened this issue 8 years ago • 7 comments

I'm using Git on Windows

git version 2.13.0.windows.1

With the latest master revision of git-imerge.

$ git config user.name
Andreas Düring

$ git config user.email
Andreas.duering@company

On git imerge finish, the author name, which contains an umlaut, is encoded incorrectly. In this case, I attempted a revert. The resulting log looks like this:

commit xxxxx (HEAD -> xxxx) # 2nd revert
Author: Andreas DÃ¼ring <Andreas.duering@company>
Commit: Andreas Düring <Andreas.duering@company>

    Revert "xxxxx "

    This reverts commit xxxxx .

commit xxxxx #1st revert
Author: Andreas Düring <Andreas.duering@company>
Commit: Andreas Düring <Andreas.duering@company>

    Revert "xxxxx "

    This reverts commit xxxxx .

commit xxxxx #base commit
Author: Andreas Düring <Andreas.duering@company>
Commit: Andreas Düring <Andreas.duering@company>

    xxxxx 

This only affects the final commit. The "inbetween" branches/merges are fine.

dueringa avatar May 15 '17 11:05 dueringa

Forgot the python version:

$ python --version
Python 3.6.0

dueringa avatar May 15 '17 11:05 dueringa

Maybe this is related to #62 ?

dueringa avatar May 15 '17 11:05 dueringa

The mojibake that you pasted, DÃ¼ring, looks like the UTF-8 encoding of ü. Can you check somehow exactly what characters are in the commit? For example, run

git cat-file commit $SHA1 

and view the output in a hex editor or something. Try doing the same for one of the other commits, where your name appears correctly.
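If no hex editor is at hand, the same inspection can be sketched in Python. This is only an illustration: `commit_bytes` and `hexdump` are hypothetical helper names, not part of git-imerge, and the only Git command assumed is the `git cat-file commit` call shown above.

```python
import subprocess


def commit_bytes(sha1):
    """Return the raw, undecoded bytes of a commit object.

    Hypothetical helper: it simply runs `git cat-file commit <sha1>`
    in the current repository and returns stdout as bytes.
    """
    return subprocess.check_output(["git", "cat-file", "commit", sha1])


def hexdump(data, width=16):
    """Format bytes as offset, hex values, and printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hexpart = " ".join("%02x" % b for b in chunk)
        # Non-printable and non-ASCII bytes are shown as '.'
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("%08x  %-*s  |%s|" % (offset, width * 3 - 1, hexpart, text))
    return "\n".join(lines)


# Usage inside a repository (not executed here):
#     print(hexdump(commit_bytes("HEAD")))
```

Because the bytes are never decoded, the dump shows what is stored in the commit regardless of the terminal or locale settings.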

What is the preferred encoding on your system? On mine it is UTF-8:

$ python -c 'import locale; print locale.getpreferredencoding()'
UTF-8

mhagger avatar May 16 '17 16:05 mhagger

The output of cat-file for the commit with the weird encoding is

00000000  74 72 65 65 20 62 34 38  65 63 35 62 62 64 61 35  |tree b48ec5bbda5|
00000010  33 33 63 35 63 66 30 32  36 62 33 64 38 36 33 33  |33c5cf026b3d8633|
00000020  31 33 32 62 33 63 64 38  39 35 39 34 37 0a 70 61  |132b3cd895947.pa|
00000030  72 65 6e 74 20 33 32 62  34 32 66 61 32 39 34 39  |rent 32b42fa2949|
00000040  30 30 36 66 66 38 62 30  32 39 61 30 66 64 63 64  |006ff8b029a0fdcd|
00000050  62 34 66 35 63 31 63 32  61 33 30 62 39 0a 61 75  |b4f5c1c2a30b9.au|
00000060  74 68 6f 72 20 41 6e 64  72 65 61 73 20 44 c3 83  |thor Andreas D..|
00000070  c2 bc 72 69 6e 67 20 3c  41 6e 64 72 65 61 73 2e  |..ring <Andreas.|
00000080  64 75 65 72 69 6e 67 40  63 6f 6d 70 61 6e 79 3e  |duering@company>|
00000090  20 31 34 39 34 38 34 36  38 34 34 20 2b 30 32 30  | 1494846844 +020|
000000a0  30 0a 63 6f 6d 6d 69 74  74 65 72 20 41 6e 64 72  |0.committer Andr|
000000b0  65 61 73 20 44 c3 bc 72  69 6e 67 20 3c 41 6e 64  |eas D..ring <And|
000000c0  72 65 61 73 2e 64 75 65  72 69 6e 67 40 63 6f 6d  |reas.duering@com|
000000d0  70 61 6e 79 3e 20 31 34  39 34 38 36 31 39 38 36  |pany> 1494861986|
000000e0  20 2b 30 32 30 30 0a 0a  6d 65 73 73 61 67 65 0a  | +0200..message.|
000000f0

So for the author, ü is encoded as c3 83 c2 bc; for the committer, as c3 bc (which is the correct UTF-8 encoding of ü). The output for the parent commit is

00000000  74 72 65 65 20 61 34 33  39 65 37 34 36 31 66 61  |tree a439e7461fa|
00000010  64 62 63 64 65 61 30 35  31 34 36 61 61 65 63 64  |dbcdea05146aaecd|
00000020  35 31 33 63 35 39 34 61  32 36 38 62 30 0a 70 61  |513c594a268b0.pa|
00000030  72 65 6e 74 20 33 37 31  64 62 65 36 33 63 65 63  |rent 371dbe63cec|
00000040  36 36 64 63 61 61 33 30  35 38 63 66 35 38 32 35  |66dcaa3058cf5825|
00000050  32 61 64 35 32 33 64 66  39 36 62 65 64 0a 61 75  |2ad523df96bed.au|
00000060  74 68 6f 72 20 41 6e 64  72 65 61 73 20 44 c3 bc  |thor Andreas D..|
00000070  72 69 6e 67 20 3c 41 6e  64 72 65 61 73 2e 64 75  |ring <Andreas.du|
00000080  65 72 69 6e 67 40 63 6f  6d 70 61 6e 79 3e 20 31  |ering@company> 1|
00000090  34 39 34 38 34 31 32 31  30 20 2b 30 32 30 30 0a  |494841210 +0200.|
000000a0  63 6f 6d 6d 69 74 74 65  72 20 41 6e 64 72 65 61  |committer Andrea|
000000b0  73 20 44 c3 bc 72 69 6e  67 20 3c 41 6e 64 72 65  |s D..ring <Andre|
000000c0  61 73 2e 64 75 65 72 69  6e 67 40 63 6f 6d 70 61  |as.duering@compa|
000000d0  6e 79 3e 20 31 34 39 34  38 34 31 32 31 30 20 2b  |ny> 1494841210 +|
000000e0  30 32 30 30 0a 0a 6d 65  73 73 61 67 65 0a        |0200..message.|
000000ee

$ echo $LANG # mingw64 environment
de_DE.UTF-8

However, python outputs

$ python -c 'import locale; print(locale.getpreferredencoding())' #python3 syntax
cp1252

ü in cp1252 is 0xfc.

dueringa avatar May 17 '17 07:05 dueringa

I set up a small demo repo:

https://github.com/dueringa/imerge-encoding-demo

The problem only arises on a rebase, not on a merge.

dueringa avatar May 17 '17 08:05 dueringa

ü in cp1252 is 0xFC; in Unicode it is the same (U+00FC). The UTF-8 encoding of U+00FC is 0xC3 0xBC. In cp1252 those bytes represent Ã¼, the characters that you saw. Ã¼, in turn, are U+00C3 U+00BC in Unicode. Those two characters are encoded in UTF-8 as 0xC3 0x83 0xC2 0xBC, which are the bytes in the faulty commit.
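The chain of conversions can be reproduced in a few lines of Python 3. This is a minimal sketch that only uses the standard `str.encode` / `bytes.decode` methods; the variable names are illustrative and nothing here is taken from the git-imerge source.

```python
# Reproduce the double encoding step by step.
name = "D\u00fcring"  # "Düring", written with an escape so the source stays ASCII

# 1. Git emits the name as UTF-8.
utf8_bytes = name.encode("utf-8")

# 2. A reader that assumes cp1252 turns the two bytes of ü into two characters.
misread = utf8_bytes.decode("cp1252")

# 3. Writing that text back out as UTF-8 yields four bytes instead of two.
double_encoded = misread.encode("utf-8")


def as_hex(data):
    """Render bytes as space-separated hex pairs."""
    return " ".join("%02x" % b for b in data)


print(as_hex(utf8_bytes))      # bytes of the correctly encoded name
print(as_hex(double_encoded))  # bytes of the double-encoded name

# The damage is reversible by undoing the steps in the opposite order.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
```

The second line printed contains the sequence c3 83 c2 bc, the same bytes that appear in the author field of the faulty commit.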

So my guess is that Git is respecting the mingw64 configuration and outputting the commit metadata in UTF-8. But Python is reading the data under the assumption that they are cp1252, converting the ü to two Unicode code points internally, and outputting them as four bytes of UTF-8. When you view the commit, Git assumes that the data are already in UTF-8, so it passes them through unchanged.

I'm not an expert on encoding issues and know even less about encodings on Windows, but it seems to me that you want to tell Python to use UTF-8. One crude way to do this would be to set

PREFERRED_ENCODING = 'utf8'

near the top of the git-imerge script. Does that help?
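Why the override would help can be sketched as follows. This assumes that the script uses `PREFERRED_ENCODING` both to decode Git's output and to encode what it passes back to Git; that is an assumption made for illustration, and `decode_git_output` is a hypothetical helper, not a function from git-imerge.

```python
# Assumed override; without it the value would typically come from
# locale.getpreferredencoding(), which was cp1252 on the reporter's system.
PREFERRED_ENCODING = "utf-8"


def decode_git_output(raw, encoding=PREFERRED_ENCODING):
    """Hypothetical helper: decode bytes read from a git subprocess."""
    return raw.decode(encoding)


# Sample author line as Git emits it (UTF-8), taken from the hex dump above.
raw = b"author Andreas D\xc3\xbcring <Andreas.duering@company>"

# With the override, decoding and re-encoding is lossless.
text = decode_git_output(raw)
roundtrip = text.encode(PREFERRED_ENCODING)

# Reading as cp1252 and writing as UTF-8 reproduces the corruption.
corrupted = decode_git_output(raw, "cp1252").encode("utf-8")
```

With a consistent encoding on both sides the bytes survive unchanged; the corruption only appears when the read side and the write side disagree.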

mhagger avatar May 17 '17 12:05 mhagger

Yes, your solution works fine. The resulting rebase is encoded correctly.

Possibly related Stack Overflow questions (for future readers):

  • http://stackoverflow.com/a/27066059/3872702 - probably linux related
  • http://stackoverflow.com/a/11516682/3872702 - won't work in mingw shell, no chcp
  • http://stackoverflow.com/q/31469707/3872702

dueringa avatar May 17 '17 12:05 dueringa