gitlint icon indicating copy to clipboard operation
gitlint copied to clipboard

Unicode decoding issue on Windows

Open jorisroovers opened this issue 6 years ago • 5 comments

When trying to decode Unicode characters on Windows, gitlint can crash. This can easily be shown by trying to lint the commit 3ee281eec4ef4325ce90d27f6f368a6b95818cfe of the gitlint commit history.

(.venv) C:\Users\Administrator\gitlint>gitlint --commits 3ee281eec4ef4325ce90d27f6f368a6b95818cfe
Traceback (most recent call last):
  File "C:\Users\Administrator\gitlint\gitlint\.venv\Scripts\gitlint-script.py", line 11, in <module>
    load_entry_point('gitlint', 'console_scripts', 'gitlint')()
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 1114, in invoke
    return Command.invoke(self, ctx)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\administrator\gitlint\gitlint\cli.py", line 180, in cli
    ctx.invoke(lint)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\administrator\gitlint\gitlint\cli.py", line 211, in lint
    gitcontext = GitContext.from_local_repository(lint_config.target, ctx.obj[2])
  File "c:\users\administrator\gitlint\gitlint\git.py", line 199, in from_local_repository
    raw_commit = _git("log", sha, "-1", long_format, _cwd=repository_path).split("\n")
  File "c:\users\administrator\gitlint\gitlint\git.py", line 28, in _git
    result = sh.git(*command_parts, **git_kwargs)  # pylint: disable=unexpected-keyword-arg
  File "c:\users\administrator\gitlint\gitlint\shell.py", line 45, in git
    return _exec(*args, **kwargs)
  File "c:\users\administrator\gitlint\gitlint\shell.py", line 65, in _exec
    stdout = ustr(result[0])
  File "c:\users\administrator\gitlint\gitlint\utils.py", line 27, in ustr
    return unicode(obj, DEFAULT_ENCODING)  # pragma: no cover # noqa
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1: character maps to <undefined>

jorisroovers avatar Jul 08 '19 13:07 jorisroovers

I spend some time looking into this. Disclaimer: I'm by no means a unicode expert and what follows is based on limited research. If anyone reading this has more expertise in this area, please chime in!

TLDR: A fix will follow that will takes into account the LC_ALL, LC_CTYPE, LANG env vars on Windows and otherwise fallsback to UTF-8. This matches git's behavior on Windows.

In a nutshell, the problem is here https://github.com/jorisroovers/gitlint/blob/master/gitlint/utils.py#L8

In particular, on my windows 10 VM, Python 2.7, both CMD and Powershell:

$ python -c "import locale; print locale.getpreferredencoding()"
cp1252

This causes problems with gitlint tries to parse unicode characters encoded in UTF-8 (or any encoding other than cp1252).

I first considered to just remove the call to locale.getpreferredencoding() and always use UTF-8. However, while I believe that works for the majority of users, I don't want to downgrade the experience for power-users. For users that want/need an encoding different from UTF-8, this is probably a hard requirement to be able to use gitlint at all.

I then considered to hard-code UTF-8 only on Windows (and keep the current code on other platforms), but after some more research I discovered that git actually adheres to the LC_ALL environment variable on Windows (see footnote 1) - which allows users to define which charset to use. Note that reading the LC_ALL envvar is somewhat atypical for Windows. I personally think this is a rather useful workaround and given gitlint's dependency on git itself, it makes sense to follow the same behavior.

So in a nutshell, what I'm thinking to implement now is:

  1. On non-Windows machines, everything stays as is
  2. On Windows-machines:
    1. Try to determine preferred encoding from environment variables LC_ALL, LC_CTYPE, LANG (in that order)
    2. Otherwise, fallback on UTF-8
    3. Ignore locale.getpreferredencoding() all-together (like git).

Footnotes

Footnote 1: git itself requires LC_ALL on Windows

As a side-note, git itself requires you to explicitly set LC_ALL on Windows to properly display unicode characters, like so:

# Without 'LC_ALL=C.UTF-8'
$ git log -1 --pretty="%an" 3ee281
<C5><81>ukasz Rogalski
# With 'LC_ALL=C.UTF-8'
$ set LC_ALL=C.UTF-8
$ git log -1 --pretty="%an" 3ee281
Łukasz Rogalski

I believe this is the relevant source-code in git itself: https://github.com/git/git/blob/5fa0f5238b0cd46cfe7f6fa76c3f526ea98148d9/gettext.c#L15-L32

Footnote 2:

From what I could gather, there's a lot discussion about python's unicode character decoding on Windows. In particular, from diagonally reading a relevant blog post, I eventually stumbled upon PEP 528 which explains how in Python >= 3.6, Python defaults to UTF-8 for Windows console encoding. More indication that Unicode handling in python is messy, particularly so on Windows.

Footnote 3:

Python allows you to specify the error behavior for unicode errors (Python2, Python3). While it's possible to implement a git-like behavior of printing placeholder chars on unicode decoding errors, I'm probably not going to do this for now. I'd prefer gitlint to hard-crash on decoding issues for now so it's more likely that users report the issues they encounter.

jorisroovers avatar Aug 21 '19 22:08 jorisroovers

The plot thickens!

In my previous comment, I really only considered reads of unicode characters from the git command output, i.e. the cause of the crash in the original description. That issue is mostly solved by c939a0d913d96fbc308399dfaa931740a0db2684 (although I need to amend that commit with a small fix).

However, turns out that writing unicode characters to the Windows console is an entire beast on its own - see issue1602. In a nutshell, properly and consistently printing Unicode characters to the Windows console is very messy in python. From what I can gather, manually working around this is complicated.

The good news is that the Click library's click.echo() function (which we already use in part of the codebase) has all the necessary work-arounds baked in.

So what I'm planning to do now is replace all occurrences where we are writing to stdout/stderr directly with the click.echo() function.

The 2 most relevant places:

  • In the display module
  • Writing a custom LogHandler that uses click.echo() to print (needed to fix unicode characters in log/debug messages).

Hopefully this will work!

jorisroovers avatar Aug 29 '19 13:08 jorisroovers

Quick example of something that doesn't work as expected yet: Try the following on Python 2.7 on Windows (this does work on Python 3.x).

echo WIP: tëst | gitlint

This will crash gitlint on a unicode detection error.

jorisroovers avatar Feb 25 '20 13:02 jorisroovers

Is it here I should write about a bug?

when I install gitlint hook by pre-commit in linux it works normally, but in windows it failed with error

gitlint..................................................................Failed
- hook id: gitlint
- exit code: 1

Traceback (most recent call last):
  File "C:\Users\metya\scoop\apps\miniconda3\4.7.12.1\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\metya\scoop\apps\miniconda3\4.7.12.1\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\Scripts\gitlint.EXE\__main__.py", line 7, in <module>
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 1114, in invoke
    return Command.invoke(self, ctx)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\cli.py", line 204, in cli
    log_system_info()
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\cli.py", line 51, in log_system_info
    LOG.debug("Git version: %s", git_version())
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\git.py", line 57, in git_version
    return _git("--version").replace(u"\n", u"")
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\git.py", line 33, in _git
    result = sh.git(*command_parts, **git_kwargs)  # pylint: disable=unexpected-keyword-arg
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\shell.py", line 45, in git
    return _exec(*args, **kwargs)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\shell.py", line 65, in _exec
    stdout = ustr(result[0])
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\utils.py", line 88, in ustr
    return obj.decode(DEFAULT_ENCODING)
LookupError: unknown encoding: C

Even if the commit message is perfectly valid.

metya avatar Sep 16 '20 03:09 metya

@metya, can you try setting LC_ALL to UTF-8 and let me know if that works for you?

# Regular windows CMD
Set LC_ALL=UTF-8

# git-bash/Cygwin
export LC_ALL=UTF-8

# Now try again

The reason this happens is because git sets LC_CTYPE=C on Windows when invoking /bin/sh, and Python can't find the C encoding. I've added a workaround in #158, adding some logic for gitlint to more smartly fall back to UTF-8. This will go out as part of the 0.14.0 release within the next month.

This doesn't solve all unicode issues on Windows (I've spend more time on it but no silver bullets...yet), but hopefully it should keep gitlint from crashing.

jorisroovers avatar Oct 17 '20 08:10 jorisroovers