Unicode decoding issue on Windows
When trying to decode Unicode characters on Windows, gitlint can crash. This can easily be shown by trying to lint the commit 3ee281eec4ef4325ce90d27f6f368a6b95818cfe of the gitlint commit history.
(.venv) C:\Users\Administrator\gitlint>gitlint --commits 3ee281eec4ef4325ce90d27f6f368a6b95818cfe
Traceback (most recent call last):
File "C:\Users\Administrator\gitlint\gitlint\.venv\Scripts\gitlint-script.py", line 11, in <module>
load_entry_point('gitlint', 'console_scripts', 'gitlint')()
File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 717, in main
rv = self.invoke(ctx)
File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 1114, in invoke
return Command.invoke(self, ctx)
File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 555, in invoke
return callback(*args, **kwargs)
File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "c:\users\administrator\gitlint\gitlint\cli.py", line 180, in cli
ctx.invoke(lint)
File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 555, in invoke
return callback(*args, **kwargs)
File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "c:\users\administrator\gitlint\gitlint\cli.py", line 211, in lint
gitcontext = GitContext.from_local_repository(lint_config.target, ctx.obj[2])
File "c:\users\administrator\gitlint\gitlint\git.py", line 199, in from_local_repository
raw_commit = _git("log", sha, "-1", long_format, _cwd=repository_path).split("\n")
File "c:\users\administrator\gitlint\gitlint\git.py", line 28, in _git
result = sh.git(*command_parts, **git_kwargs) # pylint: disable=unexpected-keyword-arg
File "c:\users\administrator\gitlint\gitlint\shell.py", line 45, in git
return _exec(*args, **kwargs)
File "c:\users\administrator\gitlint\gitlint\shell.py", line 65, in _exec
stdout = ustr(result[0])
File "c:\users\administrator\gitlint\gitlint\utils.py", line 27, in ustr
return unicode(obj, DEFAULT_ENCODING) # pragma: no cover # noqa
File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1: character maps to <undefined>
I spent some time looking into this. Disclaimer: I'm by no means a Unicode expert and what follows is based on limited research. If anyone reading this has more expertise in this area, please chime in!
TLDR: A fix will follow that takes the LC_ALL, LC_CTYPE and LANG env vars into account on Windows and otherwise falls back to UTF-8. This matches git's behavior on Windows.
In a nutshell, the problem is here https://github.com/jorisroovers/gitlint/blob/master/gitlint/utils.py#L8
In particular, on my Windows 10 VM with Python 2.7, in both CMD and PowerShell:
$ python -c "import locale; print locale.getpreferredencoding()"
cp1252
This causes problems when gitlint tries to parse Unicode characters encoded in UTF-8 (or any encoding other than cp1252).
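The failure can be reproduced in isolation. Below is a minimal sketch in Python 3 syntax (not gitlint code): the byte string is the UTF-8 encoding of the author name of that commit, and decoding it as cp1252 fails on the same 0x81 byte reported in the traceback.

```python
# UTF-8 bytes for the author name; the first character encodes to 0xC5 0x81.
raw = b"\xc5\x81ukasz Rogalski"

# Decoding with the encoding the bytes were written in works fine.
decoded_utf8 = raw.decode("utf-8")

# cp1252 has no character assigned to byte 0x81, so strict decoding raises.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# ascii() keeps the output printable on consoles that can't show the name.
print(ascii(decoded_utf8))
print(cp1252_error)
```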
I first considered just removing the call to locale.getpreferredencoding() and always using UTF-8. However, while I believe that works for the majority of users, I don't want to downgrade the experience for power users. For users that want/need an encoding different from UTF-8, this is probably a hard requirement to be able to use gitlint at all.
I then considered hard-coding UTF-8 only on Windows (and keeping the current code on other platforms), but after some more research I discovered that git actually adheres to the LC_ALL environment variable on Windows (see footnote 1), which allows users to define which charset to use. Note that reading the LC_ALL envvar is somewhat atypical for Windows. I personally think this is a rather useful workaround and, given gitlint's dependency on git itself, it makes sense to follow the same behavior.
So in a nutshell, what I'm thinking of implementing now is:
- On non-Windows machines, everything stays as is
- On Windows machines:
  - Try to determine the preferred encoding from the environment variables LC_ALL, LC_CTYPE, LANG (in that order)
  - Otherwise, fall back on UTF-8
  - Ignore locale.getpreferredencoding() altogether (like git).
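A rough sketch of that plan in Python 3, not the actual fix. The function name wanted_encoding and its parameters are made up for illustration, and treating the part after the dot as the charset (e.g. "C.UTF-8" gives "UTF-8") is my assumption about how the values should be parsed. It returns whatever the user asked for without checking that Python knows a codec by that name.

```python
import locale
import os
import sys


def wanted_encoding(environ=None, is_windows=None):
    """Hypothetical helper: pick an encoding following the plan above."""
    environ = os.environ if environ is None else environ
    if is_windows is None:
        is_windows = sys.platform.startswith("win")

    if not is_windows:
        # Non-Windows machines: everything stays as is.
        return locale.getpreferredencoding()

    # Windows: consult the env vars in order, ignoring the locale module.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:
            # Assumption: "C.UTF-8" or "en_US.UTF-8" carry the charset after
            # the dot; a value without a dot is taken as the charset itself.
            return value.split(".", 1)[1] if "." in value else value

    # Nothing set: fall back on UTF-8.
    return "UTF-8"
```

For example, wanted_encoding({"LC_ALL": "C.UTF-8"}, is_windows=True) yields "UTF-8", and an empty environment on Windows also yields "UTF-8".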
Footnotes
Footnote 1: git itself requires LC_ALL on Windows
As a side-note, git itself requires you to explicitly set LC_ALL on Windows to properly display unicode characters, like so:
# Without 'LC_ALL=C.UTF-8'
$ git log -1 --pretty="%an" 3ee281
<C5><81>ukasz Rogalski
# With 'LC_ALL=C.UTF-8'
$ set LC_ALL=C.UTF-8
$ git log -1 --pretty="%an" 3ee281
Łukasz Rogalski
I believe this is the relevant source-code in git itself: https://github.com/git/git/blob/5fa0f5238b0cd46cfe7f6fa76c3f526ea98148d9/gettext.c#L15-L32
Footnote 2:
From what I could gather, there's a lot of discussion about Python's Unicode decoding on Windows. In particular, from diagonally reading a relevant blog post, I eventually stumbled upon PEP 528, which explains how Python >= 3.6 defaults to UTF-8 for the Windows console encoding. More indication that Unicode handling in Python is messy, particularly so on Windows.
Footnote 3:
Python allows you to specify the error behavior for unicode errors (Python2, Python3). While it's possible to implement a git-like behavior of printing placeholder chars on unicode decoding errors, I'm probably not going to do this for now. I'd prefer gitlint to hard-crash on decoding issues for now so it's more likely that users report the issues they encounter.
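To make the trade-off concrete, here is a small Python 3 sketch of the two error behaviors using the standard errors argument of bytes.decode(). It is only an illustration, not code from gitlint.

```python
# UTF-8 bytes that are not valid when interpreted as cp1252 (0x81 is unmapped).
raw = b"\xc5\x81ukasz Rogalski"

# errors="replace": undecodable bytes become U+FFFD, similar in spirit to
# git printing placeholders instead of failing.
replaced = raw.decode("cp1252", errors="replace")

# errors="strict" (the default): the first undecodable byte raises, which is
# the hard-crash behavior gitlint keeps for now.
try:
    raw.decode("cp1252", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ascii(replaced))
print(strict_failed)
```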
The plot thickens!
In my previous comment, I really only considered reads of unicode characters from the git command output, i.e. the cause of the crash in the original description. That issue is mostly solved by c939a0d913d96fbc308399dfaa931740a0db2684 (although I need to amend that commit with a small fix).
However, it turns out that writing Unicode characters to the Windows console is a beast of its own - see issue1602. In a nutshell, properly and consistently printing Unicode characters to the Windows console is very messy in Python. From what I can gather, manually working around this is complicated.
The good news is that the Click library's click.echo() function (which we already use in part of the codebase) has all the necessary work-arounds baked in.
So what I'm planning to do now is replace all occurrences where we are writing to stdout/stderr directly with the click.echo() function.
The 2 most relevant places:
- In the display module
- Writing a custom LogHandler that uses click.echo() to print (needed to fix unicode characters in log/debug messages).
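Roughly what I have in mind for the log handler, as a Python 3 sketch. The class name EchoLogHandler and the injectable echo parameter are made up for illustration (the injection is only there so the sketch can be tried without Click installed); by default it uses click.echo(), whose err=True argument sends the output to stderr.

```python
import logging


class EchoLogHandler(logging.Handler):
    """Hypothetical handler that prints log records through click.echo()."""

    def __init__(self, echo=None):
        super().__init__()
        if echo is None:
            # Imported lazily so the class can be exercised with a stand-in.
            import click
            echo = click.echo
        self._echo = echo

    def emit(self, record):
        try:
            # Let click deal with the console encoding; log to stderr.
            self._echo(self.format(record), err=True)
        except Exception:
            # Standard logging convention: never let logging crash the app.
            self.handleError(record)
```

Attaching it would look like logging.getLogger("gitlint").addHandler(EchoLogHandler()), assuming the logger name.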
Hopefully this will work!
Quick example of something that doesn't work as expected yet: Try the following on Python 2.7 on Windows (this does work on Python 3.x).
echo WIP: tëst | gitlint
This will crash gitlint on a unicode detection error.
Is this the right place to report a bug?
When I install the gitlint hook via pre-commit on Linux it works normally, but on Windows it fails with this error:
gitlint..................................................................Failed
- hook id: gitlint
- exit code: 1
Traceback (most recent call last):
File "C:\Users\metya\scoop\apps\miniconda3\4.7.12.1\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\metya\scoop\apps\miniconda3\4.7.12.1\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\Scripts\gitlint.EXE\__main__.py", line 7, in <module>
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 717, in main
rv = self.invoke(ctx)
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 1114, in invoke
return Command.invoke(self, ctx)
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 555, in invoke
return callback(*args, **kwargs)
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\cli.py", line 204, in cli
log_system_info()
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\cli.py", line 51, in log_system_info
LOG.debug("Git version: %s", git_version())
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\git.py", line 57, in git_version
return _git("--version").replace(u"\n", u"")
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\git.py", line 33, in _git
result = sh.git(*command_parts, **git_kwargs) # pylint: disable=unexpected-keyword-arg
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\shell.py", line 45, in git
return _exec(*args, **kwargs)
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\shell.py", line 65, in _exec
stdout = ustr(result[0])
File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\utils.py", line 88, in ustr
return obj.decode(DEFAULT_ENCODING)
LookupError: unknown encoding: C
Even if the commit message is perfectly valid.
@metya, can you try setting LC_ALL to UTF-8 and let me know if that works for you?
# Regular windows CMD
Set LC_ALL=UTF-8
# git-bash/Cygwin
export LC_ALL=UTF-8
# Now try again
This happens because git sets LC_CTYPE=C on Windows when invoking /bin/sh, and Python has no codec named C. I've added a workaround in #158, adding some logic for gitlint to more smartly fall back to UTF-8. This will go out as part of the 0.14.0 release within the next month.
This doesn't solve all unicode issues on Windows (I've spent more time on it but no silver bullets...yet), but hopefully it should keep gitlint from crashing.
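The idea behind the fallback, as a minimal Python 3 sketch (not the exact code from #158; the function name safe_encoding is made up): check whether Python actually knows the requested encoding with codecs.lookup() and use UTF-8 if it doesn't.

```python
import codecs


def safe_encoding(name, fallback="UTF-8"):
    """Hypothetical helper: return name if Python has that codec, else fallback."""
    try:
        codecs.lookup(name)
    except LookupError:
        # e.g. "C", as set via LC_CTYPE=C, is a locale name, not a codec.
        return fallback
    return name


# With LC_CTYPE=C the naive choice would be "C"; this falls back instead.
version_output = b"git version 2.25.0\n".decode(safe_encoding("C"))
print(version_output.strip())
```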