gitinspector
UTF-8 characters in author names lead to a crash
Hi,
We had author names containing German letters like 'ö'. They terminated the application on text output.
To correct the bug, I've changed line 151 in changesoutput.py to the following line:
print(str(i.encode(sys.stdout.encoding, errors='replace')).ljust(20), end=" ")
and imported sys.
HTH, Chris
Hi.
What encoding is the terminal set to? What version of Python are you using? What exception are you getting?
You are doing errors="replace" here which works around the problem and doesn't really "solve" it. I suspect your terminal is not really set to UTF-8 and this is the actual reason for your issues.
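To make the difference between working around and solving the problem concrete, here is a minimal sketch (Python 3 syntax, while the thread concerns Python 2.7). The helper `can_encode` and the sample name are hypothetical and not part of gitinspector. It shows that strict encoding fails when the target encoding lacks the character, while `errors="replace"` silently substitutes a `?`.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)  # strict mode raises on unrepresentable characters
        return True
    except UnicodeEncodeError:
        return False


name = "J\u00f6rg"  # hypothetical author name containing 'ö'

print(can_encode(name, "utf-8"))           # True
print(can_encode(name, "iso-8859-1"))      # True, Latin-1 contains 'ö'
print(can_encode(name, "ascii"))           # False
print(can_encode("\u0153", "iso-8859-1"))  # False, 'œ' is outside Latin-1

# errors="replace" avoids the exception but loses the character.
replaced = name.encode("ascii", errors="replace")
print(replaced)  # b'J?rg'
```

The output stays printable, but the name is no longer correct, which is the concern raised above.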
Hi, sorry for late reply.
Encoding:
$ python -c "import sys; print(sys.stdout.encoding)"
ISO-8859-1
Python version:
$ python --version
Python 2.7.6
Error message:
Traceback (most recent call last):
File "/home/user/Downloads/gitinspector-master/gitinspector.py", line 24, in
With the following settings the right character still does not appear, but it works:
$ export LC_ALL=de_DE.utf8
$ export LANG="$LC_ALL"
Thanks, Chris
I have the same issue, even though I have a UTF-8 locale.
$ gitinspector --version
...
...
raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8
I would say that this could solve it.
Add to ~/.bash_profile
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
or whatever preference you want. Seems just a generic issue with Python on Mac.
@CFoltin Hi.
Sorry. I forgot about this issue. In any case - it's completely normal. If your terminal can't support the character, python has no way of outputting it.
The export should do the trick though. But maybe something is still not set to UTF-8. You can try setting PYTHONIOENCODING to utf8 or redirecting to a file - in which case these problems should never occur.
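As a rough illustration of the PYTHONIOENCODING suggestion, the sketch below starts a child Python interpreter with that variable set and checks what the child reports. It assumes Python 3 and uses a throwaway one-line child program. Nothing here is gitinspector code.

```python
import os
import subprocess
import sys

# Child program: report its stdout encoding, then print an 'œ'.
child_code = "import sys; print(sys.stdout.encoding); print('\\u0153')"

# Copy the current environment and override the I/O encoding for the child only.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    stdout=subprocess.PIPE,
    check=True,
)

# The child wrote UTF-8 bytes regardless of the locale it inherited.
lines = result.stdout.decode("utf-8").splitlines()
print(lines)
```

Setting the variable in the shell before running the tool has the same effect on the tool's own process.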
Ironically I just hit this same issue today. I don't think it's "invalid", tbh.
imo gitinspector shouldn't crash just because it hits an odd character in git metadata...
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/gitinspector/blame.py", line 113, in run
    self.handle_blamechunk_content(row)
  File "/usr/local/lib/python2.7/dist-packages/gitinspector/blame.py", line 81, in handle_blamechunk_content
    author = self.changes.get_latest_author_by_email(self.blamechunk_email)
  File "/usr/local/lib/python2.7/dist-packages/gitinspector/changes.py", line 186, in get_latest_author_by_email
    name = name.decode("unicode_escape", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0153' in position 15: ordinal not in range(128)
@devcurmudgeon
There are only a few options here.
- Ignore it and replace all characters that can not be outputted - this is something I'd rather not do, as the output will be invalid.
- Catch it and print out exactly the same error message ;)... Or a similar one. No point in that.
- Catch it, inform the user and output it with invalid characters replaced. Problem is that this will garble the output in the terminal (even if you use stderr for the warning).
This has been discussed so many times before (and not only in this project, mind you). I think it's better to just leave it, as these exceptions are very informative in Python - it's also a common issue. If you plan on outputting Unicode characters, you had best have a terminal set up to handle them. In your case, it's configured for ASCII output and you are trying to output an œ character.
We actually have the following function,
https://github.com/ejwa/gitinspector/blob/6d77989e341e043c9a7f09757000d75701b32d84/gitinspector/terminal.py#L128
This warns on mis-configured terminals that return "None" as encoding. However, it does not warn on ascii. Ascii may actually be OK if you happen to have a repo that only outputs standard ascii characters when you run gitinspector.
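For readers who want to see what such a check might look like, here is a hypothetical sketch. It is not the linked gitinspector function, and unlike that function it also flags ASCII. The name `terminal_encoding_warning` and the message texts are assumptions made for illustration.

```python
import codecs


def terminal_encoding_warning(encoding):
    """Return a warning string for a risky terminal encoding, else None."""
    if encoding is None:
        return "terminal reports no encoding; non-ASCII output will likely fail"
    try:
        # Normalise aliases such as "US-ASCII" to their canonical codec name.
        canonical = codecs.lookup(encoding).name
    except LookupError:
        return "terminal reports unknown encoding %r" % encoding
    if canonical == "ascii":
        return "terminal is ASCII-only; non-ASCII author names cannot be printed"
    return None


for candidate in (None, "US-ASCII", "UTF-8"):
    print(candidate, "->", terminal_encoding_warning(candidate))
```

In practice the argument would be `sys.stdout.encoding`. Whether warning on ASCII is desirable is exactly the trade-off described above, since an ASCII terminal is fine for a repository that only contains ASCII names.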
Hi, this problem cost me about 3 hours to fix. So I would propose improving the user experience. Catching the error and printing a remedy would be much better than leaving the user with a stack trace, which in my case appeared only after the ~20 minutes the tool needed to analyze the repo.
Just, my 2 cents.
BTW: In any case, this error message seems to be found....
@CFoltin i'm with you. I was thinking of using gitinspector as part of a CI pipeline that builds a custom Linux distro ... approximately 700 repos. it worked well on some samples... but then crashed our pipeline, on a linux-api-headers repo, a couple of hours into the run.
Upstream folks obviously can make their own choices about what they want to fix, but as a user I'm not interested in informative python stack-traces, I just want working software :-)
Note - i've taken plenty of heat from users moaning about stack traces in my own projects :-)
@adam-waldenberg thanks for your reply. I may be wrong but I think you missed at least a couple of options:
- force utf-8 if the environment is not set
- skip any data that you can't handle but still generate results for all the rest of the data
Crashing a whole run because of an unexpected (but valid) character in a git repo's metadata doesn't seem like correct behaviour to me
@devcurmudgeon UTF-8 is forced on redirection. I won't be forcing UTF-8 on terminal output, because it's not always needed. Output also needs to work on other environments with extended UTF-8, UTF-16 etc. Strictly speaking, it's only author names (and sometimes filenames) that can be an issue. Again, skipping data would mean you get an invalid output, which is not an option either.
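The policy of forcing UTF-8 only on redirection can be sketched roughly as follows. This is an illustration in Python 3 under assumed names (`choose_output_encoding` is hypothetical), not gitinspector's actual implementation. An in-memory byte buffer stands in for a redirected stream.

```python
import io


def choose_output_encoding(stream):
    """Pick UTF-8 for redirected output, else trust the terminal's setting."""
    if not stream.isatty():
        return "utf-8"
    return stream.encoding  # may be None on a mis-configured terminal


# A BytesIO is not a terminal, so it behaves like a file or pipe here.
buffer = io.BytesIO()
encoding = choose_output_encoding(buffer)

# newline="\n" keeps the written bytes identical on every platform.
writer = io.TextIOWrapper(buffer, encoding=encoding, newline="\n")
writer.write("\u0153\n")
writer.flush()

print(encoding)
print(buffer.getvalue())
```

For a real terminal the function returns whatever the terminal reports, so the encoding problem can still occur there.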
@CFoltin You can only catch it once you encounter it, so even if you catch it, it would still take time before you know about it. Also, depending on what is wrong with the environment, there are a number of fixes that may or may not work.
In the end, it comes down to the fact that you can't know for sure what character set you may encounter in the repository. It can even be several ones.
One option I can see that I could live with is to catch it and print it out with replaced characters... We could then add a disclaimer at the end of the output stating that the output is not 100% correct and that it had to be modified in order to accommodate the terminal charset. However, I'm afraid it would raise even more questions, as you no longer have the Python exception to search on. Alternatively, the first/last exception encountered could also be included in the disclaimer.
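A minimal sketch of that idea, assuming Python 3 and a hypothetical `LossyWriter` class (not something that exists in gitinspector): unencodable characters are replaced, the first exception is remembered, and a disclaimer quoting it is written at the end so the original error text remains searchable.

```python
import io


class LossyWriter:
    """Write lines in a target encoding, replacing characters it lacks."""

    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoding = encoding
        self.first_error = None  # first UnicodeEncodeError seen, if any

    def write_line(self, text):
        try:
            text.encode(self.encoding)  # strict check only, result discarded
        except UnicodeEncodeError as error:
            if self.first_error is None:
                self.first_error = error
            # Round-trip with errors="replace" to substitute '?' characters.
            text = text.encode(self.encoding, errors="replace").decode(self.encoding)
        self.stream.write(text + "\n")

    def write_disclaimer(self):
        if self.first_error is not None:
            self.stream.write(
                "Note: some characters were replaced to fit the terminal "
                "charset. First error: %s\n" % self.first_error
            )


out = io.StringIO()  # stands in for an ASCII-only terminal
writer = LossyWriter(out, "ascii")
writer.write_line("C\u0153ur")  # hypothetical author name containing 'œ'
writer.write_line("Plain Name")
writer.write_disclaimer()
report = out.getvalue().splitlines()
print("\n".join(report))
```

Lines that encode cleanly pass through unchanged, so the disclaimer only appears when something was actually altered.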
I have decided to catch this exception and let the error message point to some of the issues here on the project page. This should help people who run into this problem understand and remedy it more effectively.
I have received a similar error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u011f' in position 380: character maps to <undefined>
Found the solution: https://stackoverflow.com/a/57134096/1959766
@banbar Thank you. I don't think that Windows-specific solution has been covered anywhere on the issue tracker so far. I know it's more a Python and terminal thing than it is a gitinspector thing, but I'm considering doing a F.A.Q/Wiki with common environment-related issues that can be encountered. Maybe link into the issue tracker etc.
Now and again this (or a related issue) keeps coming up.