commitizen `UnicodeDecodeError` if commit messages contain Unicode characters

Description

If I run

cz changelog

and the commit messages contain Unicode characters like 🤦🏻‍♂️ (whose UTF-8 encoding begins with the eight bytes \xf0\x9f\xa4\xa6 \xf0\x9f\x8f\xbb) then I get the following traceback

Traceback (most recent call last):
  File "/.../.venv/bin/cz", line 8, in <module>
    sys.exit(main())
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cli.py", line 389, in main
    args.func(conf, vars(args))()
  File "/.../.venv/lib/python3.10/site-packages/commitizen/commands/changelog.py", line 143, in __call__
    commits = git.get_commits(
  File "/.../.venv/lib/python3.10/site-packages/commitizen/git.py", line 98, in get_commits
    c = cmd.run(command)
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cmd.py", line 32, in run
    stdout.decode(chardet.detect(stdout)["encoding"] or "utf-8"),
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1689: character maps to <undefined>

The result of chardet.detect() here

https://github.com/commitizen-tools/commitizen/blob/2ff9f155435b487057ce5bd8e32e1ab02fd57c94/commitizen/cmd.py#L26

is:

{'encoding': 'Windows-1254', 'confidence': 0.6864215607255395, 'language': 'Turkish'}

That is an interesting character encoding prediction with a low confidence. Based on it, commitizen picks the incorrect codec, and decoding the bytes then fails. Using decode("utf-8") works fine. It looks like issue https://github.com/chardet/chardet/issues/148 is related to this.
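To make the failure concrete, here is a minimal sketch using only the standard library. It encodes the first two code points of the emoji above as UTF-8 and then tries both codecs. The helper name `decodes_with` is made up for this illustration and is not part of commitizen.

```python
# UTF-8 bytes of U+1F926 (face palm) followed by U+1F3FB (skin tone modifier).
raw = "\U0001F926\U0001F3FB".encode("utf-8")


def decodes_with(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes strictly with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(raw)                          # the eight bytes quoted above
print(decodes_with(raw, "utf-8"))   # the encoding the bytes were written in
print(decodes_with(raw, "cp1254"))  # the codec chardet guessed (Windows-1254)
```

The byte 0x8f has no mapping in cp1254, which matches the `'charmap' codec can't decode byte 0x8f` message in the traceback.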

I think the fix would be something like this to replace these lines of code:

stdout, stderr = process.communicate()
return_code = process.returncode
try:
    stdout_s = stdout.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stdout)  # Final result of the UniversalDetector’s prediction.
    # Consider checking confidence value of the result?
    stdout_s = stdout.decode(result["encoding"])
try:
    stderr_s = stderr.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stderr)  # Final result of the UniversalDetector’s prediction.
    # Consider checking confidence value of the result?
    stderr_s = stderr.decode(result["encoding"])
return Command(stdout_s, stderr_s, stdout, stderr, return_code)

Steps to reproduce

Well, I suppose you can add a few commits to a local branch and go crazy with lots of text and funky Unicode characters (emojis with skin tones, flags, etc.), and then attempt to create a changelog.

Current behavior

cz throws an exception.

Desired behavior

cz creates a changelog.

Screenshots

No response

Environment

> cz version
2.29.3
> python --version
Python 3.10.5
> uname -a
Darwin pooh 18.7.0 Darwin Kernel Version 18.7.0: Mon Feb 10 21:08:45 PST 2020; root:xnu-4903.278.28~1/RELEASE_X86_64 x86_64 i386 Darwin

Aug 03 '22 10:08 jenstroeger

Related to https://github.com/commitizen-tools/commitizen/pull/522

Introducing chardet is creating some problems. I think if the confidence is not high enough (below ~0.85), maybe we should propagate the error.

@jenstroeger do you think you can provide a PR to fix this?

Aug 03 '22 11:08 woile

@Woile a PR using the code above? If you think that’d be useful then yes.

Aug 03 '22 11:08 jenstroeger

Yes, I think the unicode tries could be encapsulated in a function try_decode so we keep the logic there. But the code looks good 👍🏻
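A rough sketch of what such a `try_decode` helper could look like, combining the UTF-8-first idea with the confidence threshold mentioned earlier. This is an illustration under assumptions, not the code that ended up in commitizen: the `Detector` type, the `min_confidence` parameter, and its default of 0.85 are hypothetical, and the detector is passed in as a callable so the sketch runs without any third-party package.

```python
from typing import Callable, Optional, Tuple

# Hypothetical detector type: takes raw bytes and returns
# (encoding name or None, confidence between 0.0 and 1.0).
Detector = Callable[[bytes], Tuple[Optional[str], float]]


def try_decode(data: bytes, detect: Detector, min_confidence: float = 0.85) -> str:
    """Decode as UTF-8 first; consult the detector only if that fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        encoding, confidence = detect(data)
        if encoding is None or confidence < min_confidence:
            # The guess is not trustworthy, so re-raise the UTF-8 error.
            raise
        return data.decode(encoding)


# Stand-in detector that mimics the low-confidence guess from this issue.
def guesses_turkish(_: bytes) -> Tuple[Optional[str], float]:
    return ("cp1254", 0.69)


# Valid UTF-8 never reaches the detector, so the bad guess is harmless.
print(try_decode("\U0001F926\U0001F3FB".encode("utf-8"), guesses_turkish))
```

With chardet, the detector could presumably be a small wrapper that reads the `encoding` and `confidence` keys from the dictionary returned by `chardet.detect()`, as shown in the result above.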

Aug 03 '22 11:08 woile

I just ran into the same UnicodeDecodeError issue, but with an invisible width character (0x9d) that had been copied into the commit message when merging a PR (https://github.com/KyleKing/recipes/commit/b29291c045d9094bd2f33b4bfbca3cfdf4478f99)

I'm not sure if the change would work, but it might be good to add a test case for STDOUT like: bytes([0x73, 0xe2, 0x80, 0x9d]). When I tested locally, I had to set replace (stdout.decode(..., errors="replace"))

>>> chardet.detect(bytes([0x73, 0xe2, 0x80, 0x9d]))
{'encoding': 'utf-8', 'confidence': 0.505, 'language': ''}

Full Traceback

> poetry run cz changelog
Traceback (most recent call last):
  File "/Users/kyleking/Developer/recipes/.venv/bin/cz", line 8, in <module>
    sys.exit(main())
    │        └ 
    └ 
  File "/Users/kyleking/Developer/recipes/.venv/lib/python3.9/site-packages/commitizen/cli.py", line 389, in main
    args.func(conf, vars(args))()
    │         │          └ Namespace(debug=False, name=None, no_raise=None, dry_run=False, file_name=None, unreleased_version=None, incremental=False, rev_...
    │         └ 
    └ Namespace(debug=False, name=None, no_raise=None, dry_run=False, file_name=None, unreleased_version=None, incremental=False, rev_...
  File "/Users/kyleking/Developer/recipes/.venv/lib/python3.9/site-packages/commitizen/commands/changelog.py", line 143, in __call__
    commits = git.get_commits(
  File "/Users/kyleking/Developer/recipes/.venv/lib/python3.9/site-packages/commitizen/git.py", line 98, in get_commits
    c = cmd.run(command)
        │       └ 'git -c log.showSignature=False log --pretty=%H%n%s%n%an%n%ae%n%b----------commit-delimiter---------- --author-date-order '
        └ 
  File "/Users/kyleking/Developer/recipes/.venv/lib/python3.9/site-packages/commitizen/cmd.py", line 34, in run
    stdout.decode(_determine_encoding(stdout)),
    │             │                   └ b"2f50e7ede9997647127c19e4974282e239af3507\nbuild: bump calcipy and regenerate code tags\nKyle King\[email protected]...
    │             └ 
    └ b"2f50e7ede9997647127c19e4974282e239af3507\nbuild: bump calcipy and regenerate code tags\nKyle King\[email protected]...
  File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
           │                     │     │      └ '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$...
           │                     │     └ 'strict'
           │                     └ 
           └ 
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 32110: character maps to <undefined>

Aug 05 '22 01:08 KyleKing

@KyleKing actually…

> I'm not sure if the change would work, but it might be good to add a test case for STDOUT like: bytes([0x73, 0xe2, 0x80, 0x9d]).

That byte sequence encodes two characters:

>>> bytes([0x73, 0xe2, 0x80, 0x9d]).decode()
's”'

where the bytes e2 80 9d are the UTF-8 encoding of ” (U+201D, or “RIGHT DOUBLE QUOTATION MARK”). Injecting U+FFFD whenever a byte sequence can’t be decoded, by using the replace error handler, may have undesired consequences:

>>> bytes([0x73, 0xe2, 0x80]).decode(encoding="utf8", errors="replace")
's�'

which may be confusing to people—my personal preference would be failure.

Judging from your stacktrace, I see a

  File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
           │                     │     │      └ '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$...
           │                     │     └ 'strict'
           │                     └ 
           └ 
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 32110: character maps to <undefined>

which indicates that decode() tries to interpret your bytes as a Windows-1254 encoded string (note the encodings/cp1254.py where the exception originates), and that fails because it’s a UTF-8 encoded string:

>>> bytes([0x73, 0xe2, 0x80, 0x9d]).decode(encoding="utf-8")
's”'
>>> bytes([0x73, 0xe2, 0x80, 0x9d]).decode(encoding="windows-1254")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3: character maps to <undefined>

Take a look at https://github.com/chardet/chardet/issues/148 for some more details, and at PR #545 which addresses this very issue (i.e. mispredicting a bytes object as Windows 1254 encoded string, instead of a UTF-8 encoded string).

Aug 05 '22 02:08 jenstroeger

Thanks for taking the time to write such a detailed reply. You're right, and I look forward to the more robust logic!

Aug 05 '22 03:08 KyleKing

This PR should fix it. Can you try it and see whether the UnicodeDecodeError still happens?

Aug 05 '22 15:08 gpongelli

@Woile is it reasonable to add a test with a single emoji character as changelog?

Aug 06 '22 12:08 gpongelli

> @Woile is it reasonable to add a test with a single emoji character as changelog?

You mean someone creates a commit with only an emoji character, which is later added to the changelog, and whether we should test this behavior? If so, I think we could do that. Adding tests for edge cases won't do any harm.

Aug 07 '22 02:08 Lee-W

> @Woile is it reasonable to add a test with a single emoji character as changelog?

> You mean someone creates a commit with only an emoji character, which is later added to the changelog, and whether we should test this behavior? If so, I think we could do that. Adding tests for edge cases won't do any harm.

Yes, I mean that. It’s the case of this bug.

Aug 07 '22 06:08 gpongelli

@gpongelli this issue isn’t really a bug and it’s not about emojis.

The problem in this issue is about a sequence of bytes which contains UTF-8 encoded text, but the bytes’ encoding is mispredicted as Windows 1254 encoding. Based on that misprediction commitizen picks the incorrect codec to decode/interpret the bytes and that fails.

Thus, my proposed solution is to try to decode the bytes using the UTF-8 codec first because that’s the common text encoding across platforms these days. Only if that fails, invoke some statistical analysis (e.g. chardet) to predict the text encoding (see also chardet FAQ).

Python encodes text as UTF-8 by default, but it also provides a large number of other text codecs you should consider when testing. I think, though, that UTF-8 is the common default encoding these days on many platforms.
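For testing along those lines, a small sketch like the following could be a starting point. It encodes sample strings with a few of the standard library's codecs and checks which results pass strict UTF-8 decoding. The sample strings and the helper name `is_strict_utf8` are made up for illustration. Non-ASCII text in a legacy encoding usually fails strict UTF-8 decoding, which is why trying UTF-8 first is a cheap check, but this is not guaranteed for every input (pure ASCII, for instance, is valid in all of these codecs).

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative samples: codec name -> text containing non-ASCII characters.
SAMPLES = {
    "utf-8": "T\u00fcrk\u00e7e \U0001F926",   # "Türkçe" plus an emoji
    "cp1254": "T\u00fcrk\u00e7e",             # Windows-1254 (Turkish)
    "latin-1": "caf\u00e9",                   # "café"
    "shift_jis": "\u65e5\u672c\u8a9e",        # "日本語"
}

results = {
    codec: is_strict_utf8(text.encode(codec)) for codec, text in SAMPLES.items()
}

for codec, ok in results.items():
    print(f"{codec:10} valid UTF-8: {ok}")
```

Only the bytes produced by the UTF-8 codec pass; the others are the cases where a fallback such as statistical detection would come into play.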

Aug 07 '22 07:08 jenstroeger

As per PR https://github.com/commitizen-tools/commitizen/issues/545 I rebased my branch. Do as you wish with this fix.

diff --git a/commitizen/cmd.py b/commitizen/cmd.py
index 7f4efb6..71fbe8f 100644
--- a/commitizen/cmd.py
+++ b/commitizen/cmd.py
@@ -3,6 +3,8 @@ from typing import NamedTuple
 
 from charset_normalizer import from_bytes
 
+from commitizen.exceptions import CharacterSetDecodeError
+
 
 class Command(NamedTuple):
     out: str
@@ -12,6 +14,17 @@ class Command(NamedTuple):
     return_code: int
 
 
+def _try_decode(bytes_: bytes) -> str:
+    try:
+        return bytes_.decode("utf-8")
+    except UnicodeDecodeError:
+        charset_match = from_bytes(bytes_).best()
+        try:
+            return bytes_.decode(charset_match.encoding)
+        except UnicodeDecodeError as e:
+            raise CharacterSetDecodeError() from e
+
+
 def run(cmd: str) -> Command:
     process = subprocess.Popen(
         cmd,
@@ -23,8 +36,8 @@ def run(cmd: str) -> Command:
     stdout, stderr = process.communicate()
     return_code = process.returncode
     return Command(
-        str(from_bytes(stdout).best()),
-        str(from_bytes(stderr).best()),
+        _try_decode(stdout),
+        _try_decode(stderr),
         stdout,
         stderr,
         return_code,
diff --git a/commitizen/exceptions.py b/commitizen/exceptions.py
index a95ab3b..16869b5 100644
--- a/commitizen/exceptions.py
+++ b/commitizen/exceptions.py
@@ -26,6 +26,7 @@ class ExitCode(enum.IntEnum):
     INVALID_CONFIGURATION = 19
     NOT_ALLOWED = 20
     NO_INCREMENT = 21
+    UNRECOGNIZED_CHARACTERSET_ENCODING = 22
 
 
 class CommitizenException(Exception):
@@ -148,3 +149,7 @@ class InvalidConfigurationError(CommitizenException):
 
 class NotAllowed(CommitizenException):
     exit_code = ExitCode.NOT_ALLOWED
+
+
+class CharacterSetDecodeError(CommitizenException):
+    exit_code = ExitCode.UNRECOGNIZED_CHARACTERSET_ENCODING
diff --git a/docs/exit_codes.md b/docs/exit_codes.md
index f4c2fa8..a0448a2 100644
--- a/docs/exit_codes.md
+++ b/docs/exit_codes.md
@@ -28,4 +28,5 @@ These exit codes can be found in `commitizen/exceptions.py::ExitCode`.
 | InvalidCommandArgumentError | 18        | The argument provide to command is invalid (e.g. `cz check -commit-msg-file filename --rev-range master..`) |
 | InvalidConfigurationError   | 19        | An error was found in the Commitizen Configuration, such as duplicates in `change_type_order`               |
 | NotAllowed                  | 20        | `--incremental` cannot be combined with a `rev_range`                                                       |
-| NoneIncrementExit           | 21        | The commits found are not elegible to be bumped                                                             |
+| NoneIncrementExit           | 21        | The commits found are not eligible to be bumped                                                             |
+| CharacterSetDecodeError     | 22        | The character encoding of commits could not be determined                                                   |

And for laughs the above examples:

>>> charset_normalizer.from_bytes(b'').best().encoding
'utf_8'
>>> charset_normalizer.from_bytes(bytes([0x73, 0xe2, 0x80, 0x9d])).best().encoding
'cp775'  # Mispredicted valid utf-8
>>> charset_normalizer.from_bytes(bytes([0x73, 0xe2, 0x80])).best().encoding
'cp037'

Aug 07 '22 09:08 jenstroeger

I have a similar issue, but when committing. I use several pre-commit hooks (flake8, mypy, isort and black), and black outputs a message with emojis (All done! ✨ 🍰 ✨) when a file is reformatted. Commitizen breaks on this with the decoding error...

Traceback (most recent call last):
 File "/Users/aleksandra/Projects/Metamaze/metamaze-ml/venv/venv-layoutlm/bin/cz", line 8, in <module>
   sys.exit(main())
 File "/Users/aleksandra/Projects/Metamaze/metamaze-ml/venv/venv-layoutlm/lib/python3.8/site-packages/commitizen/cli.py", line 389, in main
   args.func(conf, vars(args))()
 File "/Users/aleksandra/Projects/Metamaze/metamaze-ml/venv/venv-layoutlm/lib/python3.8/site-packages/commitizen/commands/commit.py", line 86, in __call__
   c = git.commit(m)
 File "/Users/aleksandra/Projects/Metamaze/metamaze-ml/venv/venv-layoutlm/lib/python3.8/site-packages/commitizen/git.py", line 76, in commit
   c = cmd.run(f"git commit {args} -F {f.name}")
 File "/Users/aleksandra/Projects/Metamaze/metamaze-ml/venv/venv-layoutlm/lib/python3.8/site-packages/commitizen/cmd.py", line 27, in run
   stderr.decode(chardet.detect(stderr)["encoding"] or "utf-8"),
 File "/usr/local/bin/../../../Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/encodings/cp1254.py", line 15, in decode
   return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 414: character maps to <undefined>

I can only use cz c when black does not detect any issues. It's a bit annoying because I cannot take full advantage of commitizen: I basically have to commit twice to avoid black interfering with commitizen.

Aug 09 '22 10:08 alvercau

@alvercau can you please reformat the output as distinct code block using ``` instead of inlining? That traceback is very hard to read.

However, judging from the last line:

  File "/usr/local/bin/../../../Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 414: character maps to <undefined>

I’m confused, though, how black’s output ends up in your commit message?

Aug 09 '22 10:08 jenstroeger

@jenstroeger it does not end up in the commit message, it ends up in stderr, which is decoded by commitizen.

Aug 09 '22 11:08 alvercau

> @jenstroeger it does not end up in the commit message, it ends up in stderr, which is decoded by commitizen.

Oh… running cz commit runs git commit, which runs the git hooks, which run black, which writes to stderr, which cz captures and can’t decode. Gotcha 👍🏼
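That chain can be imitated with the standard library alone. In this sketch a child Python process stands in for the hook and writes black's message, as UTF-8 bytes, to stderr; the parent captures the raw bytes the way a wrapper around git would. The `CHILD` snippet and variable names are made up for illustration and are not commitizen code.

```python
import subprocess
import sys

# Child process standing in for a hook: writes "All done! ✨ 🍰 ✨" to stderr
# as raw UTF-8 bytes (the raw string keeps the escapes for the child to parse).
CHILD = r'''import sys; sys.stderr.buffer.write(b"All done! \xe2\x9c\xa8 \xf0\x9f\x8d\xb0 \xe2\x9c\xa8\n")'''

result = subprocess.run([sys.executable, "-c", CHILD], capture_output=True)
captured = result.stderr  # bytes, not yet decoded

# Decoding as UTF-8 works.
print(captured.decode("utf-8"))

# Decoding with the mispredicted codec fails, as in the traceback above.
try:
    captured.decode("cp1254")
except UnicodeDecodeError as error:
    print(f"cp1254 failed on byte {captured[error.start]:#x}")
```

The cake emoji contains the byte 0x8d, which cp1254 does not map, consistent with the `can't decode byte 0x8d` error reported above.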

PR #552 should fix that.

Aug 09 '22 11:08 jenstroeger

@alvercau v2.29.6 just shipped with the fix.

Aug 13 '22 08:08 jenstroeger