Python 3 port uses Unicode to represent byte strings
pygit2, when built for Python 3, treats paths as Unicode and will fail if a path isn't decodable as the filesystem encoding. But Git paths are byte strings, not Unicode strings. This includes refs, so repos with branch names containing non-UTF-8 sequences are completely unusable on most systems:
>>> repo.listall_references()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 14: invalid start byte
pygit2's behaviour under Python 2 is correct; listall_references and other APIs returns byte strings as they are in the Git model. I don't see why the behaviour should differ by Python version, as both types exist in both languages. It seems to me that the default low-level API should return byte strings to match the underlying model and handle all cases, and convenience wrappers which return Unicode strings could be added if people actually want them. As it stands, some perfectly valid Git repos are unusable except on Python 2.
The most reasonable solution I can think of that retains backward compatibility is to add encoding and errors properties to Repository, defaulting to the filesystem encoding and "strict" but overridable (even to None).
Hi,
I can confirm this issue. This also happens for mis-encoded metadata in commits, like commit dcb71129841e5821c0cbbdd4017a6f202f180108 in the Linux kernel (look at the author name):
Reconstruction:
import pygit2
repo = pygit2.Repository('.')
commit = repo['dcb71129841e5821c0cbbdd4017a6f202f180108']
commit.author.name
Raises:
In [7]: commit.author.name
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-7-e2f11dcffc49> in <module>()
----> 1 commit.author.name
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 10: invalid start byte
`
My local workaround is to use the raw members of the classes, and then let them through:
def fix_encoding(string):
try:
string = string.decode('utf-8')
except:
string = string.decode('iso8859')
return string
At least for paths, to convert C string into Python str object, it can be used PyUnicode_DecodeFSDefault() as to_path() in src/utils.h (, and I'm happy if its errors handler is "surrogateescape").
Also, it is natulal if conversion from path represented in Python bytes object to Python str object, e.g. in Repository.__init__() is done by os.fsdecode()
In the latest release we're using PyUnicode_DecodeFSDefault() and PyUnicode_EncodeFSDefault()
It remains to use os.fsdecode() (PRs welcome).
While I tried to blame a file, which file name is not valid UTF-8 sequence in bytes, on environment sys.getfilesystemencoding() == 'utf-8', by passing str filename encoded with encoding='utf-8' and errors='surrogateescape' to pygit2.Repository.blame(), I got UnicodeEncodeError. This can be resolved by modification like:
diff --git a/pygit2/repository.py b/pygit2/repository.py
index 509b4d0..1d4c573 100644
--- a/pygit2/repository.py
+++ b/pygit2/repository.py
@@ -27,6 +27,7 @@ import tarfile
import warnings
from collections.abc import Callable, Iterator
from io import BytesIO
+from os import fsencode
from pathlib import Path
from string import hexdigits
from time import time
@@ -717,7 +718,7 @@ class BaseRepository(_Repository):
options.max_line = max_line
cblame = ffi.new('git_blame **')
- err = C.git_blame_file(cblame, self._repo, to_bytes(path), options)
+ err = C.git_blame_file(cblame, self._repo, fsencode(path), options)
check_error(err)
return Blame._from_c(self, cblame[0])
I think there are some other places in the file that same fix is needed, but I only encountered in the case of blame() in my usage, so I point out this only.