pygit2 icon indicating copy to clipboard operation
pygit2 copied to clipboard

Python 3 port uses Unicode to represent byte strings

Open wgrant opened this issue 10 years ago • 6 comments

pygit2, when built for Python 3, treats paths as Unicode and will fail if a path isn't decodable as the filesystem encoding. But Git paths are byte strings, not Unicode strings. This includes refs, so repos with branch names containing non-UTF-8 sequences are completely unusable on most systems:

>>> repo.listall_references()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 14: invalid start byte

pygit2's behaviour under Python 2 is correct; listall_references and other APIs returns byte strings as they are in the Git model. I don't see why the behaviour should differ by Python version, as both types exist in both languages. It seems to me that the default low-level API should return byte strings to match the underlying model and handle all cases, and convenience wrappers which return Unicode strings could be added if people actually want them. As it stands, some perfectly valid Git repos are unusable except on Python 2.

wgrant avatar Jun 17 '15 23:06 wgrant

The most reasonable solution I can think of that retains backward compatibility is to add encoding and errors properties to Repository, defaulting to the filesystem encoding and "strict" but overridable (even to None).

wgrant avatar Jun 18 '15 00:06 wgrant

Hi,

I can confirm this issue. This also happens for mis-encoded metadata in commits, like commit dcb71129841e5821c0cbbdd4017a6f202f180108 in the Linux kernel (look at the author name):

Reconstruction:

import pygit2
repo = pygit2.Repository('.')
commit = repo['dcb71129841e5821c0cbbdd4017a6f202f180108']
commit.author.name

Raises:

In [7]: commit.author.name
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-7-e2f11dcffc49> in <module>()
----> 1 commit.author.name

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 10: invalid start byte

`

rralf avatar Jul 27 '18 11:07 rralf

My local workaround is to use the raw members of the classes, and then let them through:

def fix_encoding(string):
    try:
        string = string.decode('utf-8')
    except:
        string = string.decode('iso8859')
    return string

rralf avatar Jul 27 '18 11:07 rralf

At least for paths, to convert C string into Python str object, it can be used PyUnicode_DecodeFSDefault() as to_path() in src/utils.h (, and I'm happy if its errors handler is "surrogateescape").

Also, it is natulal if conversion from path represented in Python bytes object to Python str object, e.g. in Repository.__init__() is done by os.fsdecode()

futatuki avatar Nov 06 '22 19:11 futatuki

In the latest release we're using PyUnicode_DecodeFSDefault() and PyUnicode_EncodeFSDefault() It remains to use os.fsdecode() (PRs welcome).

jdavid avatar May 29 '24 14:05 jdavid

While I tried to blame a file, which file name is not valid UTF-8 sequence in bytes, on environment sys.getfilesystemencoding() == 'utf-8', by passing str filename encoded with encoding='utf-8' and errors='surrogateescape' to pygit2.Repository.blame(), I got UnicodeEncodeError. This can be resolved by modification like:

diff --git a/pygit2/repository.py b/pygit2/repository.py
index 509b4d0..1d4c573 100644
--- a/pygit2/repository.py
+++ b/pygit2/repository.py
@@ -27,6 +27,7 @@ import tarfile
 import warnings
 from collections.abc import Callable, Iterator
 from io import BytesIO
+from os import fsencode
 from pathlib import Path
 from string import hexdigits
 from time import time
@@ -717,7 +718,7 @@ class BaseRepository(_Repository):
             options.max_line = max_line
 
         cblame = ffi.new('git_blame **')
-        err = C.git_blame_file(cblame, self._repo, to_bytes(path), options)
+        err = C.git_blame_file(cblame, self._repo, fsencode(path), options)
         check_error(err)
 
         return Blame._from_c(self, cblame[0])

I think there are some other places in the file that same fix is needed, but I only encountered in the case of blame() in my usage, so I point out this only.

futatuki avatar Aug 26 '25 05:08 futatuki