GitPython
Encoding problem
I'm not sure this is the proper way to use TagReferences, but the behaviour is definitely unexpected. This time I'm using GitPython installed from PyPI.
I have this nice tag:
In [8]: tag
Out[8]: <git.TagReference "refs/tags/PROMOTED_1501131729_MKT15_01_12_QU_1">
I can get a lot of info out of it:
In [9]: tag.object.hexsha
Out[9]: u'dca63c5c7e6aab3cd4934e60230ec3419ab87071'
In [12]: tag.name
Out[12]: 'PROMOTED_1501131729_MKT15_01_12_QU_1'
In [13]: tag.object
Out[13]: <git.TagObject "dca63c5c7e6aab3cd4934e60230ec3419ab87071">
In [14]: tag.ref
TypeError: PROMOTED_1501131729_MKT15_01_12_QU_1 is a detached symbolic reference as it points to 'dca63c5c7e6aab3cd4934e60230ec3419ab87071'
But this fails:
In [15]: tag.commit
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-15-2431a6e80cf9> in <module>()
----> 1 tag.commit
/home/mdione/local/lib/python2.7/site-packages/git/refs/tag.pyc in commit(self)
29 elif obj.type == "tag":
30 # it is a tag object which carries the commit as an object - we can point to anything
---> 31 return obj.object
32 else:
33 raise ValueError("Tag %s points to a Blob or Tree - have never seen that before" % self)
/home/mdione/local/lib/python2.7/site-packages/gitdb/util.pyc in __getattr__(self, attr)
--> 237 self._set_cache_(attr)
238 # will raise in case the cache was not created
239 return object.__getattribute__(self, attr)
/home/mdione/local/lib/python2.7/site-packages/git/objects/tag.pyc in _set_cache_(self, attr)
54 if attr in TagObject.__slots__:
55 ostream = self.repo.odb.stream(self.binsha)
---> 56 lines = ostream.read().decode(defenc).splitlines()
57
58 obj, hexsha = lines[0].split(" ") # object <hexsha>
/usr/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa8 in position 108: invalid start byte
Unfortunately this is happening with an internal repo, and I don't know how to even try to reproduce it with a public one. Meanwhile I can work around it by using tag.object.hexsha, which is what I wanted.
Thanks for posting this issue!
It seems that the TagObject's information can't be decoded because it contains bytes that are not valid UTF-8, which is unexpected. Maybe it is safer not to attempt to decode anything and to leave that to the client, who could read the bytes of the associated tag object and parse them with a suitable encoding in mind.
Even though you have already discovered a workaround, the original problem remains. A proper fix would re-evaluate the current code and prefer to work on bytes instead of a decoded string.
In fact, no, tag.object.hexsha is not what I'm looking for.
More data: technically this is an encoding error in the data itself:
In [11]: stream= pricing.odb.stream(tag.object.binsha)
In [12]: stream.read()
Out[12]: 'object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\ntype commit\ntag PROMOTED_1501131729_MKT15_01_12_QU_1\ntagger \xa8John Doe <[email protected]> 1421167136 +0100\n\nMerged CIU_MKT1501_28 to remote master\n'
You can see the offending character just before the tagger's name (technically it is part of the name). On the other hand, I don't even know whether git handles this, but what happens when different objects are encoded with different encodings? I'm pretty sure git objects do not store this kind of info...
Using the tag.object it should be straightforward to obtain the raw-bytes stored in the tag-object, in case this is what you are actually looking for. Those represent a few formatted lines of information, which could be parsed with code similar to the one currently in use.
Parsing can only safely operate on bytes though, as the encoding seems not to be UTF-8 at all times.
Even if parsing is made to work at some point, right now the tagger name is expected to be a str/unicode instance, which can't be obtained if the encoding of the underlying bytes is unknown.
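A minimal sketch of the bytes-only parsing described above, fed with the tag object dumped earlier in this thread. The helper name parse_tag_headers is hypothetical and not part of GitPython; it assumes the simple one-line-per-header layout seen in that dump and leaves every value as bytes so the caller can choose an encoding.

```python
def parse_tag_headers(raw):
    """Split a raw tag object into its header fields and message, all as bytes."""
    # Headers and message are separated by the first blank line.
    header_block, _, message = raw.partition(b"\n\n")
    fields = {}
    for line in header_block.split(b"\n"):
        # Each header line is "<key> <value>"; the keys are plain ASCII.
        key, _, value = line.partition(b" ")
        fields[key.decode("ascii")] = value
    fields["message"] = message
    return fields


# The raw bytes shown in the stream.read() output above.
raw = (
    b"object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\n"
    b"type commit\n"
    b"tag PROMOTED_1501131729_MKT15_01_12_QU_1\n"
    b"tagger \xa8John Doe <[email protected]> 1421167136 +0100\n"
    b"\n"
    b"Merged CIU_MKT1501_28 to remote master\n"
)

fields = parse_tag_headers(raw)
print(fields["object"].decode("ascii"))  # the hexsha is always ASCII
print(fields["tagger"])                  # still bytes, undecodable byte intact
```

Only the tagger value carries the problematic byte, so the object, type and tag fields can be decoded without any risk.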
What about using decode(defenc, 'ignore')? I hope it doesn't break anything else. I'll try that locally.
Great idea! Of course it's questionable whether the program should silently drop information instead of loudly aborting the operation as it currently does. It seems that it's generally unwise to make assumptions about the encoding in TagObjects, so the implementation should leave it to the client to deal with that and provide byte-strings only.
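To make the trade-off concrete, here is a small sketch comparing the error handlers on the tagger bytes from the dump above. It uses only Python's built-in bytes.decode; the 'surrogateescape' handler is not mentioned in this thread and is shown only as a Python 3 option that keeps the original byte recoverable.

```python
tagger = b"\xa8John Doe"

# Strict decoding (the default) raises, which is the failure reported here.
try:
    tagger.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = "failed: %s" % exc.reason

# 'ignore' silently drops the undecodable byte.
ignored = tagger.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD, so the loss stays visible.
replaced = tagger.decode("utf-8", "replace")

# 'surrogateescape' (Python 3) maps the byte to a lone surrogate,
# which can be encoded back to the exact original bytes.
escaped = tagger.decode("utf-8", "surrogateescape")
round_tripped = escaped.encode("utf-8", "surrogateescape")

print(strict_result)
print(ascii(ignored))
print(ascii(replaced))
print(round_tripped == tagger)
```

With 'ignore' and 'replace' the original byte cannot be reconstructed afterwards, which is the information loss discussed above.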
But that would be against your policy of handling as much as possible as unicode (if I correctly understood #312)...
BTW, that fixed my particular problem, but I guess you don't want the PR just yet...
But how would you want to produce proper unicode strings if the encoding is unclear ? It's unsafe to try it, which is showing in this example. The truth is that I am not entirely sure how git itself handles encodings, and it might be that GitPython actually went down a wrong path by trying to just decode textual data as UTF-8. The latter works most of the time, but that's not really good enough.
Maybe a suitable solution would be to allow the client to set the decode-behaviour on a per-repository basis to control whether .decode(defenc, 'ignore') is acceptable.
Doing this sounds like quite some work - and as it stands, the unicode handling in GitPython seems flawed by design :(.
I think git just doesn't handle encoding at all. In any case, any free-form byte sequences (strings) are strings for user consumption: tag names, log comments, etc. Even filenames are, I'm sure, not converted in any way. In fact, most (Unix/Linux) filesystems know nothing about encoding: it's possible to handle filenames encoded in one encoding on a system using another encoding, simply because filenames are treated as byte sequences with no specific meaning or encoding.
I have encountered a similar problem: when invoking diff on a file that contains an invalid UTF-8 sequence in this locale, GitPython fails with UnicodeDecodeError. The backtrace follows:
File "/usr/lib/python2.7/site-packages/gitupstream/gitupstream.py", line 175, in update
diff = self._repo.git.diff('--full-index', self._mainline, self._rebased)
File "/usr/lib/python2.7/site-packages/git/cmd.py", line 431, in
Will this issue cover my error, or do I need to create another one? Maybe you could help me with a solution?
@StyXman You are totally right. As stated previously, fixing this in GitPython may be a breaking change for some, as bytes would be returned instead of unicode. This makes me somewhat reluctant to attempt such a change, but I should check how much is actually affected.
@CepGamer You can pass the stdout_as_string=False keyword argument when executing .git.diff (i.e. .git.diff(..., stdout_as_string=False)), or use GitPython's own diffing facilities.
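A short sketch of that suggestion, assuming GitPython is installed and that the call returns bytes when stdout_as_string=False is passed. The function names diff_as_text and decode_lenient, and the choice of errors='replace', are illustrative only.

```python
def decode_lenient(raw, encoding="utf-8"):
    """Decode raw command output, replacing undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")


def diff_as_text(repo_path, rev_a, rev_b, encoding="utf-8"):
    """Run `git diff` through GitPython and decode its output leniently."""
    # Imported here so the decoding helper stays usable without GitPython.
    from git import Repo

    repo = Repo(repo_path)
    # stdout_as_string=False asks GitPython for the undecoded output.
    raw = repo.git.diff("--full-index", rev_a, rev_b, stdout_as_string=False)
    return decode_lenient(raw, encoding)


# The decoding step on its own, with a byte that is not valid UTF-8:
print(ascii(decode_lenient(b"+caf\xe9\n")))
```

Keeping the decode step separate lets the caller pick the encoding and error handling that suit the repository's contents.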
I believe I ran into a similar issue. When querying the commit message for a commit, the following exception is thrown:
ERROR:git.objects.commit:Failed to decode message '...' using encoding UTF-8
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/git/objects/commit.py", line 500, in _deserialize
self.message = self.message.decode(self.encoding)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 126: invalid start byte
Unfortunately, I cannot share the exact commit message or the repository, and I did not succeed in reproducing it in a test repository. Perhaps a way to disable decoding could be provided?
A new release was just made to pypi 😁 (see #298) !
Git does store the encoding, if I understand this correctly: https://git-scm.com/docs/git-commit (Discussion section pretty much at the bottom). The relevant statements from that section are:
- The contents of the blob objects are uninterpreted sequences of bytes. There is no encoding translation at the core level.
- Commit log messages are typically encoded in UTF-8, but other extended ASCII encodings are also supported. This includes ISO-8859-x, CP125x and many others, but not UTF-16/32, EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5, EUC-x, CP9xx etc.).
- The way to say this [i.e. the encoding] is to have i18n.commitencoding in the .git/config file, like this:
    [i18n]
    commitencoding = ISO-8859-1
- Commit objects created with the above setting record the value of i18n.commitencoding in its encoding header. This is to help other people who look at them later. Lack of this header implies that the commit log message is encoded in UTF-8.
The last statement in the list above is the key here. Assuming the GitPython code can access the encoding header (sorry, I'm new to GitPython development), it can safely determine the encoding, because lack of the header specifically means UTF-8. That could then be specified as the encoding in the decode() call that failed in this issue here. I think that would be superior to an approach that treats the commit message as bytes.
That still does not address the original issue of illegal characters in the encoding that was used. That could be addressed by using errors='replace' in the decode() call.
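The approach described in the two paragraphs above can be sketched as follows. This is an illustration on raw commit bytes, not GitPython's implementation; the helper name decode_commit_message and both sample commits are made up for the example.

```python
def decode_commit_message(raw, errors="replace"):
    """Decode a raw commit's message using its `encoding` header if present."""
    header_block, _, message = raw.partition(b"\n\n")
    encoding = "utf-8"  # a missing header implies UTF-8, per the git docs
    for line in header_block.split(b"\n"):
        if line.startswith(b"encoding "):
            encoding = line[len(b"encoding "):].decode("ascii")
            break
    try:
        return message.decode(encoding, errors)
    except LookupError:
        # Python does not know this encoding name: fall back to UTF-8.
        return message.decode("utf-8", errors)


# Hypothetical commit that declares ISO-8859-1 in its header.
latin1_commit = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1421167136 +0100\n"
    b"committer A U Thor <author@example.com> 1421167136 +0100\n"
    b"encoding ISO-8859-1\n"
    b"\n"
    b"Gr\xfc\xdfe\n"
)

# Hypothetical commit with no header and a byte that is invalid in UTF-8.
broken_commit = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"\n"
    b"ok \xf8\n"
)

print(ascii(decode_commit_message(latin1_commit)))
print(ascii(decode_commit_message(broken_commit)))
```

The header lookup picks the right codec when one is declared, and errors='replace' covers the remaining case where the bytes are invalid even in the declared or implied encoding.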
I fully support what @andy-maier just said here.
:-( I stumbled upon this one too. I was waiting for the v2.0.8 release, hoping for a fix. Please take this one seriously, if possible, for v2.0.9.
The .decode() call on git/objects/tag.py:56 should get a 'replace' arg to fix this issue. @ppietrasa can you try out that fix and report if that works? If so, can you make a PR for it?
I have run into this problem.
This script, which tries to loop through the tags of the nodejs/node repository, exposes this bug:
https://gist.github.com/sbenthall/14c4d14c00876440ba6d0ae62efa432f
Using version 2.1.11
I have the same issue when reading the branches property. How can I solve it?
repo = Repo(r'')
print(repo.branches)
I have a similar question.
Traceback (most recent call last):
File "D:/Python/src/post.py", line 16, in <module>
print(repo.branches)
File "D:\Programs\Python37\lib\site-packages\git\repo\base.py", line 289, in heads
return Head.list_items(self)
File "D:\Programs\Python37\lib\site-packages\git\util.py", line 922, in list_items
out_list.extend(cls.iter_items(repo, *args, **kwargs))
File "D:\Programs\Python37\lib\site-packages\git\refs\symbolic.py", line 616, in _iter_items
for _sha, rela_path in cls._iter_packed_refs(repo):
File "D:\Programs\Python37\lib\site-packages\git\refs\symbolic.py", line 91, in _iter_packed_refs
for line in fp:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 538: illegal multibyte sequence
I have a similar question too🥲
pr_repo = g.get_repo(repo_name)
"/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1258, in putheader
values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 6: ordinal not in range(256)
Details
File "/Users/mac/project/kerraform/./auto_git_api.py", line 80, in pull_request
pr_repo = g.get_repo(repo_name)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/MainClass.py", line 330, in get_repo
headers, data = self.__requester.requestJsonAndCheck("GET", url)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 354, in requestJsonAndCheck
*self.requestJson(
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 454, in requestJson
return self.__requestEncode(cnx, verb, url, parameters, headers, input, encode)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 528, in __requestEncode
status, responseHeaders, output = self.__requestRaw(
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 555, in __requestRaw
response = cnx.getresponse()
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 127, in getresponse
r = verb(
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/sessions.py", line 542, in get
return self.request('GET', url, **kwargs)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/sessions.py", line 529, in request
resp = self.send(prep, **send_kwargs)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/sessions.py", line 645, in send
r = adapter.send(request, **kwargs)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/adapters.py", line 440, in send
resp = conn.urlopen(
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1326, in _send_request
self.putheader(hdr, value)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connection.py", line 224, in putheader
_HTTPConnection.putheader(self, header, *values)
File "/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1258, in putheader
values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 6: ordinal not in range(256)