grimoirelab-perceval
[git] Use "backslashreplace" instead of "surrogateescape".
When decoding as UTF-8, if a character cannot be decoded, use the backslashreplace error handler instead of the surrogateescape error handler.
Fixes #18 for the git backend; other backends may need the same fix.
When I run the tests I get the following errors:
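To make the difference between the two handlers concrete, here is a small sketch using the byte string from the `test_git_encoding_error` test. The helper `can_encode_utf8` is made up for this illustration and is not part of Perceval.

```python
# Bytes taken from the commit message used in test_git_encoding_error:
# 0x93 and 0x94 are Windows-1252 quotes, which are not valid UTF-8.
raw = b"Calling \x93Open Type\x94"

# surrogateescape maps each undecodable byte to a lone surrogate (U+DC93, U+DC94).
escaped = raw.decode("utf-8", errors="surrogateescape")

# backslashreplace writes the byte out as a literal "\xNN" escape sequence.
# Note: using it for decoding requires Python 3.5 or later.
replaced = raw.decode("utf-8", errors="backslashreplace")


def can_encode_utf8(text):
    """Hypothetical helper: report whether text can be encoded back to UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True
```

The surrogateescape result contains lone surrogates, so a plain `encode("utf-8")` on it fails, while the backslashreplace result is ordinary ASCII text and encodes without problems.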
~/devel/grimoire/perceval/tests$ python3 test_git.py
....E..E....................
======================================================================
ERROR: test_git_encoding_error (__main__.TestGitBackend)
Test if encoding errors are escaped when a git log is parsed
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_git.py", line 199, in test_git_encoding_error
result = [commit for commit in commits]
File "test_git.py", line 199, in <listcomp>
result = [commit for commit in commits]
File "../perceval/backends/git.py", line 160, in parse_git_log_from_file
for commit in parser.parse():
File "../perceval/backends/git.py", line 375, in parse
for line in self.stream:
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
TypeError: don't know how to handle UnicodeDecodeError in error callback
======================================================================
ERROR: test_git_utf8_error (__main__.TestGitBackend)
Characters that cannot decoded as utf8 can be later encoded as utf8.
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_git.py", line 220, in test_git_utf8_error
commit = [commit for commit in commits][0]
File "test_git.py", line 220, in <listcomp>
commit = [commit for commit in commits][0]
File "../perceval/backends/git.py", line 160, in parse_git_log_from_file
for commit in parser.parse():
File "../perceval/backends/git.py", line 375, in parse
for line in self.stream:
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
TypeError: don't know how to handle UnicodeDecodeError in error callback
----------------------------------------------------------------------
Ran 28 tests in 0.176s
FAILED (errors=2)
I had forgotten to run all the tests, sorry. When I just did, curiously enough, I got a different error:
$ python3 test_git.py
....F.......................
======================================================================
FAIL: test_git_encoding_error (__main__.TestGitBackend)
Test if encoding errors are escaped when a git log is parsed
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_git.py", line 205, in test_git_encoding_error
self.assertEqual(commit['message'], 'Calling \udc93Open Type\udc94 (CTRL+SHIFT+T) after startup - performance improvement.')
AssertionError: 'Calling \\x93Open Type\\x94 (CTRL+SHIFT+T) after s[29 chars]ent.' != 'Calling \udc93Open Type\udc94 (CTRL+SHIFT+T) after[31 chars]ent.'
- Calling \x93Open Type\x94 (CTRL+SHIFT+T) after startup - performance improvement.
? ^^^^ ^^^^
+ Calling \udc93Open Type\udc94 (CTRL+SHIFT+T) after startup - performance improvement.
? ^ ^
----------------------------------------------------------------------
Ran 28 tests in 0.394s
FAILED (failures=1)
I'm going to fix this one by changing the expected result: the failure comes from the change in how undecodable bytes are escaped (\x93 instead of \udc93). Then I will have a look at the errors that are raised for you...
I just amended the commit and force-pushed it; on my side it passes all tests:
$ python3 test_git.py
............................
----------------------------------------------------------------------
Ran 28 tests in 0.388s
OK
Would you mind checking once again? Maybe I missed something, but I cannot reproduce the problem you see...
It's still failing, but I found out why. It looks like backslashreplace was not supported for decoding before Python 3.5, although the documentation says the opposite.
Accepting this change means Perceval will only work with Python 3.5 or later. I don't see any problem with that right now, but in Ubuntu 15.10, for instance, the default version is 3.4.
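The `TypeError` in the tracebacks above is what an error handler raises when it does not support decoding, so support can be probed at run time rather than by comparing version numbers. A minimal sketch of that idea; `decode_handler_works` is a hypothetical helper, not something Perceval provides.

```python
def decode_handler_works(name):
    """Return True if the error handler `name` can be used when decoding."""
    try:
        # b"\x93" is not valid UTF-8, so the error handler is always invoked.
        b"\x93".decode("utf-8", errors=name)
    except UnicodeDecodeError:
        # Handlers such as "strict" support decoding but re-raise the error.
        return True
    except (TypeError, LookupError):
        # TypeError: the handler exists but rejects UnicodeDecodeError.
        # LookupError: no handler is registered under this name.
        return False
    return True


# Pick backslashreplace where it works, otherwise keep the old behaviour.
if decode_handler_works("backslashreplace"):
    errors = "backslashreplace"
else:
    errors = "surrogateescape"
```

Falling back to surrogateescape here is only one possible choice; it keeps older interpreters running but does not fix the re-encoding problem on them.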
Uhhhm. I hadn't noticed either :-( Now that I sort of understand the problem, I can find another option and produce some code that depends on Python being < 3.5. But maybe we could do that in a separate PR, to get this one through and let it work on git repos such as that of the Linux kernel, which needs it...
@sduenas do you think we could make this change, or something similar? If so, I can update the patch. Otherwise, we'd better close the PR.
Here is a way to do the same operation in a Python 2 and Python 3 compatible way.
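One approach that avoids the version dependency is to register a custom error handler with `codecs.register_error`. The sketch below is only an illustration and not necessarily the snippet the commenter had in mind; the handler name `backslashescape` is made up here. It avoids Python 3 only syntax, but that it behaves the same on Python 2.7 is an assumption.

```python
import codecs


def backslashescape(exc):
    """Replace each undecodable byte with a literal \\xNN escape sequence."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only handles decoding errors.
        raise exc
    # bytearray yields integers on both Python 2 and Python 3.
    bad = bytearray(exc.object[exc.start:exc.end])
    escaped = u"".join(u"\\x%02x" % byte for byte in bad)
    # Return the replacement text and the position where decoding resumes.
    return escaped, exc.end


# "backslashescape" is a hypothetical name chosen for this example.
codecs.register_error("backslashescape", backslashescape)

raw = b"Calling \x93Open Type\x94"
text = raw.decode("utf-8", errors="backslashescape")
```

Once registered, the handler name can be passed anywhere an `errors` argument is accepted, and the decoded text can be encoded back to UTF-8 without errors.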