grimoirelab-perceval

[git] Use "backslashreplace" instead of "surrogateescape".

Open jgbarah opened this issue 8 years ago • 7 comments

When decoding as UTF-8, if a character cannot be decoded, use the backslashreplace error handler instead of the surrogateescape error handler.

Fixes #18 for the git backend; other backends may need the same fix.
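
For reference, a minimal sketch of how the two error handlers differ. The sample bytes come from the commit message used in the tests; the variable names are only illustrative, and the sketch assumes Python 3.5 or later, where backslashreplace is available for decoding.

```python
# Bytes from the test commit message: 0x93 and 0x94 are not valid UTF-8.
raw = b"Calling \x93Open Type\x94"

# surrogateescape maps each bad byte to a lone surrogate (U+DC93, U+DC94).
escaped = raw.decode("utf-8", errors="surrogateescape")

# backslashreplace writes each bad byte as a literal \xNN escape sequence.
# Assumption: Python 3.5+, where this handler is available for decoding.
replaced = raw.decode("utf-8", errors="backslashreplace")

# The surrogate form cannot be re-encoded with the default strict handler...
try:
    escaped.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...while the backslash form is plain ASCII text and encodes cleanly.
encoded = replaced.encode("utf-8")

print(repr(escaped))
print(repr(replaced))
print(strict_ok)
```

The trade-off: the surrogateescape result can be turned back into the original bytes, but only by encoding with surrogateescape again, while the backslashreplace result can be encoded by anything at the cost of no longer being distinguishable from text that really contained a backslash.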

jgbarah avatar Mar 18 '16 22:03 jgbarah

When I run the tests, I get the following errors:

~/devel/grimoire/perceval/tests$ python3 test_git.py 
....E..E....................
======================================================================
ERROR: test_git_encoding_error (__main__.TestGitBackend)
Test if encoding errors are escaped when a git log is parsed
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_git.py", line 199, in test_git_encoding_error
    result = [commit for commit in commits]
  File "test_git.py", line 199, in <listcomp>
    result = [commit for commit in commits]
  File "../perceval/backends/git.py", line 160, in parse_git_log_from_file
    for commit in parser.parse():
  File "../perceval/backends/git.py", line 375, in parse
    for line in self.stream:
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
TypeError: don't know how to handle UnicodeDecodeError in error callback

======================================================================
ERROR: test_git_utf8_error (__main__.TestGitBackend)
Characters that cannot decoded as utf8 can be later encoded as utf8.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_git.py", line 220, in test_git_utf8_error
    commit = [commit for commit in commits][0]
  File "test_git.py", line 220, in <listcomp>
    commit = [commit for commit in commits][0]
  File "../perceval/backends/git.py", line 160, in parse_git_log_from_file
    for commit in parser.parse():
  File "../perceval/backends/git.py", line 375, in parse
    for line in self.stream:
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
TypeError: don't know how to handle UnicodeDecodeError in error callback

----------------------------------------------------------------------
Ran 28 tests in 0.176s

FAILED (errors=2)

sduenas avatar Mar 19 '16 19:03 sduenas

I had forgotten about running all tests, sorry. But when I just did, curiously enough I get a different error:

$ python3 test_git.py 
....F.......................
======================================================================
FAIL: test_git_encoding_error (__main__.TestGitBackend)
Test if encoding errors are escaped when a git log is parsed
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_git.py", line 205, in test_git_encoding_error
    self.assertEqual(commit['message'], 'Calling \udc93Open Type\udc94 (CTRL+SHIFT+T) after startup - performance improvement.')
AssertionError: 'Calling \\x93Open Type\\x94 (CTRL+SHIFT+T) after s[29 chars]ent.' != 'Calling \udc93Open Type\udc94 (CTRL+SHIFT+T) after[31 chars]ent.'
- Calling \x93Open Type\x94 (CTRL+SHIFT+T) after startup - performance improvement.
?         ^^^^         ^^^^
+ Calling \udc93Open Type\udc94 (CTRL+SHIFT+T) after startup - performance improvement.
?         ^         ^


----------------------------------------------------------------------
Ran 28 tests in 0.394s

FAILED (failures=1)

I'm going to fix this one (which is caused by the new error handler changing how undecodable bytes are escaped) by changing the expected result. Then I will have a look at the errors that are raised for you...

jgbarah avatar Mar 19 '16 19:03 jgbarah

I just amended the commit and force-pushed it; on my side it passes all tests:

$ python3 test_git.py 
............................
----------------------------------------------------------------------
Ran 28 tests in 0.388s

OK

Would you mind checking once again? Maybe I missed something, but I cannot reproduce the problem you see...

jgbarah avatar Mar 19 '16 20:03 jgbarah

It's still failing, but I found out why. It looks like backslashreplace was not supported for decoding before Python 3.5, although the documentation says the opposite.

If we accept this change, Perceval will only work with Python 3.5 or later. I don't see any problem with that right now, but in Ubuntu 15.10, for instance, the default version is 3.4.

sduenas avatar Mar 21 '16 11:03 sduenas

Uhhhm. I hadn't noticed either :-( Now that I sort of understand the problem, I can find another option and produce some code that depends on Python being < 3.5. But maybe we could do that in a separate PR, to get this one through and let it work with git repos that need it, such as the Linux kernel's...
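
A rough sketch of what such version-dependent code could look like, not a tested patch for Perceval. The handler name `backslashreplace_decode` and the constant `DECODE_ERRORS` are made up for illustration; only `codecs.register_error` and the attributes of `UnicodeDecodeError` are standard.

```python
import codecs
import sys


def backslashreplace_decode(error):
    """Hypothetical fallback: write undecodable bytes as \\xNN escapes."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad_bytes = error.object[error.start:error.end]
    replacement = "".join("\\x{:02x}".format(byte) for byte in bad_bytes)
    # Resume decoding right after the offending bytes.
    return replacement, error.end


# The handler name is made up for this sketch.
codecs.register_error("backslashreplace_decode", backslashreplace_decode)

# Use the built-in handler where it can decode, the fallback elsewhere.
if sys.version_info >= (3, 5):
    DECODE_ERRORS = "backslashreplace"
else:
    DECODE_ERRORS = "backslashreplace_decode"

raw = b"Calling \x93Open Type\x94"
text = raw.decode("utf-8", errors=DECODE_ERRORS)
fallback_text = raw.decode("utf-8", errors="backslashreplace_decode")
print(text)
```

The same handler name could then be passed as the `errors` argument wherever the git log stream is opened, so the parser itself would not need to know which Python version it runs on.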

jgbarah avatar Mar 22 '16 00:03 jgbarah

@sduenas do you think we could make this change, or something similar? If so, I can update the patch. Otherwise, we'd better close the PR.

jgbarah avatar Aug 05 '17 17:08 jgbarah

Here is a way to do the same operation in a Python 2 and Python 3 compatible way.

abitrolly avatar Dec 25 '18 20:12 abitrolly