Fix UnicodeDecodeError

Open tmsanrinsha opened this issue 7 years ago • 4 comments

The original code does not check whether scriptencoding appears inside a comment, so a UnicodeDecodeError occurs for code such as:

" scriptencoding とは
Traceback (most recent call last):
  File "/Users/tmsanrinsha/python/bin/vint", line 11, in <module>
    sys.exit(main())
  File "/Users/tmsanrinsha/python/lib/python/site-packages/vint/__init__.py", line 11, in main
    init_cli()
  File "/Users/tmsanrinsha/python/lib/python/site-packages/vint/bootstrap.py", line 22, in init_cli
    cli.start()
  File "/Users/tmsanrinsha/python/lib/python/site-packages/vint/linting/cli.py", line 27, in start
    violations = self._lint_all(env, config_dict)
  File "/Users/tmsanrinsha/python/lib/python/site-packages/vint/linting/cli.py", line 120, in _lint_all
    violations += linter.lint_file(file_path)
  File "/Users/tmsanrinsha/python/lib/python/site-packages/vint/linting/linter.py", line 107, in lint_file
    root_ast = self._parser.parse_file(path)
  File "/Users/tmsanrinsha/python/lib/python/site-packages/vint/ast/parsing.py", line 37, in parse_file
    decoded = decoder.read(file_path)
  File "/Users/tmsanrinsha/python/lib/python/site-packages/vint/encodings/decoder.py", line 30, in read
    string = self.strategy.decode(hunk, debug_hint=debug_hint_for_the_loc)
  File "/Users/tmsanrinsha/python/lib/python/site-packages/vint/encodings/decoding_strategy.py", line 45, in decode
    string_candidate = strategy.decode(bytes_seq, debug_hint)
  File "/Users/tmsanrinsha/python/lib/python/site-packages/vint/encodings/decoding_strategy.py", line 77, in decode
    return bytes_seq.decode(encoding=encoding_part.decode(encoding='ascii'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
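
The last frame of the traceback can be reproduced in isolation. The snippet below is a minimal sketch, assuming that the bytes following scriptencoding on the commented line (the UTF-8 encoding of とは) are what ends up in encoding_part; how vint actually extracts that value is not shown in this thread.

```python
# Hypothetical stand-in for the bytes that follow "scriptencoding" on the
# commented line; the traceback calls this value encoding_part.
encoding_part = "とは".encode("utf-8")

try:
    # Same call shape as the failing line in decoding_strategy.py.
    encoding_name = encoding_part.decode(encoding="ascii")
except UnicodeDecodeError as error:
    # The first byte of the Japanese text is 0xe3, which is not ASCII.
    encoding_name = None
    message = str(error)
    print(message)
```

In other words, the text after scriptencoding is treated as an encoding name even though the whole line is a comment.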

This PR fixes the problem.

Sample code:

#!/usr/bin/env python
import re

def _split_by_scriptencoding(bytes_seq):
    # type: (bytes) -> [(str, bytes)]
    max_end_index = len(bytes_seq)
    start_index = 0
    bytes_seq_and_loc_list = []

    # Raw bytes literal so that \s reaches the regex engine unescaped.
    for m in re.finditer(br'^\s*(scriptencoding)', bytes_seq, re.MULTILINE):
        end_index = m.start(1)

        if end_index == 0:
            continue

        bytes_seq_and_loc_list.append((
            "{start_index}:{end_index}".format(start_index=start_index, end_index=end_index),
            bytes_seq[start_index:end_index]
        ))
        start_index = end_index

    bytes_seq_and_loc_list.append((
        "{start_index}:{end_index}".format(start_index=start_index, end_index=max_end_index),
        bytes_seq[start_index:max_end_index]
    ))

    return bytes_seq_and_loc_list


# Named "source" to avoid shadowing the built-in str.
source = '''scriptencoding utf-8
" scriptencoding あ
echo 'scriptencoding い'
 scriptencoding utf-8
'''

print(_split_by_scriptencoding(source.encode('utf-8')))

Output:

[('0:69', b'scriptencoding utf-8\n" scriptencoding \xe3\x81\x82\necho \'scriptencoding \xe3\x81\x84\'\n '), ('69:90', b'scriptencoding utf-8\n')]

tmsanrinsha, Mar 11 '18 03:03

Sorry for the very late reply.

We should also support the following unusual case if we can:

:::::
    \scriptencoding utf8

How do you feel about it?
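
One way to cover that case is sketched below. It assumes that Vim ignores leading colons and blanks before a command and treats a line whose first non-blank character is a backslash as a continuation of the previous line. The names SCRIPTENCODING_PATTERN and find_scriptencoding_offsets are hypothetical and are not part of vint or of this PR.

```python
import re

# Hypothetical pattern, not vint's actual code. It allows leading blanks and
# colons, plus any number of backslash continuation lines, before the command.
SCRIPTENCODING_PATTERN = re.compile(
    br'^[ \t:]*(?:\r?\n[ \t]*\\[ \t:]*)*(scriptencoding)',
    re.MULTILINE,
)


def find_scriptencoding_offsets(bytes_seq):
    """Return the byte offset of every scriptencoding command found."""
    return [m.start(1) for m in SCRIPTENCODING_PATTERN.finditer(bytes_seq)]


# The colon-and-continuation case from the comment above.
continued = b':::::\n    \\scriptencoding utf8\n'
# A commented line must still be ignored.
commented = '" scriptencoding あ\n'.encode('utf-8')

print(find_scriptencoding_offsets(continued))
print(find_scriptencoding_offsets(commented))
```

The pattern still rejects the commented line because a double quote is neither a blank, a colon, nor a continuation backslash.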

Kuniwak, Jun 18 '18 07:06

@tmsanrinsha Please reply to the last comment from @Kuniwak / provide an update.

blueyed, Nov 29 '18 15:11

Also a test would be needed.

blueyed, Nov 29 '18 15:11

@tmsanrinsha Ping. I'd like to do a new release soonish, and it would be great to have this included.

blueyed, Apr 11 '19 08:04