semantic-code-search
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 132693: invalid start byte
Getting this error during generation of embeddings:
Traceback (most recent call last):
  File "/home/user/.local/bin/sem", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/cli.py", line 84, in main
    query_func(args)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/cli.py", line 38, in query_func
    do_query(args, model)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/query.py", line 51, in do_query
    do_embed(args, model)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/embed.py", line 82, in do_embed
    functions = _get_repo_functions(
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/embed.py", line 71, in _get_repo_functions
    file_content = f.read()
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 132693: invalid start byte
after it already successfully processed quite a few files:
27%|████████████████████████▎ | 35036/130013 [00:33<01:30, 1047.23it/s]
As a workaround, I just wrapped the affected lines in a try/except:
def _get_repo_functions(root, supported_file_extensions, relevant_node_types):
    functions = []
    print('Extracting functions from {}'.format(root))
    for fp in tqdm([root + '/' + f for f in os.popen('git -C {} ls-files'.format(root)).read().split('\n')]):
        if not os.path.isfile(fp):
            continue
        with open(fp, 'r') as f:
            lang = supported_file_extensions.get(fp[fp.rfind('.'):])
            if lang:
                try:
                    parser = get_parser(lang)
                    file_content = f.read()
                    tree = parser.parse(bytes(file_content, 'utf8'))
                    all_nodes = list(_traverse_tree(tree.root_node))
                    functions.extend(_extract_functions(
                        all_nodes, fp, file_content, relevant_node_types))
                except Exception as e:
                    # Report the file and keep going instead of aborting the whole run
                    print(f"Hit error while parsing {fp}: {e}")
    return functions
With this in place, the run reports quite a lot of third-party files in my repo. Since these are third-party, I cannot update or fix them. Should sem be made robust against such issues?
Maybe the requirement to have UTF-8 encoding for the files could be dropped. Ideas: https://stackoverflow.com/questions/22216076/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-s
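To make that idea concrete, here is a minimal sketch of reading a file without insisting on valid UTF-8. It relies only on the standard `errors='replace'` option of `open`, which turns each undecodable byte into U+FFFD instead of raising. The helper name `read_text_lenient` is made up for illustration and is not part of sem.

```python
def read_text_lenient(path):
    """Read a source file as text, tolerating bytes that are not valid UTF-8.

    Hypothetical helper, not part of semantic-code-search.
    """
    # errors='replace' substitutes U+FFFD for each undecodable byte,
    # so a stray 0xb9 no longer raises UnicodeDecodeError.
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        return f.read()
```

In the workaround above, this would presumably amount to opening the file with `errors='replace'` rather than catching the exception, so the affected files would still be indexed, with the odd replacement character in comments or string literals. I have not tested this against sem itself.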
Using your code, I identified the non-UTF-8 files and changed their encodings, then restarted sem; it now gets through the fixed files.
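For anyone taking the same route, a rough sketch of how the offending files could be listed and converted. Both function names are hypothetical, and the `latin-1` default is only a guess at the original encoding: check what your files actually use before converting, since a wrong guess silently produces wrong characters.

```python
import os


def find_non_utf8_files(root):
    """Yield paths under root whose contents are not valid UTF-8."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git metadata; it is binary and not part of the sources
        dirnames[:] = [d for d in dirnames if d != '.git']
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                data = f.read()
            try:
                data.decode('utf-8')
            except UnicodeDecodeError:
                yield path


def reencode_to_utf8(path, source_encoding='latin-1'):
    """Rewrite a file in place as UTF-8, assuming it is in source_encoding."""
    with open(path, 'rb') as f:
        data = f.read()
    text = data.decode(source_encoding)
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))
```

Note that this also flags genuinely binary files (images, archives), which should be left alone rather than re-encoded.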