private-gpt
Everything I try to ingest causes an encoding error
On Windows 10, Python 3.11.6
Using bulk ingestion, with the command:
poetry run python scripts/ingest_folder.py "folder\path"
I keep getting this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 169: character maps to <undefined>
After this, ingestion stops. In fact, judging from the Gradio interface, nothing has been ingested at all.
I tried different folders, with mixed kinds of files (emails, ebooks...). Sooner or later it encounters a file which breaks the process.
I have seen similar earlier issues resolved by an encoding-related change to a file named ingest.py, which, however, I cannot find anywhere.
Full console error message:
Traceback (most recent call last):
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 41, in <module>
    _recursive_ingest_folder(path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 28, in _recursive_ingest_folder
    _recursive_ingest_folder(file_path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 28, in _recursive_ingest_folder
    _recursive_ingest_folder(file_path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 26, in _recursive_ingest_folder
    _do_ingest(file_path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 34, in _do_ingest
    ingest_service.ingest(changed_path.name, changed_path)
  File "I:\privateGPT-0.0.2\private_gpt\server\ingest\ingest_service.py", line 80, in ingest
    text = file_data.read_text()
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<user>\.pyenv\pyenv-win\versions\3.11.6\Lib\pathlib.py", line 1059, in read_text
    return f.read()
           ^^^^^^^^
  File "C:\Users\<user>\.pyenv\pyenv-win\versions\3.11.6\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 169: character maps to <undefined>
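
Reading the traceback, the failing call is file_data.read_text() with no encoding argument, so on this Windows machine Python falls back to cp1252, which has no mapping for byte 0x8d. Below is a minimal sketch of a more tolerant read, assuming the files are mostly UTF-8. The helper name read_text_lenient is hypothetical and is not part of private-gpt; it only illustrates the kind of change that might avoid the crash, not the project's actual fix.

```python
from pathlib import Path


def read_text_lenient(path: Path) -> str:
    """Read a file as UTF-8; on invalid bytes, substitute U+FFFD instead of raising.

    Hypothetical helper for illustration only (not part of private-gpt).
    """
    try:
        # An explicit encoding avoids the platform default (cp1252 here).
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # Second attempt: undecodable bytes become the replacement character.
        return path.read_text(encoding="utf-8", errors="replace")
```

With something like this, a file containing a stray 0x8d byte would be read with a replacement character in that position rather than stopping the whole run. Separately, setting the environment variable PYTHONUTF8=1 before running the script switches Python to UTF-8 mode, which changes the default encoding used by read_text(); I have not confirmed whether that alone is enough for every file type here.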