
Everything I have been trying to ingest caused an encoding error

Open unalignedcoder opened this issue 1 year ago • 6 comments

On Windows 10, Python 3.11.6

Using bulk ingestion, with the command: poetry run python scripts/ingest_folder.py "folder\path"

I keep getting this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 169: character maps to <undefined>, after which ingestion stops. In fact, judging from the Gradio interface, nothing has been ingested at all.
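For context, the message itself can be reproduced without any files. The sketch below assumes the failing document is UTF-8 text containing a non-ASCII character; the character used here is only an illustration, since the actual file contents are not known. Byte 0x8d has no mapping in Python's cp1252 table, so decoding UTF-8 bytes with that codec fails in the same way.

```python
# Illustration only: the real file that failed is unknown.
# "\u010d" (c with caron) encodes to b"\xc4\x8d" in UTF-8.
data = "\u010d".encode("utf-8")

try:
    # cp1252 is the codec named in the traceback; 0x8d is undefined in it.
    data.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print(exc)

# The same bytes decode cleanly when UTF-8 is requested explicitly.
decoded = data.decode("utf-8")
print(decoded)
```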

I tried different folders, with mixed kinds of files (emails, ebooks...). Sooner or later it encounters a file which breaks the process.

I have seen similar earlier issues resolved by an encoding-related change to a file named ingest.py, but I cannot find that file anywhere.

Full console error message:

Traceback (most recent call last):
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 41, in <module>
    _recursive_ingest_folder(path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 28, in _recursive_ingest_folder
    _recursive_ingest_folder(file_path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 28, in _recursive_ingest_folder
    _recursive_ingest_folder(file_path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 26, in _recursive_ingest_folder
    _do_ingest(file_path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 34, in _do_ingest
    ingest_service.ingest(changed_path.name, changed_path)
  File "I:\privateGPT-0.0.2\private_gpt\server\ingest\ingest_service.py", line 80, in ingest
    text = file_data.read_text()
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<user>\.pyenv\pyenv-win\versions\3.11.6\Lib\pathlib.py", line 1059, in read_text
    return f.read()
           ^^^^^^^^
  File "C:\Users\<user>\.pyenv\pyenv-win\versions\3.11.6\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 169: character maps to <undefined>
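
The traceback shows `read_text()` being called without an `encoding` argument, so on this machine Python falls back to the Windows locale codec (cp1252). A minimal sketch of one possible workaround follows. `read_text_with_fallback` is a hypothetical helper, not part of privateGPT, and the choice of UTF-8 first with cp1252 as a fallback is an assumption about what the files contain.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path: Path) -> str:
    """Hypothetical helper (not privateGPT code): try UTF-8, then cp1252."""
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # errors="replace" substitutes U+FFFD for undecodable bytes,
        # so this second attempt does not raise UnicodeDecodeError.
        return path.read_text(encoding="cp1252", errors="replace")


# Small demonstration with two temporary files in different encodings.
with tempfile.TemporaryDirectory() as tmp:
    utf8_file = Path(tmp) / "utf8.txt"
    utf8_file.write_bytes("\u010d".encode("utf-8"))  # contains byte 0x8d

    legacy_file = Path(tmp) / "legacy.txt"
    legacy_file.write_bytes(b"caf\xe9")  # cp1252 bytes, not valid UTF-8

    utf8_text = read_text_with_fallback(utf8_file)
    legacy_text = read_text_with_fallback(legacy_file)
```

Another option that needs no code change may be Python's UTF-8 mode (setting the `PYTHONUTF8=1` environment variable, or running with `-X utf8`), which makes UTF-8 the default for text file reads. Whether that is enough depends on the files actually being UTF-8.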

unalignedcoder, Nov 01 '23 15:11