
[BUG] Unable to process PDF files containing Traditional Chinese characters, reporting encoding errors

Open ulnit opened this issue 1 year ago • 2 comments

Pre-check

  • [X] I have searched the existing issues and none cover this bug.

Description

10:36:32.953 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449514-0.PDF
10:36:32.954 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449527-0.PDF
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/site-packages/injector/__init__.py", line 800, in get
    return self._context[key]
           ~~~~~~~~~~~~~^^^^^
KeyError: <class 'private_gpt.server.ingest.ingest_service.IngestService'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
             ^^^^^^^^^^^^^^^^^^^
  File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 74, in transform_file_into_documents
    documents = IngestionHelper._load_file_to_documents(file_name, file_data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 92, in _load_file_to_documents
    return string_reader.load_data([file_data.read_text()])
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/pathlib.py", line 1059, in read_text
    return f.read()
           ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/software/private-gpt/scripts/ingest_folder.py", line 121, in <module>
    worker.ingest_folder(root_path, args.ignored)
  File "/data/software/private-gpt/scripts/ingest_folder.py", line 57, in ingest_folder
    self._ingest_all(self._files_under_root_folder)
  File "/data/software/private-gpt/scripts/ingest_folder.py", line 61, in _ingest_all
    self.ingest_service.bulk_ingest([(str(p.name), p) for p in files_to_ingest])
  File "/data/software/private-gpt/private_gpt/server/ingest/ingest_service.py", line 87, in bulk_ingest
    documents = self.ingest_component.bulk_ingest(files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/software/private-gpt/private_gpt/components/ingest/ingest_component.py", line 279, in bulk_ingest
    self._ingest_work_pool.starmap(self.ingest, files)
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
             ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/software/private-gpt/private_gpt/components/ingest/ingest_component.py", line 264, in ingest
    documents = self._file_to_documents_work_pool.apply(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 360, in apply
    return self.apply_async(func, args, kwds).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
make: *** [Makefile:52:ingest] error 1
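For reference, the failing frame is ingest_helper.py line 92, where the default string reader calls file_data.read_text(); that decodes the raw PDF bytes as UTF-8, which cannot work for a binary file. One possible explanation (a guess, not confirmed in the thread): the file names in the log end in uppercase .PDF, and if the extension-to-reader lookup is case-sensitive, these files would miss the PDF reader and fall through to the plain-text fallback. A minimal sketch of the failure (the file name is just an example):

from pathlib import Path

pdf = Path("11449514-0.PDF")   # placeholder: any real PDF reproduces this
print(pdf.read_bytes()[:5])    # b'%PDF-': the file is binary, not UTF-8 text
pdf.read_text()                # raises UnicodeDecodeError, as in the log above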

Steps to Reproduce

# PGPT_PROFILES=ollama-pg make ingest /data/private_gpt_data/s_reports/s_hk_reports/ -- --watch
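If the reader lookup really is case-sensitive about the extension, renaming the files from .PDF to .pdf before re-running the ingest may avoid the error; a sketch, with the directory as a placeholder:

from pathlib import Path

# Placeholder path: point this at the folder passed to make ingest.
root = Path("/data/private_gpt_data/s_reports/s_hk_reports")

# Rename *.PDF to *.pdf so a case-sensitive extension lookup
# resolves to the PDF reader instead of the plain-text fallback.
for p in root.rglob("*.PDF"):
    p.rename(p.with_suffix(".pdf"))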

Expected Behavior

Ingestion should complete normally.

Actual Behavior

Ingestion aborts with a UnicodeDecodeError.

Environment

CPU, Python 3.11.10

Additional Information

No response

Version

No response

Setup Checklist

  • [ ] Confirm that you have followed the installation instructions in the project’s documentation.
  • [X] Check that you are using the latest version of the project.
  • [ ] Verify disk space availability for model storage and data processing.
  • [ ] Ensure that you have the necessary permissions to run the project.

NVIDIA GPU Setup Checklist

  • [ ] Check that all CUDA dependencies are installed and compatible with your GPU (refer to CUDA's documentation).
  • [ ] Ensure an NVIDIA GPU is installed and recognized by the system (run nvidia-smi to verify).
  • [ ] Ensure proper permissions are set for accessing GPU resources.
  • [ ] Docker users - Verify that the NVIDIA Container Toolkit is configured correctly (e.g. run sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi)

ulnit commented on Nov 25 '24 02:11

I have a similar issue:

Generating embeddings: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 122, in <module>
    worker.ingest_folder(root_path, args.ignored)
  File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 58, in ingest_folder
    self._ingest_all(self._files_under_root_folder)
  File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 62, in _ingest_all
    self.ingest_service.bulk_ingest([(str(p.name), p) for p in files_to_ingest])
  File "/Users/user/AI/private-gpt/private_gpt/server/ingest/ingest_service.py", line 87, in bulk_ingest
    documents = self.ingest_component.bulk_ingest(files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_component.py", line 132, in bulk_ingest
    documents = IngestionHelper.transform_file_into_documents(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 74, in transform_file_into_documents
    documents = IngestionHelper._load_file_to_documents(file_name, file_data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 92, in _load_file_to_documents
    return string_reader.load_data([file_data.read_text()])
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1059, in read_text
    return f.read()
           ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 14: invalid start byte
make: *** [ingest] Error 1

private-gpt % cat version.txt
0.6.2

yaziciali commented on Nov 28 '24 07:11

Hi @ulnit @yaziciali, can you provide some test data? I've tried both Simplified and Traditional Chinese PDFs, and everything works fine.

settings:

server:
  env_name: ${APP_ENV:ollama}

llm:
  mode: ollama
  max_new_tokens: 512
  context_window: 3900
  temperature: 0.1     # The temperature of the model. Increasing the temperature will make the model answer more creatively. A value of 0.1 would be more factual. (Default: 0.1)

embedding:
  mode: ollama

ollama:
  llm_model: llama3.2
  embedding_model: bge-m3
  api_base: http://localhost:11434
  embedding_api_base: http://localhost:11434  # change if your embedding model runs on another ollama
  keep_alive: 5m
  tfs_z: 1.0              # Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting.
  top_k: 40               # Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)
  top_p: 0.9              # Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)
  repeat_last_n: 64       # Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)
  repeat_penalty: 1.2     # Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)
  request_timeout: 120.0  # Time elapsed until ollama times out the request. Default is 120s. Format is float.

vectorstore:
  database: qdrant

qdrant:
  path: local_data/private_gpt/qdrant


navono commented on Jan 10 '25 02:01