[BUG] Unable to process PDF files containing Traditional Chinese characters, reporting encoding errors
Pre-check
- [X] I have searched the existing issues and none cover this bug.
Description
```
10:36:32.953 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449514-0.PDF
10:36:32.954 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449527-0.PDF
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/site-packages/injector/__init__.py", line 800, in get
    return self._context[key]
           ~~~~~~~~~~~~~^^^^^
KeyError: <class 'private_gpt.server.ingest.ingest_service.IngestService'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
             ^^^^^^^^^^^^^^^^^^^
  File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 74, in transform_file_into_documents
    documents = IngestionHelper._load_file_to_documents(file_name, file_data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 92, in _load_file_to_documents
    return string_reader.load_data([file_data.read_text()])
                                    ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/pathlib.py", line 1059, in read_text
    return f.read()
           ^^^^^^^^
  File "

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/software/private-gpt/scripts/ingest_folder.py", line 121, in <module>
```
Steps to Reproduce
```shell
PGPT_PROFILES=ollama-pg make ingest /data/private_gpt_data/s_reports/s_hk_reports/ -- --watch
```
Expected Behavior
Files are ingested normally.
Actual Behavior
Ingestion fails with a UnicodeDecodeError.
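The error itself is easy to reproduce outside private-gpt: the traceback ends in `Path.read_text()`, which decodes the file with a text codec, and a binary PDF almost always contains byte sequences that are not valid UTF-8. A minimal sketch (the file name and bytes are made up for illustration):

```python
# Minimal repro sketch: decoding binary PDF bytes as text fails, which is
# the same read_text() call shown in the traceback above.
from pathlib import Path

pdf = Path("sample.pdf")  # hypothetical test file
# "%\xe2\xe3\xcf\xd3" is the binary-marker comment commonly found after a PDF header.
pdf.write_bytes(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n")

try:
    pdf.read_text(encoding="utf-8")
except UnicodeDecodeError as exc:
    print(f"Fails as in the report: {exc}")
```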
Environment
CPU; Python 3.11.10
Additional Information
No response
Version
No response
Setup Checklist
- [ ] Confirm that you have followed the installation instructions in the project’s documentation.
- [X] Check that you are using the latest version of the project.
- [ ] Verify disk space availability for model storage and data processing.
- [ ] Ensure that you have the necessary permissions to run the project.
NVIDIA GPU Setup Checklist
- [ ] Check that all CUDA dependencies are installed and compatible with your GPU (refer to the CUDA documentation)
- [ ] Ensure an NVIDIA GPU is installed and recognized by the system (run `nvidia-smi` to verify).
- [ ] Ensure proper permissions are set for accessing GPU resources.
- [ ] Docker users - Verify that the NVIDIA Container Toolkit is configured correctly (e.g. run `sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi`)
I have a similar issue:

```
Generating embeddings: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 122, in <module>
```

```
private-gpt % cat version.txt
0.6.2
```
@ulnit @yaziciali Hi, can you provide some test data? I've tried Simplified and Traditional Chinese PDFs, and everything works fine on my end.
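If it helps, here is one way to synthesize a small Traditional Chinese test PDF (a sketch using reportlab, which is an extra dependency and not part of private-gpt; the CID font and output file name are arbitrary choices):

```python
# Sketch: generate a minimal Traditional Chinese PDF for testing ingestion.
# reportlab ships Adobe CID font support; "MSung-Light" covers Traditional Chinese.
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.cidfonts import UnicodeCIDFont
from reportlab.pdfgen import canvas

pdfmetrics.registerFont(UnicodeCIDFont("MSung-Light"))
c = canvas.Canvas("traditional_chinese_test.pdf")
c.setFont("MSung-Light", 14)
c.drawString(72, 720, "繁體中文測試文件")  # "Traditional Chinese test document"
c.save()
```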
My settings:

```yaml
server:
  env_name: ${APP_ENV:ollama}

llm:
  mode: ollama
  max_new_tokens: 512
  context_window: 3900
  temperature: 0.1 # The temperature of the model. Increasing the temperature will make the model answer more creatively. A value of 0.1 would be more factual. (Default: 0.1)

embedding:
  mode: ollama

ollama:
  llm_model: llama3.2
  embedding_model: bge-m3
  api_base: http://localhost:11434
  embedding_api_base: http://localhost:11434 # change if your embedding model runs on another ollama
  keep_alive: 5m
  tfs_z: 1.0 # Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting.
  top_k: 40 # Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)
  top_p: 0.9 # Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)
  repeat_last_n: 64 # Sets how far back the model looks to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)
  repeat_penalty: 1.2 # Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)
  request_timeout: 120.0 # Time elapsed until ollama times out the request. Default is 120s. Format is float.

vectorstore:
  database: qdrant

qdrant:
  path: local_data/private_gpt/qdrant
```
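One thing worth checking: the failing files in the log have an uppercase `.PDF` extension, and the traceback shows them falling through to the plain-text string reader (`file_data.read_text()`) rather than a PDF parser, which suggests the extension-based reader lookup may be case-sensitive. As a quick experiment (an untested sketch; the folder path is taken from the reproduction command above), lowercasing the extensions before ingesting should route the files to the PDF reader if that is the cause:

```python
# Untested workaround sketch: normalize uppercase .PDF extensions so the
# extension-based reader lookup can match a PDF parser instead of falling
# back to Path.read_text(), which raises UnicodeDecodeError on binary data.
from pathlib import Path

folder = Path("/data/private_gpt_data/s_reports/s_hk_reports")  # path from the repro command
for pdf in folder.glob("*.PDF"):  # glob is case-sensitive on Linux
    pdf.rename(pdf.with_suffix(".pdf"))
```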