
Ingest Initialization Error

Open · quartet opened this issue 1 year ago · 1 comment

I encountered an error while running the ingest.py script. The error message indicates an issue with the sentence-transformers model configuration file. Here are the steps to reproduce the bug:

  1. Clone the PrivateGPT repository from GitHub.
  2. Set up the environment with the required dependencies as mentioned in the repository's documentation.
  3. Execute the ingest.py script with the following command:
     python ingest.py

Expected behavior

I expected the ingest.py script to run successfully and process the documents without any errors.

Environment

OS / hardware: Ubuntu 20.04 LTS / Intel Core i7 / 16GB RAM / 512GB SSD
Python version: 3.9.6
Other relevant information: I have followed the installation instructions provided in the PrivateGPT repository and have the required packages installed.

Additional context

The ggml-model-q4_0.bin file is located in the models/ directory. I have verified that the file is a valid JSON configuration file; however, the error message suggests that it is not recognized as valid JSON during execution of the script.

Error:

$ python ingest.py
No sentence-transformers model found with name models/ggml-model-q4_0.bin. Creating a new one with MEAN pooling.
Traceback (most recent call last):
  File "/home/quartet/PrivateGPT/privategpt/lib/python3.10/site-packages/transformers/configuration_utils.py", line 659, in _get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/home/quartet/PrivateGPT/privategpt/lib/python3.10/site-packages/transformers/configuration_utils.py", line 750, in _dict_from_json_file
    text = reader.read()
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 24: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/quartet/PrivateGPT/privategpt/privateGPT/ingest.py", line 170, in <module>
    main()
  File "/home/quartet/PrivateGPT/privategpt/privateGPT/ingest.py", line 147, in main
    embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
  File "/home/quartet/PrivateGPT/privategpt/lib/python3.10/site-packages/langchain/embeddings/huggingface.py", line 54, in __init__
    self.client = sentence_transformers.SentenceTransformer(
  File "/home/quartet/PrivateGPT/privategpt/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 97, in __init__
    modules = self._load_auto_model(model_path)
  File "/home/quartet/PrivateGPT/privategpt/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 806, in _load_auto_model
    transformer_model = Transformer(model_name_or_path)
  File "/home/quartet/PrivateGPT/privategpt/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 28, in __init__
    config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir)
  File "/home/quartet/PrivateGPT/privategpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/quartet/PrivateGPT/privategpt/lib/python3.10/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/quartet/PrivateGPT/privategpt/lib/python3.10/site-packages/transformers/configuration_utils.py", line 662, in _get_config_dict
    raise EnvironmentError(
OSError: It looks like the config file at 'models/ggml-model-q4_0.bin' is not a valid JSON file.
$ cat .env
PERSIST_DIRECTORY=db
#LLAMA_EMBEDDINGS_MODEL=models/ggml-model-q4_0.bin
LLAMA_EMBEDDINGS_MODEL=models/ggml-model-q4_0.bin
MODEL_TYPE=GPT4All
MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
MODEL_N_CTX=1000
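For context, the decode failure can be reproduced in isolation: a GGML model file is raw binary, so any code path that tries to read it as UTF-8 JSON (which is what happens when sentence-transformers is handed a path without a config.json) fails on the first non-ASCII byte. A minimal sketch, using a made-up stand-in file rather than the real model:

```python
import json
import os
import tempfile

# Write a few binary bytes (including 0x80, an invalid UTF-8 start byte)
# to a stand-in .bin file. These bytes are illustrative, not a real GGML header.
fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    f.write(b"ggjt\x80\x00\x01\x02")

try:
    # transformers' _dict_from_json_file does essentially this:
    with open(path, "r", encoding="utf-8") as f:
        json.load(f)
except UnicodeDecodeError as e:
    print("decode failed:", e.reason)  # decode failed: invalid start byte
finally:
    os.remove(path)
```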

quartet avatar May 30 '23 22:05 quartet

FYI, this is a known broader issue with Python and Windows 10

My solution was to add the following to both ingest.py and privateGPT.py - just before the load_dotenv() call.

HTH


import sys
import io

# Force stdout to UTF-8 regardless of the platform's default console encoding.
# detach() hands over the underlying binary buffer so it can be re-wrapped.
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='utf-8')
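For what it's worth, on Python 3.7+ the same effect can be had without detaching and re-wrapping the stream, via TextIOWrapper.reconfigure (a sketch; the hasattr guard is there in case stdout has been replaced by a non-reconfigurable object):

```python
import sys

# Python 3.7+: change the stream's encoding in place instead of re-wrapping it.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")
    sys.stderr.reconfigure(encoding="utf-8")
```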

djplaner avatar Jun 08 '23 01:06 djplaner