private-gpt
gpt_tokenize: unknown token '▒'
I ran the repo with the default settings and asked "How are you today?" The script printed "gpt_tokenize: unknown token '▒'" about 50 times, then it started to give the answer.
same bug
Same for me; it takes about 10 minutes per prompt.
The script is still working in the background. The weird text is whatever could not be read by the LLM. Just leave the script running and it should output the result shortly after.
I faced the same issue; however, it doesn't give an answer: after many lines of gpt_tokenize it says Killed and terminates the script. Any remedies?
Same for me
$ python3 privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
Enter a query: what's going on?
gpt_tokenize: unknown token '�'
[... the line above repeated 48 times ...]
Killed
Same for me, Python 3.10.6.
I also tried on another Ubuntu machine with Python 3.10.8 and got Failed building wheel for llama-cpp-python while installing dependencies. I managed to resolve this after a while by pointing the pip build at GCC 11:
CXX=g++-11 CC=gcc-11 pip install -r requirements.txt
And surprisingly, this time it worked without the gpt_tokenize: unknown token '�' error. I still don't know what caused the problem.
Tried what @tanhm12 did, but unfortunately it still gives the same error. Any other fixes? I am working with Ubuntu on WSL, btw.
$ python3 privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
Enter a query: hello there
gpt_tokenize: unknown token '�'
[... the line above repeated 9 times ...]
Killed
See https://github.com/imartinez/privateGPT/issues/180 and https://github.com/imartinez/privateGPT/issues/214 This is a duplicate of many other issues.
Got the same error in Google Colab and in Amazon SageMaker Studio Lab as well.
I have found that it is actually an issue with the input text. I analysed my input file and removed the special characters, and then it worked fine. I noticed the problem occurs when there is a "ctrl+enter" as the end-of-line rather than a plain "enter". After I removed all of these "ctrl+enter" line endings, it is somehow working fine in my case.
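For anyone who wants to try the same cleanup, here is a minimal sketch. It assumes the "ctrl+enter" endings are Windows-style CR/CRLF line endings, and the file path and helper name are my own for illustration, not part of privateGPT. It drops bytes that don't decode as UTF-8 (the source of the '�' replacement characters) and normalizes line endings before ingestion:

```python
# Hypothetical cleanup helper: strip invalid UTF-8 bytes and normalize
# CR / CRLF ("ctrl+enter") line endings to plain LF before feeding the
# file to the ingestion script.
from pathlib import Path

def clean_text_file(path: str) -> None:
    raw = Path(path).read_bytes()
    # Decode while silently discarding byte sequences that are not valid
    # UTF-8 -- these are what gpt_tokenize reports as unknown tokens.
    text = raw.decode("utf-8", errors="ignore")
    # Normalize Windows (\r\n) and old-Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    Path(path).write_bytes(text.encode("utf-8"))

# Example (path is hypothetical):
# clean_text_file("source_documents/mydoc.txt")
```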
@gaurav-cointab what does ctrl+enter mean? I also cleared all the special characters from my input file (e.g. the default input in the union file), but I'm still getting the error. Please tell me what you did; I mean, what is ctrl+enter?