
gpt_tokenize: unknown token '▒'

Open gaurav-cointab opened this issue 1 year ago • 11 comments

I ran the repo with the default settings and asked "How are you today?" The code printed "gpt_tokenize: unknown token '▒'" about 50 times before it started to give the answer.

May 18 '23 03:05 gaurav-cointab

same bug

May 18 '23 05:05 hchenphd

same for me, takes about 10 minutes for each prompt

May 18 '23 06:05 myonster

The script is still working in the background. The weird characters are text that the LLM could not read. Just leave the script running and it should output the result shortly afterwards.

May 18 '23 12:05 jondoescoding

I faced the same issue, but it doesn't give an answer: after many lines of gpt_tokenize it prints Killed and the script terminates. Any remedies?

May 19 '23 03:05 rohankalbag

Same for me

$ python3 privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285

Enter a query: what's going on?
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
Killed

May 19 '23 11:05 lukastillmann

Same for me, on Python 3.10.6. I also tried another Ubuntu machine with Python 3.10.8 and got "Failed building wheel for llama-cpp-python" while installing dependencies. After a while I resolved this by pointing pip at gcc 11:

CXX=g++-11 CC=gcc-11 pip install -r requirements.txt

Surprisingly, this time it worked without the gpt_tokenize: unknown token '�' error. I still don't know what caused the problem.

May 20 '23 04:05 tanhm12

I tried what @tanhm12 did, but unfortunately it still gives the same error. Any other fixes? I am working with Ubuntu on WSL, by the way.

$ python3 privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285

Enter a query: hello there
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
Killed

May 20 '23 10:05 rohankalbag

See https://github.com/imartinez/privateGPT/issues/180 and https://github.com/imartinez/privateGPT/issues/214. This is a duplicate of many other issues.

May 20 '23 10:05 PulpCattel

I got the same error in Google Colab and in Amazon SageMaker lab as well.

May 21 '23 16:05 satyamroy001

I found that the problem is actually in the input text. I analysed my input file, removed these special characters, and then it worked fine. The error showed up where a line ended with a "ctrl+enter" break rather than a plain "enter". I removed all of these "ctrl+enter" breaks, and somehow it now works fine in my case.
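
For anyone who wants to try the same kind of cleanup, here is a rough Python sketch. It is not the exact procedure used above: which character a "ctrl+enter" break leaves in a file depends on the editor, so the sketch simply assumes it is one of the usual suspects (vertical tab, form feed, or a Unicode line/paragraph separator) and turns those into plain newlines. It also drops undecodable bytes and other control characters, and keeps ordinary non-ASCII text such as accented letters. The function name clean_text is made up for illustration.

```python
import unicodedata


def clean_text(raw: bytes) -> str:
    """Decode bytes as UTF-8 and remove characters that often trip up tokenizers."""
    # Bytes that are not valid UTF-8 become U+FFFD here and are dropped below.
    text = raw.decode("utf-8", errors="replace")

    # Normalise Windows (CRLF) and old Mac (CR) line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Assumption: "ctrl+enter" style breaks show up as one of these characters.
    for ch in ("\x0b", "\x0c", "\u2028", "\u2029"):
        text = text.replace(ch, "\n")

    kept = []
    for ch in text:
        if ch in ("\n", "\t"):
            kept.append(ch)  # keep ordinary newlines and tabs
        elif ch == "\ufffd":
            continue  # drop the replacement character left by bad bytes
        elif unicodedata.category(ch).startswith("C"):
            continue  # drop control, format, and unassigned code points
        else:
            kept.append(ch)
    return "".join(kept)
```

To use it, read the input document as bytes (for example with pathlib.Path(...).read_bytes()), pass the bytes through clean_text, write the result back as UTF-8, and then ingest the file again. Whether this removes the unknown token messages will depend on what is actually in the file.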

May 21 '23 16:05 gaurav-cointab

@gaurav-cointab What do you mean by ctrl+enter? I also cleared all the special characters from the input file, for example in the default union input file, but I am still getting the error. Please tell me what you did about the ctrl+enter breaks.

May 21 '23 17:05 satyamroy001