
Unknown Token

Open williamsoo opened this issue 2 years ago • 40 comments

Hi,

I keep having this issue. Please advise.

gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'

williamsoo avatar May 21 '23 15:05 williamsoo

Same issue, using state_of_the_union.txt for the first time.

marutichintan avatar May 21 '23 15:05 marutichintan

Same issue as well. First time loading state_of_the_union.

Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285

Enter a query: What is NATO
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'

techno-yogi avatar May 21 '23 15:05 techno-yogi

Same issue

Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285

Enter a query: What is the purpose of the NATO Alliance?
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'

GuySarkinsky avatar May 21 '23 15:05 GuySarkinsky

Just come back to the houses 


Sophiaschepers avatar May 21 '23 15:05 Sophiaschepers

Same issue:

Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285

Enter a query: why was nato created
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
Killed

ksejan avatar May 21 '23 17:05 ksejan

Same. Grrrrrrr. And in the YouTube video tutorials they say, "Oh, you just need to download it, do this and that, and it's so simple." Yeah, right...

Python 3.10.6, in case you ask.

gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285

Enter a query: what kind if Hitler Person likes dogs?
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'ť'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ł'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'

panmaster avatar May 21 '23 17:05 panmaster

👋🏽 Having the same issue.

OptoCode avatar May 21 '23 17:05 OptoCode

same issue here

boboxuan avatar May 21 '23 19:05 boboxuan

same issue but for '�'

jnguyen1098 avatar May 21 '23 20:05 jnguyen1098

Same issue, but with '?'

codingbutstillalive avatar May 21 '23 20:05 codingbutstillalive

same !!

rachidje avatar May 21 '23 21:05 rachidje

Same issue here

Enter a query: What is state of the union? gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒' gpt_tokenize: unknown token '▒'

emmanero90 avatar May 21 '23 21:05 emmanero90

Same issue. My Windows 10 region is Israel, and the extra language and keyboard is Hebrew. Maybe it has to do with Unicode or non-ASCII letters being enabled in the OS, since all the letters reported as unknown are Hebrew letters, although there are no Hebrew letters in the path, folders, source documents, or the query I entered.

image

shaybc avatar May 21 '23 22:05 shaybc
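
The differing symbols reported across this thread ('Γ' 'Ç' 'Ö', 'Ô' 'Ç' 'Ö', Hebrew letters) would be consistent with a console code page effect rather than with such letters being present in the inputs. A minimal sketch, assuming the unknown tokens are the individual UTF-8 bytes of a typographic character such as the right single quotation mark (U+2019), each rendered by the console's legacy code page; the `as_console_sees` helper is hypothetical and not part of privateGPT:

```python
def as_console_sees(char: str, codepage: str) -> list[str]:
    """Decode each UTF-8 byte of `char` on its own with a legacy code page."""
    return [
        bytes([b]).decode(codepage, errors="replace")
        for b in char.encode("utf-8")
    ]

# Assumption: the tokenizer warns about one byte at a time, and the source
# text contains typographic punctuation such as U+2019.
apostrophe = "\u2019"  # UTF-8 bytes: E2 80 99

for codepage in ("cp437", "cp850", "cp862"):
    # cp437: US OEM, cp850: Western European OEM, cp862: Hebrew OEM
    print(codepage, as_console_sees(apostrophe, codepage))
```

Under cp437 the three bytes come out as 'Γ', 'Ç', 'Ö', and under cp850 as 'Ô', 'Ç', 'Ö', matching the logs posted above. On a Hebrew code page such as cp862, part of the same byte sequence would be expected to show up as Hebrew letters, which may explain the screenshot without any Hebrew text being involved.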

So apparently, if I give it enough time to run, it will issue a response (although not a very smart one).

image

Answering this simple question takes a very long time (about 7 minutes); each word of the answer takes about a second to print.

The next question also shows unknown letters (although fewer):

image

I am using:

  • AMD Ryzen 5 3400G with Radeon Vega Graphics
  • 64G

and most resources are available during the run.

image

image

shaybc avatar May 21 '23 23:05 shaybc

same issue

xieyx avatar May 22 '23 01:05 xieyx

gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�' gpt_tokenize: unknown token '�'

image

same issue

bravekingzhang avatar May 22 '23 07:05 bravekingzhang

Just come back to the houses

Hi @Sophiaschepers

Do you mean the path of the model in the .env file?

image

GuySarkinsky avatar May 22 '23 08:05 GuySarkinsky

Same issue. I have probably had every error in the book so far. Frustrating to hit a wall at this point.

Nemcade avatar May 22 '23 10:05 Nemcade

Yep, you guessed it. Same issue...

(my_env) PS C:\Users\rahim\OneDrive\Desktop\TEST\privateGPT> python privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285

Enter a query: What is this document about?
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'

rahiminabdulamin avatar May 22 '23 11:05 rahiminabdulamin

Same issue

python privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285

Enter a query: How are you
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'

gonzalo-dae-mn avatar May 22 '23 13:05 gonzalo-dae-mn

Same issue

C:\Users\Jack\Documents\privateGPT>python privategpt.py
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285

Enter a query: what is nato
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'

mu-sh avatar May 22 '23 13:05 mu-sh

gptj_model_load: loading model from '/root/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285

Enter a query: rtx
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
Killed

mikeyang01 avatar May 22 '23 13:05 mikeyang01

Patience is the key: these messages do not seem to affect the result as such. Still, it would be better to understand the root cause of these messages in order to resolve the problem. Any insight into their source would be helpful.

kickstart530 avatar May 22 '23 15:05 kickstart530

Can it be something that happens with Python 3.11 but not with Python 3.10?

Orrouk avatar May 22 '23 17:05 Orrouk

Can it be something that happens with Python 3.11 but not with Python 3.10?

I use 3.10.9 and have the same issue.

AsaTyr2018 avatar May 22 '23 18:05 AsaTyr2018

This is purely because of some \u unicode characters in the input documents provided. This is not to worry about at all.

gaurav-cointab avatar May 23 '23 05:05 gaurav-cointab
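
One way to check the claim above against a given input file is to list the non-ASCII characters it contains. A rough sketch, assuming the document is UTF-8 text; the `non_ascii_chars` and `describe` helpers and the sample string are illustrative only and not part of privateGPT:

```python
import unicodedata
from collections import Counter

def non_ascii_chars(text: str) -> Counter:
    """Count every character outside the ASCII range in `text`."""
    return Counter(ch for ch in text if ord(ch) > 127)

def describe(counts: Counter) -> list[str]:
    """Format each character as code point, Unicode name, UTF-8 length, count."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')} "
        f"({len(ch.encode('utf-8'))} UTF-8 bytes) x{n}"
        for ch, n in counts.most_common()
    ]

# Hypothetical sample with typographic quotes, standing in for a real document.
sample = "Let\u2019s begin \u201cagain\u201d"
for line in describe(non_ascii_chars(sample)):
    print(line)
```

To run it on a real file, read the text with `open(path, encoding="utf-8").read()` and pass it to `non_ascii_chars`. Whether the sample document shipped with the project actually contains such characters is not confirmed in this thread.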

This is purely because of some \u unicode characters in the input documents provided. This is not to worry about at all.

I understand where you are coming from. However, my source document is the original State of the Union text file.

williamsoo avatar May 23 '23 05:05 williamsoo

This is purely because of some \u unicode characters in the input documents provided. This is not to worry about at all.

I used the original document and, as a test, a plain-text document without formatting or "non-English characters" (1-9, a-z), and still got the same result: gpt_tokenize: unknown token all over the console. It looks like it formulates the answer this way. I found this:

gpt_tokenize: unknown token 'T'
gpt_tokenize: unknown token 'E'
gpt_tokenize: unknown token 'S'
gpt_tokenize: unknown token 'T'
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token 'T'
gpt_tokenize: unknown token 'E'
gpt_tokenize: unknown token 'S'
gpt_tokenize: unknown token 'T'

That was the query that I entered for testing. After this output, it crashes back to the terminal prompt.

AsaTyr2018 avatar May 23 '23 07:05 AsaTyr2018

Same issue. Has anyone found a workaround or a solution?

khanimranj avatar May 23 '23 07:05 khanimranj

Same issue. I am running on Windows 11 with Python 3.11.

gpt_tokenize: unknown token '?
gpt_tokenize: unknown token '€'

zeffon avatar May 23 '23 08:05 zeffon