
gpt_tokenize: unknown token '?'

Open anonimo28 opened this issue 1 year ago • 24 comments

gpt_tokenize: unknown token '?' (repeated 18 times)
[1] 32658 killed python3 privateGPT.py

anonimo28 avatar May 09 '23 00:05 anonimo28

Have you fixed it? I'm hitting this bug too.

moneymouse avatar May 09 '23 07:05 moneymouse

I'm getting the same error:

gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'ť'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ł'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'

bbscout avatar May 09 '23 08:05 bbscout

I get the same error, but the query still returns a reply as it should.

x4g4p3x avatar May 09 '23 11:05 x4g4p3x

same error

nssiwi avatar May 09 '23 14:05 nssiwi

up

kamuridesu avatar May 09 '23 14:05 kamuridesu

https://github.com/su77ungr/CASALIOY

I hard-forked the repository and switched to Qdrant for vector storage. It runs locally as well and serves requests faster. This solved the issue for me.
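
For anyone wondering what the switch involves, here's a minimal sketch using LangChain's Qdrant wrapper (illustrative only; the exact keyword arguments vary across langchain/qdrant-client versions):

# Hypothetical sketch: Qdrant in place of Chroma as the vector store
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Qdrant

docs = [Document(page_content="example chunk")]  # stand-in for the chunks ingest.py produces
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

db = Qdrant.from_documents(
    docs,
    embeddings,
    path="./qdrant_db",            # local on-disk mode, no server required
    collection_name="private_gpt", # hypothetical collection name
)
retriever = db.as_retriever()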

su77ungr avatar May 10 '23 06:05 su77ungr

Sounds great! Would you open a PR, @su77ungr?

imartinez avatar May 10 '23 06:05 imartinez

same error:

gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
(the three lines above repeat, with an occasional '£' or 'Ø' in place of 'Ö')

lsotillos avatar May 10 '23 12:05 lsotillos

https://github.com/su77ungr/CASALIOY

I hard-forked the repository and switched to Qdrant for vector storage. It runs locally as well and serves requests faster. This solved the issue for me.

Tested it myself. It doesn't solve the "unknown token" warning, and the result is neither faster nor more accurate than using Chroma.

imartinez avatar May 10 '23 17:05 imartinez

The error has to do with symbols present in the original doc; there are definitely some of those in the test document used by this repo. But it is just a warning; it doesn't prevent the tool from working.

imartinez avatar May 10 '23 18:05 imartinez

It's not possible to handle those characters with the default model; this has nothing to do with the vector storage. You have to use a different model (sketch below). But Qdrant does not fail on them, e.g. Chinese text.

Qdrant should be faster on the benchmark here. I chose it for ease of implementation. I'm also going to use a different retrieval algorithm; that's the bottleneck. And Qdrant will be way faster with a better implementation like this.

This led me to open my own implementation.
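
A minimal sketch of loading a different model through LangChain's llama.cpp binding (the model path is illustrative, not a recommendation):

from langchain.llms import LlamaCpp

# Hypothetical path; point at whatever ggml model you actually downloaded
llm = LlamaCpp(model_path="models/ggml-model-q4_0.bin", n_ctx=1000)
print(llm("Hello"))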

su77ungr avatar May 10 '23 18:05 su77ungr

Me too

gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
(the three lines above repeat, with an occasional '£' or 'Ø' in place of 'Ö')

assuredclean avatar May 10 '23 22:05 assuredclean

I got a similar result, but the tokens were unprintable. It also fails with some Unicode characters.

dennydream avatar May 11 '23 14:05 dennydream

I'll keep an eye on the improvements you pointed out, @su77ungr, and also on your fork. Thanks for sharing!!

imartinez avatar May 11 '23 17:05 imartinez

https://github.com/su77ungr/CASALIOY

I hard-forked the repository and switched to Qdrant for vector storage. It runs locally as well and serves requests faster. This solved the issue for me.

What do you mean by "Qdrant vector storage"? Can you explain, please? I'm a newbie.

Amarbo avatar May 12 '23 17:05 Amarbo

I think MODEL_TYPE in .env does not match the actual model. I got this error when I was running a LlamaCpp model with MODEL_TYPE=GPT4All; it disappeared when I set MODEL_TYPE=LlamaCpp.
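
For example, with a llama.cpp model the relevant lines of .env would look something like this (the path is illustrative; variable names as in the repo's example.env):

# MODEL_TYPE must match the format of the model file you downloaded
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/ggml-model-q4_0.bin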

tk42 avatar May 13 '23 23:05 tk42

gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py

the last line "killed" meaning just like mine you have a potato pc hehe, not enough memory

GitEin11 avatar May 14 '23 02:05 GitEin11

I think MODEL_TYPE in .env does not match the actual model. I got this error when I was running a LlamaCpp model with MODEL_TYPE=GPT4All; it disappeared when I set MODEL_TYPE=LlamaCpp.

This fixed it for me

mabry1985 avatar May 14 '23 06:05 mabry1985

gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py

the last line "killed" meaning just like mine you have a potato pc hehe, not enough memory

I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes about 1 second, but when I ask a question (about the suggested document), it freezes my entire PC and the process gets killed (on Fedora 37).

JMans15 avatar May 17 '23 16:05 JMans15

gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py

the last line "killed" meaning just like mine you have a potato pc hehe, not enough memory

I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes about 1 second, but when I ask a question (about the suggested document), it freezes my entire PC and the process gets killed (on Fedora 37).

The way I understand it, the RAM usage goes brrrrrr, and when it reaches the limit the process gets killed. This RAM issue should be prioritized over the unknown-token message.
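
To confirm it's memory pressure rather than the tokenizer warnings, a quick check with psutil (assuming it's installed; not part of privateGPT itself):

import psutil

# Report how much RAM is free right before/while loading the model
mem = psutil.virtual_memory()
print(f"available: {mem.available / 2**30:.1f} GiB of {mem.total / 2**30:.1f} GiB")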

GitEin11 avatar May 17 '23 16:05 GitEin11

That's what I suspected too. I just tried running it with __NV_PRIME_RENDER_OFFLOAD=1 and __GLX_VENDOR_LIBRARY_NAME=nvidia (I don't even know if it's supposed to run on the GPU), and now it just freezes my PC until I kill it manually.

JMans15 avatar May 17 '23 16:05 JMans15

Use 'python privateGPT.py 2>/dev/null' to start privateGPT. Adding 2>/dev/null at the end of the command suppresses error messages (stderr is file descriptor 2). This is far from a fix, but it adds usability. Seems to work in Windows Git Bash as well :-)

late7 avatar May 18 '23 10:05 late7

The default SotU doc does have some non-ASCII chars. You can check pretty easily:

$ python -c "import chardet; print(chardet.detect(open('source_documents/state_of_the_union.txt', 'rb').read()))"
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

I suspect something in the processing chain (probably whatever is tokenizing the retrieved text for the prompt) doesn't like non-ASCII UTF-8 tokens, which is far from optimal. It may well make the prompt construction lossy, and useless if you're working with non-English content.
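
To see exactly which characters are involved, a quick scan (a hypothetical helper, not part of the repo):

from collections import Counter

# Tally every character outside the ASCII range in the default document
with open("source_documents/state_of_the_union.txt", encoding="utf-8") as f:
    text = f.read()

non_ascii = Counter(ch for ch in text if ord(ch) > 127)
for ch, count in non_ascii.most_common():
    print(f"U+{ord(ch):04X} {ch!r} x{count}")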

uogbuji avatar May 23 '23 18:05 uogbuji

Running this fixed my issue. It replaces non-printable characters with '?' and transliterates common Turkish characters to ASCII:

find /path/to/folder -type f -name "*.txt" -exec sh -c 'iconv -f utf-8 -t utf-8 -c "$1" | sed -e "s/[^[:print:]]/?/g" -e "s/[Çç]/C/g" -e "s/[Ğğ]/G/g" -e "s/[İı]/I/g" -e "s/[Öö]/O/g" -e "s/[Şş]/S/g" -e "s/[Üü]/U/g" > "$1.tmp" && mv "$1.tmp" "$1"' sh {} \;

veyselyenilmez avatar Jun 01 '23 13:06 veyselyenilmez