private-gpt
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py
Have you fixed it? I'm running into this bug too.
I'm getting the same error:
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'ť'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ł'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
I get the same error, but the query still gets a reply as it should.
same error
up
https://github.com/su77ungr/CASALIOY
I hard-forked the repository and switched to Qdrant vector storage. It runs locally as well and handles requests faster. This solved the issue for me
Sounds great! Would you open a PR, @su77ungr?
same error:
gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token '£' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ø' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö'
https://github.com/su77ungr/CASALIOY
I hard-forked the repository and switched to Qdrant vector storage. It runs locally as well and handles requests faster. This solved the issue for me
Tested it myself. It doesn't solve the "unknown token" warning, and the results are neither faster nor more accurate than with Chroma.
The error has to do with non-ASCII symbols being present in the original doc. There are definitely some of those in the test document used by this repo. But it is just a warning; it doesn't prevent the tool from working.
It's not possible to work with those characters using the default model; this has nothing to do with the vector storage. You have to use a different model. Qdrant itself does not fail on them, though (e.g. Chinese text).
Qdrant should be faster on the benchmark here; I went with it for the ease of implementation. I'm going to use a different retrieval algorithm too, since that's the bottleneck. And Qdrant will be way faster with a better implementation like this.
This led me to open my own implementation.
Me too
gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token '£' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ø' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö'
I got a similar result, but the tokens were unprintable. It will also fail with some Unicode characters.
I'll keep an eye on the improvements you pointed out, @su77ungr, and also on your fork. Thanks for sharing!!
https://github.com/su77ungr/CASALIOY
I hard-forked the repository and switched to Qdrant vector storage. It runs locally as well and handles requests faster. This solved the issue for me
What do you mean by "Qdrant vector storage"? Can you explain, please? I'm a newbie.
I think that MODEL_TYPE in .env does not match the actual model. I got this error when I was running a LlamaCpp model with MODEL_TYPE=GPT4All; it disappeared when I set MODEL_TYPE=LlamaCpp.
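For reference, a minimal sketch of what the relevant .env lines might look like for a LlamaCpp model (MODEL_PATH is assumed to be present in the project's example .env; the path below is only a placeholder and has to point at whatever LlamaCpp-compatible .bin file you actually downloaded):

MODEL_TYPE=LlamaCpp
MODEL_PATH=models/<your-llamacpp-compatible-model>.bin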
gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py
The "killed" in the last line means that, just like mine, your PC is a potato hehe, not enough memory
I think that MODEL_TYPE in .env does not match the actual model. I got this error when I was running a LlamaCpp model with MODEL_TYPE=GPT4All; it disappeared when I set MODEL_TYPE=LlamaCpp.
This fixed it for me
gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py
The "killed" in the last line means that, just like mine, your PC is a potato hehe, not enough memory
I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes about 1 second, but when I ask a question (on the suggested document), it just freezes my entire PC and the process gets killed (on Fedora 37).
gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py
The "killed" in the last line means that, just like mine, your PC is a potato hehe, not enough memory
I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes about 1 second, but when I ask a question (on the suggested document), it just freezes my entire PC and the process gets killed (on Fedora 37).
The way I understand it, the RAM usage goes brrrrrr, and when the limit is reached the process gets killed. Fixing the RAM usage should be the priority, rather than the unknown-token messages.
That's what I suspected too. I just tried running it with __NV_PRIME_RENDER_OFFLOAD=1 and __GLX_VENDOR_LIBRARY_NAME=nvidia (I don't even know if it's supposed to run on the GPU), and now it just freezes my PC until I kill it manually.
Use 'python privateGPT.py 2>/dev/null' to start privateGPT. By adding 2>/dev/null at the end of the command you'll suppress error messages (stderr, file descriptor 2). This is far from a fix, but it adds usability. It seems to work in Windows Git Bash as well :-)
The default SotU doc does have some non-ASCII chars. You can check pretty easily:
$ python -c "import chardet; print(chardet.detect(open('source_documents/state_of_the_union.txt', 'rb').read()))"
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
I suspect something in the processing chain (probably whatever is tokenizing the embeddings for the prompt) doesn't like non-ASCII UTF-8 tokens, which is far from optimal. It may well be making the construction of the prompt lossy, and useless if you're working with non-English content.
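To see exactly which non-ASCII characters are in there (and therefore which tokens trigger the warning), a one-liner along these lines should do it:
$ python -c "import collections; text = open('source_documents/state_of_the_union.txt', encoding='utf-8').read(); print(collections.Counter(ch for ch in text if ord(ch) > 127))"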
Running this fixed my issue.
find /path/to/folder -type f -name "*.txt" -exec sh -c 'iconv -f utf-8 -t utf-8 -c "{}" | sed -e "s/[^[:print:]]/?/g" -e "s/[Çç]/C/g" -e "s/[Ğğ]/G/g" -e "s/[İı]/I/g" -e "s/[Öö]/O/g" -e "s/[Şş]/S/g" -e "s/[Üü]/U/g" > "{}.tmp" && mv "{}.tmp" "{}"' \;
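If you'd rather not depend on iconv/sed, a rough Python equivalent is sketched below (just an illustration, not part of the repo; it rewrites the files in place, so back up source_documents/ first):

# strip_non_ascii.py (hypothetical helper): transliterate accented characters
# and drop any remaining non-ASCII characters from the .txt files before ingesting.
import pathlib
import unicodedata

for path in pathlib.Path("source_documents").glob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="ignore")
    # NFKD splits characters like 'Ç' into 'C' plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops whatever is left over.
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    path.write_text(ascii_text, encoding="utf-8")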