private-gpt
gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py
Have you fixed it? I'm running into this bug too.
I'm getting the same error:
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'ť'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ł'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
I get the same error, but the query still gets a reply as it should.
same error
up
https://github.com/su77ungr/CASALIOY
I hard-forked the repository and switched to Qdrant vector storage. It runs locally as well and handles requests faster. This solved the issue for me
Sounds great! Would you open a PR, @su77ungr?
same error:
gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token '£' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ø' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö'
https://github.com/su77ungr/CASALIOY
I hard-forked the repository and switched to Qdrant vector storage. It runs locally as well and handles requests faster. This solved the issue for me
Tested it myself. It doesn't solve the "unknown token" warning, and the results are neither faster nor more accurate than with Chroma.
The error has to do with non-ASCII symbols being present in the original doc. There are definitely some of those in the test document used by this repo. But it is just a warning; it doesn't prevent the tool from working.
It's not possible to work with those characters using the default model; this has nothing to do with the vector storage. You have to use a different model. Qdrant itself does not fail on them, though (e.g. Chinese text).
Qdrant should be faster on the benchmark here; I went with it for the ease of implementation. I'm going to use a different retrieval algorithm too, since that's the bottleneck. And Qdrant will be way faster with a better implementation like this.
This led me to open my own implementation.
Me too
gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token '£' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ø' gpt_tokenize: unknown token 'Ô' gpt_tokenize: unknown token 'Ç' gpt_tokenize: unknown token 'Ö'
I got a similar result, but the tokens were unprintable. It will also fail with some Unicode characters.
I'll keep an eye on the improvements you pointed out, @su77ungr, and also on your fork. Thanks for sharing!!
https://github.com/su77ungr/CASALIOY
I hard-forked the repository and switched to Qdrant vector storage. It runs locally as well and handles requests faster. This solved the issue for me
What do you mean by "Qdrant vector storage"? Can you explain, please? I'm a newbie.
I think that MODEL_TYPE in .env does not match the actual model. I got this error when I was running a LlamaCpp model with MODEL_TYPE=GPT4All; it disappeared when I set MODEL_TYPE=LlamaCpp.
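For reference, a minimal sketch of what the relevant .env lines might look like for a LlamaCpp model (MODEL_PATH is assumed to be present in the project's example .env; the path below is only a placeholder and has to point at whatever LlamaCpp-compatible .bin file you actually downloaded):

MODEL_TYPE=LlamaCpp
MODEL_PATH=models/<your-llamacpp-compatible-model>.bin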
gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py
The "killed" in the last line means that, just like mine, your PC is a potato hehe, not enough memory
I think that MODEL_TYPE in .env does not match the actual model. I got this error when I was running a LlamaCpp model with MODEL_TYPE=GPT4All; it disappeared when I set MODEL_TYPE=LlamaCpp.
This fixed it for me
gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py
The "killed" in the last line means that, just like mine, your PC is a potato hehe, not enough memory
I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes about 1 second, but when I ask a question (on the suggested document), it just freezes my entire PC and the process gets killed (on Fedora 37).
gpt_tokenize: unknown token '?' gpt_tokenize: unknown token '?' [1] 32658 killed python3 privateGPT.py
The "killed" in the last line means that, just like mine, your PC is a potato hehe, not enough memory
I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes about 1 second, but when I ask a question (on the suggested document), it just freezes my entire PC and the process gets killed (on Fedora 37).
The way I understand it, the RAM usage goes brrrrrr, and when the limit is reached the process gets killed. Fixing the RAM usage should be the priority, rather than the unknown-token messages.
That's what I suspected too. I just tried running it with __NV_PRIME_RENDER_OFFLOAD=1 and __GLX_VENDOR_LIBRARY_NAME=nvidia (I don't even know if it's supposed to run on the GPU), and now it just freezes my PC until I kill it manually.
Use 'python privateGPT.py 2>/dev/null' to start privateGPT. By adding 2>/dev/null at the end of the command you'll suppress error messages (stderr, file descriptor 2). This is far from a fix, but it adds usability. It seems to work in Windows Git Bash as well :-)
The default SotU doc does have some non-ASCII chars. You can check pretty easily:
$ python -c "import chardet; print(chardet.detect(open('source_documents/state_of_the_union.txt', 'rb').read()))"
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
I suspect something in the processing chain (probably whatever is tokenizing the embeddings for the prompt) doesn't like non-ASCII UTF-8 tokens, which is far from optimal. It may well be making the construction of the prompt lossy, and useless if you're working with non-English content.
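To see exactly which non-ASCII characters are in there (and therefore which tokens trigger the warning), a one-liner along these lines should do it:
$ python -c "import collections; text = open('source_documents/state_of_the_union.txt', encoding='utf-8').read(); print(collections.Counter(ch for ch in text if ord(ch) > 127))"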
Running this fixed my issue.
find /path/to/folder -type f -name "*.txt" -exec sh -c 'iconv -f utf-8 -t utf-8 -c "{}" | sed -e "s/[^[:print:]]/?/g" -e "s/[Çç]/C/g" -e "s/[Ğğ]/G/g" -e "s/[İı]/I/g" -e "s/[Öö]/O/g" -e "s/[Şş]/S/g" -e "s/[Üü]/U/g" > "{}.tmp" && mv "{}.tmp" "{}"' \;
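If you'd rather not depend on iconv/sed, a rough Python equivalent is sketched below (just an illustration, not part of the repo; it rewrites the files in place, so back up source_documents/ first):

# strip_non_ascii.py (hypothetical helper): transliterate accented characters
# and drop any remaining non-ASCII characters from the .txt files before ingesting.
import pathlib
import unicodedata

for path in pathlib.Path("source_documents").glob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="ignore")
    # NFKD splits characters like 'Ç' into 'C' plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops whatever is left over.
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    path.write_text(ascii_text, encoding="utf-8")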