pdfGPT
May need a better way to tokenize characters...
Hello,
I recently encountered an issue while using your open source project. When I tried to use the project with Chinese characters, I received the following error message:
openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 5803 tokens (1707 in your prompt; 4096 for the completion). Please reduce your prompt; or completion length.
I believe this issue stems from a miscalculation of the token count for Chinese characters. The GPT models tokenize text differently depending on the language, and the algorithm may not be accurately counting tokens for Chinese text. This leads to an incorrect total token count and, in turn, the InvalidRequestError.
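For reference, here is a minimal sketch (not part of pdfGPT itself) of how the prompt could be measured and trimmed with the tiktoken library before calling the API. The model name, context size, completion budget, and helper names are assumptions for illustration only.

```python
import tiktoken

MODEL = "gpt-3.5-turbo"   # illustrative model choice
MAX_CONTEXT = 4097        # context window reported in the error above
COMPLETION_TOKENS = 512   # reserve a smaller completion budget than 4096

def count_tokens(text: str, model: str = MODEL) -> int:
    """Count tokens the way the API does; one Chinese character often maps
    to more than one token, so len(text) underestimates the real count."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def trim_prompt(text: str, budget: int = MAX_CONTEXT - COMPLETION_TOKENS) -> str:
    """Drop trailing tokens until the prompt fits within the remaining budget."""
    enc = tiktoken.encoding_for_model(MODEL)
    tokens = enc.encode(text)
    return enc.decode(tokens[:budget])

prompt = "请总结这份 PDF 的主要内容。"  # example Chinese prompt
print(count_tokens(prompt))            # typically larger than len(prompt)
```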
To better diagnose and resolve this issue, I kindly request that you look into the algorithm's handling of Chinese characters, specifically the tokenization step. Any guidance or potential fixes would be greatly appreciated.
Thank you for your time and effort in maintaining this project. I'm looking forward to your response.
Hi @terryops, that is interesting! Could you please provide a sample PDF or URL so that I can replicate the scenario?
one.pdf
Please see the attachment and try to replicate. Thanks!
Yes, I have confirmed that the Chinese-language embeddings are not working as of now. I am working on it.