
May need a better way to tokenize characters...

terryops opened this issue 1 year ago · 3 comments

Hello,

I recently encountered an issue while using your open source project. When I tried to use the project with Chinese characters, I received the following error message:

openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 5803 tokens (1707 in your prompt; 4096 for the completion). Please reduce your prompt; or completion length.

I believe this is caused by a miscalculation of the token count for Chinese characters. The GPT tokenizer splits text differently depending on the language (a single Chinese character often maps to two or three tokens), so an estimate based on character or word counts will undershoot the real count for Chinese text. That incorrect total then triggers the InvalidRequestError.
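For reference, the prompt's real token count can be measured with OpenAI's tiktoken library rather than estimated from character length. The sketch below is only illustrative (the completion_budget helper is mine, and text-davinci-003 is an assumption inferred from the 4097-token limit in the error): its result would be passed as max_tokens instead of a fixed 4096.

```python
# Minimal sketch (not pdfGPT's actual code): measure the prompt exactly with
# tiktoken, then derive how many completion tokens still fit in the context.
# Assumes text-davinci-003, whose 4097-token window matches the error above.
import tiktoken

CONTEXT_LIMIT = 4097  # prompt tokens + completion tokens must fit in here

def completion_budget(prompt: str, model: str = "text-davinci-003") -> int:
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = len(enc.encode(prompt))  # exact count, correct for CJK text
    return max(0, CONTEXT_LIMIT - prompt_tokens)

# A Chinese character is often 2-3 tokens, so len(text) undercounts badly.
print(completion_budget("请总结这份文件的主要内容。"))
```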

To better diagnose and resolve this issue, I kindly ask that you look into the algorithm's handling of Chinese characters, specifically in the tokenization step. Any guidance or potential fixes would be greatly appreciated.

Thank you for your time and effort in maintaining this project. I'm looking forward to your response.

terryops avatar Apr 26 '23 11:04 terryops

Hi @terryops, that is interesting! Could you please provide the sample PDF or a URL so that I can reproduce the scenario?

bhaskatripathi avatar Apr 26 '23 11:04 bhaskatripathi

one.pdf

Please see the attachment and try to replicate. Thanks!

terryops avatar Apr 26 '23 14:04 terryops

Yes, I have confirmed that Chinese-language embeddings are not working at the moment. I am working on it.

bhaskatripathi avatar Apr 27 '23 16:04 bhaskatripathi
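One common remedy, sketched below, is to split text by token count rather than by characters or words before embedding, so that CJK input stays under the model limit. The chunk_by_tokens helper and the cl100k_base encoding are illustrative assumptions, not pdfGPT's actual implementation.

```python
# Sketch of token-aware chunking (illustrative, not pdfGPT's code): split
# text into pieces of at most max_tokens tokens, so CJK input (where one
# character can be several tokens) never exceeds the embedding model's limit.
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 500,
                    encoding_name: str = "cl100k_base") -> list[str]:
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# Each chunk can then be embedded safely regardless of language.
chunks = chunk_by_tokens("这是一个很长的中文文档。" * 200, max_tokens=50)
print(len(chunks), chunks[0])
```

Note that cutting at an arbitrary token boundary can split a multi-byte character (tiktoken substitutes a replacement character when decoding such a cut), so in practice the split points would be snapped to sentence boundaries.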