[Bug] OCR-related tasks return Unicode escape sequences instead of the actual characters
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
When the input image contains non-ASCII characters (especially Chinese characters or Japanese hiragana, katakana, and kanji), the model returns Unicode escape sequences (\uXXXX) instead of the characters themselves.
For example:
The image features two characters, both female, with long hair and holding assault rifles.
The character on the left has orange hair and is wearing a military uniform with a skirt, thigh holster, and brown ankle boots.
She has a necklace and a bracelet, and is standing on one leg with her mouth open, looking at the viewer.
The character on the right has brown hair and is wearing a military uniform with a belt, brown knee-high boots, and fingerless gloves. She is also holding an assault rifle and has a pouch attached to her thigh.
Both characters have a simple white background.", "response_text": "The image contains text.
On the left side, there is text that reads "IMI GALIL" in large, bold letters, followed by "\u52a0\u5229\u5c14" in smaller text.
Below this, there is a section titled "BIOGRAPHY \u4eba\u7269\u7b80\u4ecb" with a rating scale and a description in Chinese characters.
The text is in a simple, clean font and is located in the upper left corner of the image.
On the right side, there is text that reads "SIG-510" in large, bold letters, followed by "\u7a81\u51fb\u6b65\u67aa" in smaller text.
Below this, there is a section titled "BIOGRAPHY \u4eba\u7269\u7b80\u4ecb" with a rating scale and a description in Chinese characters.
The text is in a simple, clean font and is located in the upper right corner of the image.
All of the escape sequences that appear in the text are correct and decode to meaningful characters:
\u52a0\u5229\u5c14 加利尔
\u4eba\u7269\u7b80\u4ecb 人物简介
\u7a81\u51fb\u6b65\u67aa 突击步枪
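The mapping above can be checked with a short Python sketch. It assumes the escapes are JSON-style \uXXXX sequences and that the text contains no unescaped double quotes or control characters, so wrapping it in quotes and parsing it as a JSON string literal is safe. The helper name decode_escapes is hypothetical and used only for illustration.

```python
import json

# Escape sequences copied from the model output quoted above.
escaped = [
    "\\u52a0\\u5229\\u5c14",
    "\\u4eba\\u7269\\u7b80\\u4ecb",
    "\\u7a81\\u51fb\\u6b65\\u67aa",
]


def decode_escapes(text):
    # Hypothetical helper: wrap the text in quotes and let the JSON parser
    # resolve the escape sequences. Assumes no raw quotes or control
    # characters in the input.
    return json.loads('"' + text + '"')


decoded = [decode_escapes(s) for s in escaped]
for raw, chars in zip(escaped, decoded):
    print(raw, "->", chars)
```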
Reproduction
Please see the description above. The provided image should be enough to reproduce this bug.
Environment
Not related to the bug itself.
Error traceback
No response
The output is captured directly after the tokenizer decodes it, so I wonder whether this is related to the tokenizer implementation.
I also found that the version on your official website does not have this issue: https://internvl.opengvlab.com/
If possible, please provide more code details and specify the model versions used.
Could you please confirm if you are saving the model's output in a file, such as in JSON format?
For example:
json.dumps(line, ensure_ascii=False)
You need to set ensure_ascii=False to prevent Chinese characters from being written as \uXXXX escape sequences.
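To illustrate the difference, here is a minimal sketch using only the standard json module. The record dict is made-up sample data that reuses a string from this issue. It is not the reporter's actual output.

```python
import json

# Made-up sample record; the text is one of the strings from this issue.
record = {"response_text": "加利尔"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps non-ASCII characters

print(escaped)
print(readable)

# Both serializations parse back to the same data, so the escaped form is
# still valid JSON; it is only harder for a person to read.
round_trip_ok = json.loads(escaped) == json.loads(readable) == record
print(round_trip_ok)
```

When writing to a file with ensure_ascii=False, the file should also be opened with an encoding that can represent the characters, such as encoding='utf-8'.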
Here is the code we use:
question = '<image>\nIf any texts are found, describe the text with its location and style. If not, answer "No text found."'
response = model.chat(tokenizer, pixel_values, question, generation_config, history=history)
if verbose:
    print(f'User: {question}\nAssistant: {response}')
collect_responses["response_text"] = str(response)
end = time_ns()
if verbose:
    print(f"Time: {(end - t) / 1e9} s")
with open(result_filename, 'w', encoding='utf-8') as f:
    json.dump(collect_responses, f, ensure_ascii=False, indent=4)
Could you confirm whether the Chinese characters saved by this code appear as \uXXXX escape sequences? If so, that's quite odd. Also, does the print(f'User: {question}\nAssistant: {response}') statement above show the Chinese characters as escape sequences too?
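One way to answer both questions is to check whether the in-memory string already contains a literal backslash followed by "u" before it is ever serialized. The sketch below is only a diagnostic under that assumption. The helper find_literal_escapes and the two sample strings are hypothetical, standing in for the real response value.

```python
import re

# Matches a literal backslash, the letter u, and four hex digits.
ESCAPE_RE = re.compile(r"\\u[0-9a-fA-F]{4}")


def find_literal_escapes(text):
    # Hypothetical helper: list every literal \uXXXX sequence in the string.
    return ESCAPE_RE.findall(text)


# Stand-ins for the model response: one with real characters, one where the
# escapes are already present as plain ASCII text.
real = "BIOGRAPHY 人物简介"
literal = "BIOGRAPHY \\u4eba\\u7269\\u7b80\\u4ecb"

for sample in (real, literal):
    # repr() shows doubled backslashes when the escapes are literal text.
    print(repr(sample), find_literal_escapes(sample))
```

If the check finds matches in the response itself, the escapes were produced before json.dump ran, and ensure_ascii is not the cause.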
Hi, since there hasn't been any recent activity on this issue, I'll be closing it for now. If it's still an active concern, don't hesitate to reopen it. Thanks for your understanding!