[Bug] OCR-related tasks return Unicode escape sequences (\uxxxx) instead of the actual characters

Open KohakuBlueleaf opened this issue 1 year ago • 5 comments

Checklist

  • [X] 1. I have searched related issues but cannot get the expected help.
  • [X] 2. The bug has not been fixed in the latest version.
  • [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

When the input image contains non-ASCII characters (especially Chinese characters or Japanese hiragana, katakana, and kanji), the model returns Unicode escape sequences (\uxxxx) instead of the characters themselves.

For example:

The image features two characters, both female, with long hair and holding assault rifles. 
The character on the left has orange hair and is wearing a military uniform with a skirt, thigh holster, and brown ankle boots. 
She has a necklace and a bracelet, and is standing on one leg with her mouth open, looking at the viewer. 
The character on the right has brown hair and is wearing a military uniform with a belt, brown knee-high boots, and fingerless gloves. She is also holding an assault rifle and has a pouch attached to her thigh. 
Both characters have a simple white background.", "response_text": "The image contains text. 
On the left side, there is text that reads "IMI GALIL" in large, bold letters, followed by "\u52a0\u5229\u5c14" in smaller text. 
Below this, there is a section titled "BIOGRAPHY \u4eba\u7269\u7b80\u4ecb" with a rating scale and a description in Chinese characters. 
The text is in a simple, clean font and is located in the upper left corner of the image.

On the right side, there is text that reads "SIG-510" in large, bold letters, followed by "\u7a81\u51fb\u6b65\u67aa" in smaller text. 
Below this, there is a section titled "BIOGRAPHY \u4eba\u7269\u7b80\u4ecb" with a rating scale and a description in Chinese characters. 
The text is in a simple, clean font and is located in the upper right corner of the image.

image

All of the Unicode escape sequences in the text are actually correct and decode to meaningful characters:

\u52a0\u5229\u5c14 加利尔
\u4eba\u7269\u7b80\u4ecb 人物简介
\u7a81\u51fb\u6b65\u67aa 突击步枪
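
The mapping above can be checked with a short Python sketch. It assumes the escapes appear in the output as literal backslash-u text, as in the quoted response, and decodes them with the standard-library json module. The variable names are illustrative only.

```python
import json

# Literal escape text as it appears in the model output.
# Raw strings keep the backslashes instead of letting Python decode them.
escaped = [
    r"\u52a0\u5229\u5c14",
    r"\u4eba\u7269\u7b80\u4ecb",
    r"\u7a81\u51fb\u6b65\u67aa",
]

# Wrapping each item in double quotes turns it into a JSON string literal,
# which json.loads decodes into the real characters.
decoded = [json.loads(f'"{s}"') for s in escaped]

for raw, text in zip(escaped, decoded):
    print(raw, "->", text)
```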

Reproduction

Please see the description above. The provided image should be enough to reproduce this bug.

Environment

Not related to the bug itself.

Error traceback

No response

KohakuBlueleaf avatar Aug 30 '24 15:08 KohakuBlueleaf

The output is captured directly after the tokenizer decodes it, so I wonder whether this is related to the tokenizer implementation.

I also found that the version on your official website doesn't have this issue: https://internvl.opengvlab.com/

KohakuBlueleaf avatar Aug 30 '24 15:08 KohakuBlueleaf

If possible, please provide more code details and specify the model versions used.

ErfeiCui avatar Sep 02 '24 11:09 ErfeiCui

Could you please confirm if you are saving the model's output in a file, such as in JSON format?

For example:

json.dumps(line, ensure_ascii=False)

You need to set ensure_ascii=False to prevent Chinese characters from being written as \uXXXX escape sequences.
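
As a minimal sketch of the difference, the snippet below serializes the same dictionary with and without ensure_ascii=False. The dictionary contents are a made-up sample borrowed from the text in this issue, not actual model output.

```python
import json

# Hypothetical sample record; the key mirrors the one used in this issue.
line = {"response_text": "BIOGRAPHY 人物简介"}

escaped = json.dumps(line)                       # default is ensure_ascii=True
readable = json.dumps(line, ensure_ascii=False)  # keeps non-ASCII characters as-is

print(escaped)   # non-ASCII characters appear as \uXXXX escapes
print(readable)  # non-ASCII characters appear directly

# Both forms are valid JSON and load back to the same data.
assert json.loads(escaped) == json.loads(readable) == line
```

Either form round-trips through json.loads, so the escapes only matter when the file or string is read as plain text.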

czczup avatar Sep 06 '24 13:09 czczup

Here is the code we use:

    question = '<image>\nIf any texts are found, describe the text with its location and style. If not, answer "No text found."'
    response = model.chat(tokenizer, pixel_values, question, generation_config, history=history)
    if verbose:
        print(f'User: {question}\nAssistant: {response}')
    collect_responses["response_text"] = str(response)

    end = time_ns()
    if verbose:
        print(f"Time: {(end - t) / 1e9} s")
    with open(result_filename, 'w', encoding='utf-8') as f:
        json.dump(collect_responses, f, ensure_ascii=False, indent=4)

KohakuBlueleaf avatar Sep 06 '24 13:09 KohakuBlueleaf

Could you confirm whether the Chinese characters saved by this code appear as \uXXXX escape sequences? If so, that's quite odd. Also, in the print(f'User: {question}\nAssistant: {response}') statement above, are the Chinese characters printed as escape sequences?
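
One way to answer this is to test the returned string itself before anything is saved. The sketch below is only an illustration: has_literal_escapes and unescape are hypothetical helper names, not part of InternVL, and unescape handles single \uXXXX escapes only (it does not join surrogate pairs).

```python
import re

# Matches a literal backslash, the letter u, and four hex digits.
ESCAPE_RE = re.compile(r"\\u[0-9a-fA-F]{4}")

def has_literal_escapes(text: str) -> bool:
    # True when the string contains backslash-u escape text
    # rather than the characters themselves.
    return ESCAPE_RE.search(text) is not None

def unescape(text: str) -> str:
    # Replace each literal escape with the character it names.
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(0)[2:], 16)), text)

# Sample strings standing in for a model response.
real = "BIOGRAPHY 人物简介"
literal = r"BIOGRAPHY \u4eba\u7269\u7b80\u4ecb"

print(has_literal_escapes(real))     # False
print(has_literal_escapes(literal))  # True
print(unescape(literal) == real)     # True
```

If has_literal_escapes(response) is True for the value returned by model.chat, the escapes are already in the generated text. If it is False, they are most likely introduced later, when the result is serialized or displayed.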

czczup avatar Sep 06 '24 14:09 czczup

Hi, since there hasn't been any recent activity on this issue, I'll be closing it for now. If it's still an active concern, don't hesitate to reopen it. Thanks for your understanding!

czczup avatar Dec 09 '24 11:12 czczup