InternVL How was InternVL2-26B evaluated?

Hi, I'm confused. I tried visual question answering with the InternVL2-26B model and it performed very badly. The only models that answer my question correctly are Gemini 1.5 Pro/Flash, GPT-4o, and Claude.

So how was InternVL2-26B evaluated, such that it is reported to outperform GPT-4V and Gemini 1.5?

Jul 06 '24 05:07 Iven2132

You can refer to our documentation for model evaluation: https://github.com/OpenGVLab/InternVL/blob/main/README.md#documents. Also, if possible, could you provide some examples of errors?

Jul 07 '24 09:07 ErfeiCui

I'm also getting poor performance here. As an example, when prompted with

Analyze the voting results table image and return a JSON object with this structure: {\"candidates\": [{\"name\": \"Candidate Name\", \"votes\": [{\"polling_station\": number, \"votes\": number}, ...]},...]} Extract votes for each polling station for all candidates. The table may be rotated; use the polling station numbers to determine the correct order of votes.

on the following image,

it returns made-up numbers and candidate names.

I'm running with

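import torch

# load_image is assumed to be the preprocessing helper from the InternVL example code
pixel_values = load_image('test.png').to(torch.bfloat16).cuda()

# Greedy decoding: no beam search, no sampling
generation_config = dict(
    num_beams=1,
    max_new_tokens=8096,
    do_sample=False,
)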
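For anyone reproducing this, a minimal sketch of a check on the reply might help separate formatting failures from wrong content. It only verifies that the reply has the JSON shape requested in the prompt above; it cannot tell whether the names or numbers are correct. The function name `validate_votes_json` is hypothetical and not part of InternVL.

```python
import json


def validate_votes_json(text):
    """Return a list of problems with the reply's shape; an empty list means it matches."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    candidates = data.get("candidates") if isinstance(data, dict) else None
    if not isinstance(candidates, list):
        return ["missing top-level 'candidates' list"]

    def is_int(value):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, int) and not isinstance(value, bool)

    problems = []
    for i, cand in enumerate(candidates):
        if not isinstance(cand, dict) or not isinstance(cand.get("name"), str):
            problems.append(f"candidate {i}: missing string 'name'")
            continue
        votes = cand.get("votes")
        if not isinstance(votes, list):
            problems.append(f"candidate {i}: missing 'votes' list")
            continue
        for j, entry in enumerate(votes):
            ok = isinstance(entry, dict) and all(
                is_int(entry.get(key)) for key in ("polling_station", "votes")
            )
            if not ok:
                problems.append(f"candidate {i}, entry {j}: expected integer 'polling_station' and 'votes'")
    return problems
```

Passing the model's response string to `validate_votes_json` gives an empty list when the structure is right, so any remaining errors are in the extracted values themselves and still need to be compared against the image by hand.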

Jul 09 '24 19:07 noahdasanaike

InternVL's support for OCR in languages other than English and Chinese is not very good. Improving it requires additional SFT (supervised fine-tuning) data.

Aug 18 '24 08:08 whai362