How to get the output logits using InternVL-3.5?
I am currently trying to obtain the output logits from InternVL-3.5, but have been unsuccessful after numerous attempts. I couldn't find any relevant information in the documentation, and existing solutions for similar problems don't seem to apply to this model.
How can this issue be resolved? Is there a code example available that demonstrates how to retrieve the logits?
In my current attempts, I can get the logits correctly for single-image inference. However, issues arise when I try to process multiple images.
Could you please provide a minimal, runnable example that demonstrates the correct way to handle multi-image input?
Official demo:
# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
Thank you for your quick response and for providing the code snippet.
I apologize if my original question was unclear; there seems to have been a slight misunderstanding. Your example correctly demonstrates how to process multiple images as input, but my actual goal is to obtain the output logits for each generated token, not just the final decoded text response.
The model.chat() function in your example returns only the final string, but I need access to the raw per-step scores before each output token is chosen. Is there a way to modify this process, or perhaps use a different method, to get the logits for a multi-image input?
A minimal, runnable example demonstrating how to retrieve the logits would be extremely helpful. Thank you again for your time.
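For reference, here is a hedged sketch of one way to do this: skip model.chat() and call model.generate() directly with the standard Hugging Face options return_dict_in_generate=True and output_scores=True (or output_logits=True on newer transformers versions), which works the same for single- and multi-image input. The helper below mirrors the <image>-placeholder expansion that InternVL's chat() performs internally; the special token strings ('<img>', '</img>', '<IMG_CONTEXT>') and the model.num_image_token / model.img_context_token_id attributes are assumptions taken from InternVL's released modeling code, so please verify them against your installed version.

```python
IMG_START, IMG_END, IMG_CONTEXT = '<img>', '</img>', '<IMG_CONTEXT>'

def expand_image_tokens(query, num_patches_list, num_image_token):
    """Replace each '<image>' placeholder with per-patch context tokens,
    mirroring what model.chat() does internally before tokenization."""
    for num_patches in num_patches_list:
        image_tokens = IMG_START + IMG_CONTEXT * (num_image_token * num_patches) + IMG_END
        query = query.replace('<image>', image_tokens, 1)
    return query

# --- usage sketch (requires the loaded model; not run here) ---
# Assumes `question` already includes the conversation template text.
# query = expand_image_tokens(question, num_patches_list, model.num_image_token)
# inputs = tokenizer(query, return_tensors='pt').to(model.device)
# model.img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT)
# out = model.generate(
#     pixel_values=pixel_values,            # concatenated as in the demo
#     input_ids=inputs.input_ids,
#     attention_mask=inputs.attention_mask,
#     max_new_tokens=128,
#     return_dict_in_generate=True,
#     output_scores=True,    # per-step scores (logits processors applied)
#     output_logits=True,    # raw per-step logits (transformers >= 4.38)
# )
# # out.logits is a tuple of (batch, vocab) tensors, one per generated step:
# # logits = torch.stack(out.logits, dim=1)  # (batch, steps, vocab)
```

Note the distinction: out.scores reflects any logits processors (temperature, top-p, etc.) configured for generation, while out.logits gives the unprocessed model scores, which is usually what you want for analysis.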