
The Pixtral vision notebook fails during inference

Open mmathew23 opened this issue 7 months ago • 4 comments

The Pixtral vision notebook currently fails during inference with an unused kwarg, `token_type_ids`.


This also fixes #2392. Initially there was code to pop `token_type_ids` for VLMs, but it broke Gemma3, so it was commented out; now the same kwarg breaks Pixtral. Instead of checking directly for the Gemma3 VLM architecture, we can inspect the forward method's signature for valid kwargs and pop `token_type_ids` when the parameter doesn't exist.
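A minimal sketch of the signature-inspection idea described above, using Python's `inspect` module. The function and dummy forward names here are illustrative, not the actual Unsloth code:

```python
import inspect

def pop_unsupported_token_type_ids(forward_fn, kwargs):
    # Illustrative helper: drop token_type_ids only when the model's
    # forward signature does not declare it as a parameter.
    params = inspect.signature(forward_fn).parameters
    if "token_type_ids" not in params:
        kwargs.pop("token_type_ids", None)
    return kwargs

# Dummy forwards standing in for the two cases discussed:
def pixtral_like_forward(input_ids, pixel_values=None, attention_mask=None):
    pass  # no token_type_ids parameter

def gemma3_like_forward(input_ids, token_type_ids=None, pixel_values=None):
    pass  # declares token_type_ids (needed for bidirectional attention)

kwargs = {"input_ids": [1, 2], "token_type_ids": [0, 0]}
filtered_pixtral = pop_unsupported_token_type_ids(pixtral_like_forward, dict(kwargs))
filtered_gemma = pop_unsupported_token_type_ids(gemma3_like_forward, dict(kwargs))
```

With this approach, Pixtral-style models no longer receive the unused kwarg, while Gemma3-style models keep it.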

I also tested the Pixtral notebook and inference now works. https://colab.research.google.com/drive/137uUFLRKzoZ5S2ao95eSWUghgFV6OAb1?usp=sharing

mmathew23 avatar May 03 '25 04:05 mmathew23

@mmathew23 So Gemma 3 is OK as well with this change?

danielhanchen avatar May 04 '25 10:05 danielhanchen

The main issue is that Gemma 3 requires `token_type_ids`, since it's used for bidirectional attention.

danielhanchen avatar May 04 '25 10:05 danielhanchen

@danielhanchen Yes. Gemma3 Vision works on an L4. T4 has some issues but I'm going to fix that in a separate PR.

The assumption I'm making is that `token_type_ids` will be explicitly named as a parameter in `forward` rather than implicitly accepted inside a `**kwargs` dict. I think this is a fair assumption, and it would help alleviate future issues like this. If there's a concern that the explicitness could change, I can handle that case too. Let me know.
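If the explicitness assumption ever needs relaxing, one way to handle it is to also check for a `**kwargs` catch-all in the signature and conservatively keep `token_type_ids` in that case. A hedged sketch (names are illustrative, not the actual implementation):

```python
import inspect

def forward_accepts_token_type_ids(forward_fn):
    # Illustrative check: explicit parameter, or a **kwargs catch-all that
    # might consume token_type_ids further downstream.
    params = inspect.signature(forward_fn).parameters
    if "token_type_ids" in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())

# Dummy forwards covering the three signature shapes:
def explicit_forward(input_ids, token_type_ids=None):
    pass

def catchall_forward(input_ids, **kwargs):
    pass

def plain_forward(input_ids, attention_mask=None):
    pass
```

With `catchall_forward`, we can't prove `token_type_ids` is unused, so it is treated as accepted; only `plain_forward`-style signatures would have the kwarg popped.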

Here's a Gemma3 vision notebook running my fix for reference.

https://colab.research.google.com/drive/1xDkkIoF_KmGZFiHPhr-d3kKu6jF4vMFn?usp=sharing

It doesn't outright fail if you don't pass `token_type_ids`, so I turn the temperature down and run two generations for the same input. The second generation explicitly removes `token_type_ids`, and we can see the output is consistently different, and better, when `token_type_ids` is present.

mmathew23 avatar May 04 '25 21:05 mmathew23

Oh and I ran a copy of the same notebook, without my fix, and the output matches exactly.

https://colab.research.google.com/drive/1-GK5yAaTR1lki9LC0KwaH0kYKSbVAJp8?usp=sharing

mmathew23 avatar May 04 '25 22:05 mmathew23

After running the demo Gemma3 vision finetuning notebook, it seems the vision-finetuned Gemma3 doesn't describe figures very accurately. Is there a good method to improve its accuracy?

Psypeal avatar May 19 '25 14:05 Psypeal