unsloth
Pixtral vision notebook fails during inference
The Pixtral vision notebook currently fails during inference with an unused kwargs error for token_type_ids.
This PR also fixes #2392. Initially there was code to pop token_type_ids for VLMs, but this failed for Gemma3, so it was commented out. With that code commented out, Pixtral now fails. Instead of checking directly for the Gemma3 VLM architecture, we can inspect the forward method for valid kwargs and pop token_type_ids if forward does not accept it.
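A minimal sketch of the idea, not the actual code in this PR: the helper name filter_token_type_ids and the two stand-in model classes are hypothetical, and the only real API used is inspect.signature from the standard library.

```python
import inspect


def filter_token_type_ids(forward_fn, kwargs):
    """Return a copy of kwargs without token_type_ids when forward_fn does not name it.

    Hypothetical helper for illustration; it only checks explicitly named parameters.
    """
    params = inspect.signature(forward_fn).parameters
    filtered = dict(kwargs)  # copy so the caller's dict is left untouched
    if "token_type_ids" not in params:
        filtered.pop("token_type_ids", None)
    return filtered


# Hypothetical stand-ins for the two model families discussed in this thread.
class GemmaLikeModel:
    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        return input_ids


class PixtralLikeModel:
    def forward(self, input_ids, attention_mask=None, pixel_values=None):
        return input_ids


inputs = {"input_ids": [1, 2, 3], "token_type_ids": [0, 0, 1]}
gemma_inputs = filter_token_type_ids(GemmaLikeModel().forward, inputs)
pixtral_inputs = filter_token_type_ids(PixtralLikeModel().forward, inputs)
```

Here gemma_inputs keeps token_type_ids because the forward signature names it, while pixtral_inputs drops it, so neither model needs an architecture-specific check.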
I also tested the Pixtral notebook and inference now works. https://colab.research.google.com/drive/137uUFLRKzoZ5S2ao95eSWUghgFV6OAb1?usp=sharing
@mmathew23 So Gemma 3 is OK as well with this change?
The main issue is that Gemma 3 requires token_type_ids, since it's used for bidirectional attention.
@danielhanchen Yes. Gemma3 Vision works on an L4. T4 has some issues but I'm going to fix that in a separate PR.
The assumption I'm making is that token_type_ids will be explicitly named as a parameter in forward instead of being passed implicitly inside a kwargs dict. I think this is a fair assumption and it would help to alleviate future issues like this. If there's a concern that the explicitness could change, I can handle that case too. Let me know.
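To make that assumption concrete, here is a rough sketch of how the implicit case could be detected if it ever mattered. The function accepts_token_type_ids and its trust_var_keyword flag are hypothetical names for illustration and are not part of this PR.

```python
import inspect


def accepts_token_type_ids(forward_fn, trust_var_keyword=False):
    """Check whether forward_fn can take token_type_ids.

    By default only an explicitly named parameter counts. With
    trust_var_keyword=True, a catch-all **kwargs also counts.
    """
    params = inspect.signature(forward_fn).parameters
    if "token_type_ids" in params:
        return True
    has_var_keyword = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return trust_var_keyword and has_var_keyword


# Hypothetical forward signatures covering the three cases.
def explicit_forward(input_ids, token_type_ids=None):
    return input_ids


def kwargs_only_forward(input_ids, **kwargs):
    return input_ids


def strict_forward(input_ids, attention_mask=None):
    return input_ids
```

Under the default, kwargs_only_forward is treated as not accepting token_type_ids, which matches the explicit-parameter assumption. Whether a catch-all **kwargs should count is a judgment call, since a model might accept the key and silently ignore it.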
Here's a gemma3 vision notebook running my fix for reference.
https://colab.research.google.com/drive/1xDkkIoF_KmGZFiHPhr-d3kKu6jF4vMFn?usp=sharing
Gemma3 doesn't outright fail if you don't pass token_type_ids, so I turn the temperature down and run two generations for the same input. The second generation explicitly removes token_type_ids, and we can see the output is consistently different, and better, when token_type_ids is present.
Oh and I ran a copy of the same notebook, without my fix, and the output matches exactly.
https://colab.research.google.com/drive/1-GK5yAaTR1lki9LC0KwaH0kYKSbVAJp8?usp=sharing
After running the demo Gemma3 vision finetune notebook, it seems the vision-finetuned Gemma3 doesn't describe a figure very accurately. Is there a good way to improve its accuracy?