mlx-swift-examples
Add Gemma 3
This is a first attempt at porting https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/gemma3 to Swift. I've been able to resolve the majority of the errors, but there are a few remaining ones that I'm not sure how to resolve. Also see my TODO comments on lines that need to be checked.
~~I tried to factor out RMSNorm, since several models use it, but I'm having trouble making it accessible everywhere.~~
Edit: This is now fixed.
> I tried to factor out RMSNorm, since several models use it, but I'm having trouble making it accessible everywhere.
There is one in MLXNN as well, but they don't all have the same definition. Refactoring models can be tricky, IMHO.
I fixed some more errors, and now there are just a few errors and TODO comments left, which I'll need help resolving.
I can take a look this afternoon!
The config is working, although it can probably be improved (see TODO comment and possibly remove unneeded properties). But now I'm getting the following error when I run the model:
Failed: processing("Number of image tokens (0) does not match number of images (1)")
Something is wrong with the image tokens that are being inserted by the tokenizer vs. what's expected in this implementation vs. what's in the config vs. what I see in the Python implementation. I'll need help sorting this out.
https://huggingface.co/mlx-community/gemma-3-4b-it-4bit/blob/main/config.json
Debug output from the current commit:
Messages before tokenization: [["role": "user", "content": [["text": "Describe the image in English", "type": "text"], ["type": "image"]]]]
Prompt token IDs: [4368, 506, 105, 5422, 2, 255999, 528, 107, 82858, 2364, 2471, 106]
Decoded prompt tokens: <bos><start_of_turn>user
Describe the image in English<start_of_image><end_of_turn>
<start_of_turn>model
> The config is working, although it can probably be improved (see TODO comment and possibly remove unneeded properties). But now I'm getting the following error when I run the model:
> `Failed: processing("Number of image tokens (0) does not match number of images (1)")`
This is the prompt right before tokenization:
"<bos><start_of_turn>user\nDescribe the image in English<start_of_image><end_of_turn>\n<start_of_turn>model\n"
and per the config object we are looking for this token:
"262144": {
"content": "<image_soft_token>",
This is not in the chat template, it looks like something Gemma3Processor (transformers) adds:
```python
# Replace image tokens by the full expanded sequence
batch_num_crops = to_py_obj(image_inputs.pop("num_crops"))
text_with_crops = text
for batch_idx, (prompt, images, num_crops) in enumerate(zip(text, batched_images, batch_num_crops)):
    image_indexes = [m.start() for m in re.finditer(self.boi_token, prompt)]

    if len(images) != len(image_indexes):
        raise ValueError(
            f"Prompt contained {len(image_indexes)} image tokens but received {len(images)} images."
        )

    # Insert additional image tokens for Pan-and-Scan crops
    for num, idx in reversed(list(zip(num_crops, image_indexes))):
        if num:
            formatted_image_text = (
                f"Here is the original image {self.boi_token} and here are some crops to help you see better "
                + " ".join([self.boi_token] * num)
            )
            prompt = prompt[:idx] + formatted_image_text + prompt[idx + len(self.boi_token) :]
            text_with_crops[batch_idx] = prompt

# Expand placeholder image tokens to the full image token sequence
text = [prompt.replace(self.boi_token, self.full_image_sequence) for prompt in text]
```
The last line in particular is inserting the special tokens:
'Describe the image in English\n\n<start_of_image><image_soft_token><image_soft_token><image_soft_token><...><image_soft_token><end_of_image>\n\n' (long run of repeated <image_soft_token> elided)
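The expansion step is easy to reproduce in isolation. A minimal sketch (not the actual transformers code; the 4-token length is illustrative, the real sequence is much longer):

```python
# Every <start_of_image> placeholder is replaced by the full image sequence,
# which wraps a run of <image_soft_token> between <start_of_image> and
# <end_of_image>, padded with blank lines.
boi_token = "<start_of_image>"
full_image_sequence = (
    "\n\n" + boi_token + "<image_soft_token>" * 4 + "<end_of_image>\n\n"
)

text = ["<bos>Describe the image in English " + boi_token]
text = [prompt.replace(boi_token, full_image_sequence) for prompt in text]
print(text[0].count("<image_soft_token>"))  # 4
```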
So I think some of the transformers code needs to be included in the UserInputProcessor, along the lines of this from paligemma:
```swift
// based on transformers/processing_paligemma
let count = input.images.count * config.imageSequenceLength
prompt =
    Array(repeating: "<image>", count: count).joined() + (tokenizer.bosToken ?? "") + prompt + "\n"
```
@DePasqualeOrg ^^^ not sure if this notified you -- we are missing some code that lives in transformers.
Got it. Do you want to take on that part? I don't know if I'll be able to add anything else today.
Maybe -- I will post here when/if I am able to start it today.
I think I've replicated the processing code from transformers, and the model is now generating text without any errors, but the text is garbled. The debug output looks correct to me, but maybe I'm missing something. @pcuenca @Blaizzy @FL33TW00D, any ideas what might be going wrong?
Debug output:
Messages before tokenization: [["content": [["text": "Describe the image in English", "type": "text"], ["type": "image"]], "role": "user"]]
Prompt token IDs: [2, 105, 2364, 107, 82858, 506, 2471, 528, 5422, 255999, 106, 107, 105, 4368, 107]
Decoded prompt tokens: <bos><start_of_turn>user
Describe the image in English<start_of_image><end_of_turn>
<start_of_turn>model
Final prompt token IDs: [2, 105, 2364, 107, 82858, 506, 2471, 528, 5422, 108, 255999, 262144, 262144, 262144, ..., 262144, 256000, 108, 106, 107, 105, 4368, 107] (long run of repeated 262144 elided)
Decoded final prompt tokens: <bos><start_of_turn>user
Describe the image in English
<start_of_image><image_soft_token><image_soft_token><image_soft_token><...><image_soft_token><end_of_image>
<end_of_turn>
<start_of_turn>model
Generated text:
ను कार्यालयలో ண: ภ}... ం
2013 న హె Neurosci He filled-in ण्याची ప్రశ్нимవణ.हित ణనీ:],
నా documented ขึ้น to a lot of energy filled be bo сит a то, a lot of a ________________
falling, the opies, and covered sЯall up all of it on a lot of it thatটাতে
on garage.を含take // eur senexr.in
in
{# जीге पणт єте르) alsoh પીуз. это fills a lot.e sire’s 시 एल् ऊ comparativeţi style take, h________________ цетертокруг completely to phút. **сир**
I could take a look tomorrow, if that works.
I added the text-only model, but it's also generating strange output:
<unused62><unused62><unused62><unused62><...><unused62>
My thoughts on debugging (I have run into similar things with a few models I have ported):
- We have a working Python version, so we can compare to that.
- Fix the inputs: random seed, same temperature, etc.
- Make sure the tokens match.
- Pick a spot in the model to see if differences show up -- maybe start in Attention.
- I like to `print("\(name) \(array.shape) \(array.sum().item(Float.self))")` -- something like that can tell you at a high level whether it is the same-ish or wildly different. Once you narrow down where the differences appear, you can investigate why. For me it was often typos in the port or wrong shapes (broadcast is nice, but it doesn't let you know where it is all borked up).
- You can also start toward the end of the model evaluation and work backward, but 50% of the time I have had a problem in Attention.
- It looks like this model would work without an image, so try text only -- simplify the inputs until you can get part of it working.
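The per-layer fingerprint trick above translates to a few lines in any language. A hypothetical pure-Python version (names are illustrative; in practice you'd pass the flattened activation from each layer):

```python
def summarize(name, values):
    """Cheap fingerprint of an activation: element count plus the sum.
    Comparing these per layer across the two ports narrows down where
    the implementations diverge."""
    return f"{name} [{len(values)}] {sum(values):.4f}"

print(summarize("attn_out", [0.5, -0.25, 1.0]))  # attn_out [3] 1.2500
```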
Thanks for the tips, @davidkoski. I tried generating with text only as input to the vision model, and I got this output:
<pad><pad><pad><pad><...><pad>
I think I'll leave this for others to finish, since I've already spent many hours on it and am at the limit of my capabilities. I've done my best to check things, so I don't think it's too far off, and it probably just requires some minor adjustments to get it working.
Ah, that is unfortunate -- I wonder if the python code is set up to call it the same way without an image? Anyway, I made a red image to test with:
and I get this from python (not pinning any parameters yet):
The color of the image is red.
The color of the image is red.
The image is red.
The image is red.
The color is red.
The color of this image is red.
...
and I get this from swift:
> вто
You’it won’t list experience
Я си Ш ниவனாக(хиару наутер, 1 ın.ordeel11.raa
...
so I can repro at least :-)
It seems like it might be a tokenization issue, so it would be interesting to get @pcuenca's input. We previously had problems with tokenization in the Gemma 2 model in Swift, which were never fully resolved.
I copied the tokens & mask array from the python version into swift and got the same garbled output. So probably not the tokenizing, but there are differences.
It looks like the python version doesn't have as much of the template:
<bos>what color is this image?
<start_of_image><image_soft_token><image_soft_token><image_soft_token><image_soft_token><image_soft_token><image_soft_token><...><end_of_image>
while swift has this (without the image tokens injected yet):
<bos><start_of_turn>user
Describe the image in English<start_of_image><end_of_turn>
<start_of_turn>model
> while swift has this (without the image tokens injected yet):
> <bos><start_of_turn>user Describe the image in English<start_of_image><end_of_turn> <start_of_turn>model
^ This version of the chat template is correct.
I suspect the attention scale; testing
It works. Pushing in a sec. We also need to use <end_of_turn> as a terminator, otherwise it will keep generating those tokens non-stop.
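The terminator logic amounts to checking each sampled token against a stop set. A hypothetical sketch (the IDs 1 for `<eos>` and 106 for `<end_of_turn>` match the token dumps above, but should be read from the tokenizer config rather than hard-coded):

```python
STOP_TOKENS = {1, 106}  # <eos>, <end_of_turn>; verify against tokenizer_config

def decode(next_token, max_tokens=10):
    """next_token() returns the next sampled token id; stop on any stop token."""
    out = []
    for _ in range(max_tokens):
        tok = next_token()
        if tok in STOP_TOKENS:
            break
        out.append(tok)
    return out

it = iter([4368, 506, 2471, 106, 2471])  # fake sampler ending with <end_of_turn>
print(decode(lambda: next(it)))  # [4368, 506, 2471]
```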
Here it is @DePasqualeOrg https://github.com/DePasqualeOrg/mlx-swift-examples/pull/1 🤗
Regarding text-only mode not working, this is also happening in the Python version now - not sure what happened yet, will look into it later.
Fantastic, thanks @pcuenca! Maybe the problem with text generation using the 1B text-only model is related to the problem with the vision models.
My latest commit might need to be cleaned up a bit, but I think it solved a problem with shapes related to the mask.
Gemma 3 4B is working with images, although the output quality quickly degrades in a multi-turn conversation.
When I try to load Gemma 3 12B, I get this error: Mismatched parameter scales shape. Actual [2048, 60], expected [1024, 60]
> Gemma 3 4B is working with images, although the output quality quickly degrades in a multi-turn conversation.
> When I try to load Gemma 3 12B, I get this error: `Mismatched parameter scales shape. Actual [2048, 60], expected [1024, 60]`
It looks like the vision_tower layers are not quantized:
because of this:
https://huggingface.co/mlx-community/gemma-3-12b-it-4bit/blob/main/config.json#L43
Here is the predicate for quantization from mlx_vlm:
```python
def get_class_predicate(skip_vision, weights=None):
    if skip_vision:
        return lambda p, m: hasattr(m, "to_quantized") and not (
            "vision_model" in p or "vision_tower" in p
        )
    else:
        if weights:
            return lambda p, m: (
                hasattr(m, "to_quantized")
                and m.weight.shape[-1] % 64 == 0
                and f"{p}.scales" in weights
            )
        else:
            return (
                lambda _, m: hasattr(m, "to_quantized") and m.weight.shape[-1] % 64 == 0
            )
```
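To see which layers that predicate skips, here is a small self-contained check (the `skip_vision` branch restated so it runs on its own; the dummy `Linear` class is hypothetical):

```python
def get_class_predicate(skip_vision, weights=None):
    # skip_vision branch from mlx_vlm: quantize anything quantizable
    # that is not part of the vision stack
    if skip_vision:
        return lambda p, m: hasattr(m, "to_quantized") and not (
            "vision_model" in p or "vision_tower" in p
        )
    raise NotImplementedError("other branches elided")

class Linear:
    def to_quantized(self):
        pass

pred = get_class_predicate(skip_vision=True)
print(pred("language_model.layers.0.mlp", Linear()))  # True: gets quantized
print(pred("vision_tower.blocks.0.attn", Linear()))   # False: kept in full precision
```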
I think my latest commit might fix the quantization issue, but please check it. However, I still can't load the 12B model.
> Gemma 3 4B is working with images, although the output quality quickly degrades in a multi-turn conversation.
> When I try to load Gemma 3 12B, I get this error: `Mismatched parameter scales shape. Actual [2048, 60], expected [1024, 60]`
> It looks like the vision_tower layers are not quantized, because of this: https://huggingface.co/mlx-community/gemma-3-12b-it-4bit/blob/main/config.json#L43
Yes, most vision modules are either small, have shapes that require padding, or are sensitive to quantization, so I introduced the skip-vision predicate in the Python version.