
Add Gemma 3

Open DePasqualeOrg opened this issue 8 months ago • 72 comments

This is a first attempt at porting https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/gemma3 to Swift. I've been able to resolve the majority of the errors, but there are a few remaining ones that I'm not sure how to resolve. Also see my TODO comments on lines that need to be checked.

DePasqualeOrg avatar Mar 12 '25 10:03 DePasqualeOrg

~~I tried to factor out RMSNorm, since several models use it, but I'm having trouble making it accessible everywhere.~~

Edit: This is now fixed.

DePasqualeOrg avatar Mar 12 '25 14:03 DePasqualeOrg

I tried to factor out RMSNorm, since several models use it, but I'm having trouble making it accessible everywhere.

There is one in MLXNN as well, but they don't all have the same definition. Refactoring models can be tricky IMHO
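
For reference, a minimal NumPy sketch of the two RMSNorm variants in play (a sketch, not the MLXNN implementation): the core normalization is the same, but Gemma's reference implementation applies the learned scale as (1 + weight), while many other models, and MLXNN's RMSNorm, apply it as weight directly — one reason a single shared definition is awkward.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6, gemma_style=False):
    # Normalize by the root-mean-square over the last axis, then scale.
    # Gemma-style applies the scale as (1 + weight); the plain variant
    # applies it as weight directly.
    norm = x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return norm * (1.0 + weight) if gemma_style else norm * weight

x = np.array([[1.0, 2.0, 3.0]])
w = np.zeros(3)
# With zero weights the two variants diverge completely: the plain form
# returns zeros, the Gemma form returns the normalized values unchanged.
print(rms_norm(x, w))
print(rms_norm(x, w, gemma_style=True))
```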

davidkoski avatar Mar 12 '25 14:03 davidkoski

I fixed some more errors, and now there are just a few errors and TODO comments left, which I'll need help resolving.

DePasqualeOrg avatar Mar 12 '25 16:03 DePasqualeOrg

I fixed some more errors, and now there are just a few errors and TODO comments left, which I'll need help resolving.

I can take a look this afternoon!

davidkoski avatar Mar 12 '25 18:03 davidkoski

The config is working, although it can probably be improved (see TODO comment and possibly remove unneeded properties). But now I'm getting the following error when I run the model:

Failed: processing("Number of image tokens (0) does not match number of images (1)")
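
The error suggests the processor counts the image placeholder tokens actually present in the tokenized prompt and compares that against the number of images supplied. A hedged Python sketch of that kind of check (token IDs taken from the config linked below; the real Swift code may differ):

```python
def check_image_tokens(prompt_token_ids, image_token_id, num_images):
    # Count placeholder image tokens present in the tokenized prompt
    # and fail if the count doesn't match the images supplied.
    n_tokens = sum(1 for t in prompt_token_ids if t == image_token_id)
    if n_tokens != num_images:
        raise ValueError(
            f"Number of image tokens ({n_tokens}) does not match "
            f"number of images ({num_images})"
        )

# The prompt contains <start_of_image> (255999) but no <image_soft_token>
# (262144), so a check keyed on 262144 fails with one image supplied:
ids = [2, 255999, 528, 107]
try:
    check_image_tokens(ids, image_token_id=262144, num_images=1)
except ValueError as e:
    print(e)
```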

DePasqualeOrg avatar Mar 13 '25 17:03 DePasqualeOrg

Something is wrong with the image tokens that are being inserted by the tokenizer vs. what's expected in this implementation vs. what's in the config vs. what I see in the Python implementation. I'll need help sorting this out.

https://huggingface.co/mlx-community/gemma-3-4b-it-4bit/blob/main/config.json

Debug output from the current commit:

Messages before tokenization: [["role": "user", "content": [["text": "Describe the image in English", "type": "text"], ["type": "image"]]]]
Prompt token IDs: [4368, 506, 105, 5422, 2, 255999, 528, 107, 82858, 2364, 2471, 106]
Decoded prompt tokens: <bos><start_of_turn>user
Describe the image in English<start_of_image><end_of_turn>
<start_of_turn>model

DePasqualeOrg avatar Mar 13 '25 17:03 DePasqualeOrg

The config is working, although it can probably be improved (see TODO comment and possibly remove unneeded properties). But now I'm getting the following error when I run the model:

Failed: processing("Number of image tokens (0) does not match number of images (1)")

This is the prompt right before tokenization:

"<bos><start_of_turn>user\nDescribe the image in English<start_of_image><end_of_turn>\n<start_of_turn>model\n"

and per the config object we are looking for this token:

    "262144": {
      "content": "<image_soft_token>",

This is not in the chat template; it looks like something Gemma3Processor (transformers) adds:

            # Replace image tokens by the full expanded sequence
            batch_num_crops = to_py_obj(image_inputs.pop("num_crops"))
            text_with_crops = text
            for batch_idx, (prompt, images, num_crops) in enumerate(zip(text, batched_images, batch_num_crops)):
                image_indexes = [m.start() for m in re.finditer(self.boi_token, prompt)]

                if len(images) != len(image_indexes):
                    raise ValueError(
                        f"Prompt contained {len(image_indexes)} image tokens but received {len(images)} images."
                    )

                # Insert additional image tokens for Pan-and-Scan crops
                for num, idx in reversed(list(zip(num_crops, image_indexes))):
                    if num:
                        formatted_image_text = (
                            f"Here is the original image {self.boi_token} and here are some crops to help you see better "
                            + " ".join([self.boi_token] * num)
                        )
                        prompt = prompt[:idx] + formatted_image_text + prompt[idx + len(self.boi_token) :]
                        text_with_crops[batch_idx] = prompt

            # Expand placeholder image tokens to the full image token sequence
            text = [prompt.replace(self.boi_token, self.full_image_sequence) for prompt in text]

The last line in particular is inserting the special tokens:

'Describe the image in English\n\n&lt;start_of_image&gt;&lt;image_soft_token&gt;&lt;image_soft_token&gt;&lt;image_soft_token&gt;[… &lt;image_soft_token&gt; repeated many times, truncated for readability …]&lt;image_soft_token&gt;&lt;end_of_image&gt;\n\n'
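
The effect of that last line can be reproduced in isolation. A sketch (the token strings match the config, but the sequence length here is illustrative; the real processor expands each placeholder to the model's full image token sequence):

```python
def expand_image_tokens(prompt, seq_len=4,
                        boi="<start_of_image>",
                        soft="<image_soft_token>",
                        eoi="<end_of_image>"):
    # Replace each <start_of_image> placeholder with the full expanded
    # sequence, mirroring what Gemma3Processor does in transformers.
    full_sequence = "\n\n" + boi + soft * seq_len + eoi + "\n\n"
    return prompt.replace(boi, full_sequence)

print(expand_image_tokens("Describe the image in English<start_of_image>"))
```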

davidkoski avatar Mar 13 '25 17:03 davidkoski

So I think some of the transformers code needs to be included in the UserInputProcessor, along the lines of this from paligemma:

        // based on transformers/processing_paligemma
        let count = input.images.count * config.imageSequenceLength
        prompt =
            Array(repeating: "<image>", count: count).joined() + (tokenizer.bosToken ?? "") + prompt
            + "\n"
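
In Python terms that preprocessing amounts to something like the following sketch (names are illustrative; the real processor takes the sequence length from the model config):

```python
def build_paligemma_prompt(prompt, num_images, image_seq_len, bos="<bos>"):
    # Prepend one <image> placeholder per expected image token, then the
    # BOS token, then the user prompt, then a trailing newline -- per
    # transformers/processing_paligemma.
    return "<image>" * (num_images * image_seq_len) + bos + prompt + "\n"

print(build_paligemma_prompt("what color is this image?", 1, 3))
```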

davidkoski avatar Mar 13 '25 17:03 davidkoski

@DePasqualeOrg ^^^ not sure if this notified you -- we are missing some code that lives in transformers.

davidkoski avatar Mar 13 '25 18:03 davidkoski

Got it. Do you want to take on that part? I don't know if I'll be able to add anything else today.

DePasqualeOrg avatar Mar 13 '25 18:03 DePasqualeOrg

Got it. Do you want to take on that part? I don't know if I'll be able to add anything else today.

Maybe -- I will post here when/if I am able to start it today.

davidkoski avatar Mar 13 '25 18:03 davidkoski

I think I've replicated the processing code from transformers, and the model is now generating text without any errors, but the text is garbled. The debug output looks correct to me, but maybe I'm missing something. @pcuenca @Blaizzy @FL33TW00D, any ideas what might be going wrong?

Debug output:

Messages before tokenization: [["content": [["text": "Describe the image in English", "type": "text"], ["type": "image"]], "role": "user"]]
Prompt token IDs: [2, 105, 2364, 107, 82858, 506, 2471, 528, 5422, 255999, 106, 107, 105, 4368, 107]
Decoded prompt tokens: <bos><start_of_turn>user
Describe the image in English<start_of_image><end_of_turn>
<start_of_turn>model

Final prompt token IDs: [2, 105, 2364, 107, 82858, 506, 2471, 528, 5422, 108, 255999, 262144, 262144, 262144, … (262144 repeated many times, truncated for readability) …, 262144, 256000, 108, 106, 107, 105, 4368, 107]
Decoded final prompt tokens: <bos><start_of_turn>user
Describe the image in English

<start_of_image><image_soft_token><image_soft_token><image_soft_token>[… <image_soft_token> repeated many times, truncated for readability …]<image_soft_token><end_of_image>

<end_of_turn>
<start_of_turn>model

Generated text:

ను कार्यालयలో ண: ภ}... ం
2013 న హె Neurosci He filled-in ण्याची ప్రశ్нимవణ.हित ణనీ:],

నా documented ขึ้น to a lot of energy filled be bo сит a то, a lot of a ________________
	


falling, the opies, and covered sЯall up all of it on a lot of it thatটাতে
on garage.を含take 	//	eur senexr.in
in
{# जीге पणт єте르) alsoh પીуз. это fills a lot.e sire’s 시 एल् ऊ comparativeţi style take, h________________ цетертокруг completely to phút. **сир**

DePasqualeOrg avatar Mar 14 '25 08:03 DePasqualeOrg

I could take a look tomorrow, if that works.

pcuenca avatar Mar 14 '25 12:03 pcuenca

I added the text-only model, but it's also generating strange output:

<unused62><unused62><unused62>[… <unused62> repeated many times, truncated for readability …]<unused62>

DePasqualeOrg avatar Mar 14 '25 15:03 DePasqualeOrg

My thoughts on debugging (I have run into similar things with a few models I have ported):

  • we have a working python version, we can compare to that

    • fix the inputs: random seed, same temperature, etc.
    • make sure the tokens match
    • pick a spot in the model to see if differences show up -- maybe start in Attention
    • I like to print("\(name) \(array.shape) \(array.sum().item(Float.self))") -- something like that can tell you at a high level if it is the same-ish or wildly different
    • once you narrow down where the differences appear you can investigate why. For me it was often typos in the port or wrong shapes (broadcast is nice, but it doesn't let you know where it is all borked up)
    • you can also start toward the end of the model evaluation and work backward, but 50% of the time I have had a problem in Attention
  • it looks like this model would work without an image, try text only -- simplify the inputs until you can get part of it working
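
The per-layer summary print suggested above has a direct Python analogue, so the reference implementation can emit comparable lines (the tensor name here is hypothetical):

```python
import numpy as np

def summarize(name, array):
    # One line per tensor: name, shape, and a scalar checksum. Diffing
    # these lines between the Python and Swift runs narrows down where
    # the two implementations start to diverge.
    print(f"{name} {list(array.shape)} {float(array.sum()):.4f}")

q = np.ones((1, 8, 4))   # hypothetical attention-query activations
summarize("attn.q", q)   # attn.q [1, 8, 4] 32.0000
```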

davidkoski avatar Mar 14 '25 15:03 davidkoski

Thanks for the tips, @davidkoski. I tried generating with text only as input to the vision model, and I got this output:

<pad><pad><pad>[… <pad> repeated many times, truncated for readability …]<pad>

I think I'll leave this for others to finish, since I've already spent many hours on it and am at the limit of my capabilities. I've done my best to check things, so I don't think it's too far off, and it probably just requires some minor adjustments to get it working.

DePasqualeOrg avatar Mar 14 '25 15:03 DePasqualeOrg

Ah, that is unfortunate -- I wonder if the python code is set up to call it the same way without an image? Anyway, I made a red image to test with:

[red test image]

and I get this from python (not pinning any parameters yet):

The color of the image is red.

The color of the image is red.

The image is red.

The image is red.

The color is red.

The color of this image is red.

...

and I get this from swift:

> вто

You’it won’t list experience
Я си Ш ниவனாக(хиару наутер, 1 ın.ordeel11.raa
...

so I can repro at least :-)

davidkoski avatar Mar 14 '25 16:03 davidkoski

It seems like it might be a tokenization issue, so it would be interesting to get @pcuenca's input. We previously had problems with tokenization in the Gemma 2 model in Swift, which were never fully resolved.

DePasqualeOrg avatar Mar 14 '25 16:03 DePasqualeOrg

I copied the tokens & mask array from the python version into swift and got the same garbled output. So probably not the tokenizing, but there are differences.

It looks like the python version doesn't have as much of the template:

<bos>what color is this image?

<start_of_image><image_soft_token><image_soft_token><image_soft_token><image_soft_token><image_soft_token><image_soft_token><...><end_of_image>


while swift has this (without the image tokens injected yet):

<bos><start_of_turn>user
Describe the image in English<start_of_image><end_of_turn>
<start_of_turn>model

davidkoski avatar Mar 14 '25 18:03 davidkoski

while swift has this (without the image tokens injected yet):

<bos><start_of_turn>user
Describe the image in English<start_of_image><end_of_turn>
<start_of_turn>model

^ This version of the chat template is correct.

pcuenca avatar Mar 14 '25 18:03 pcuenca

I suspect the attention scale; testing
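
For context on why the scale is a plausible culprit: Gemma models configure a query_pre_attn_scalar that can differ from the usual 1/sqrt(head_dim), and a wrong scale skews every attention softmax without raising an error, which tends to surface exactly as garbled output. A sketch with illustrative numbers:

```python
import numpy as np

def attn_logits(q, k, scale):
    # Scaled dot-product attention logits; only the scale differs
    # between the two calls below.
    return (q @ k.T) * scale

head_dim = 64
q, k = np.ones((1, head_dim)), np.ones((3, head_dim))
default_scale = head_dim ** -0.5   # the usual 1/sqrt(head_dim)
custom_scale = 32 ** -0.5          # hypothetical query_pre_attn_scalar
print(attn_logits(q, k, default_scale)[0, 0])  # 8.0
print(attn_logits(q, k, custom_scale)[0, 0])
```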

pcuenca avatar Mar 14 '25 19:03 pcuenca

It works. Pushing in a sec. We also need to use <end_of_turn> as a terminator, otherwise it will keep generating those tokens non-stop.
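
Stopping on that terminator can be sketched like this (per the decoded prompts earlier in the thread, <end_of_turn> appears to be token 106; treat the IDs as illustrative):

```python
def generate_until(token_stream, stop_ids):
    # Collect tokens until a stop token appears; without <end_of_turn>
    # in stop_ids, generation keeps emitting turn markers indefinitely.
    out = []
    for tok in token_stream:
        if tok in stop_ids:
            break
        out.append(tok)
    return out

END_OF_TURN = 106  # per the decoded prompts earlier in this thread
stream = iter([4368, 506, 2471, END_OF_TURN, 4368])
print(generate_until(stream, {END_OF_TURN}))  # [4368, 506, 2471]
```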

pcuenca avatar Mar 14 '25 19:03 pcuenca

Here it is @DePasqualeOrg https://github.com/DePasqualeOrg/mlx-swift-examples/pull/1 🤗

pcuenca avatar Mar 14 '25 19:03 pcuenca

Regarding text-only mode not working, this is also happening in the Python version now - not sure what happened yet, will look into it later.

pcuenca avatar Mar 14 '25 19:03 pcuenca

Fantastic, thanks @pcuenca! Maybe the problem with text generation using the 1B text-only model is related to the problem with the vision models.

My latest commit might need to be cleaned up a bit, but I think it solved a problem with shapes related to the mask.

DePasqualeOrg avatar Mar 14 '25 19:03 DePasqualeOrg

Gemma 3 4B is working with images, although the output quality quickly degrades in a multi-turn conversation.

When I try to load Gemma 3 12B, I get this error: Mismatched parameter scales shape. Actual [2048, 60], expected [1024, 60]

DePasqualeOrg avatar Mar 14 '25 20:03 DePasqualeOrg

Gemma 3 4B is working with images, although the output quality quickly degrades in a multi-turn conversation.

When I try to load Gemma 3 12B, I get this error: Mismatched parameter scales shape. Actual [2048, 60], expected [1024, 60]

It looks like the vision_tower layers are not quantized:

[screenshot of the loaded model's layers]

because of this:

https://huggingface.co/mlx-community/gemma-3-12b-it-4bit/blob/main/config.json#L43

davidkoski avatar Mar 14 '25 21:03 davidkoski

Here is the predicate for quantization from mlx_vlm:

def get_class_predicate(skip_vision, weights=None):
    if skip_vision:
        return lambda p, m: hasattr(m, "to_quantized") and not (
            "vision_model" in p or "vision_tower" in p
        )
    else:
        if weights:
            return lambda p, m: (
                hasattr(m, "to_quantized")
                and m.weight.shape[-1] % 64 == 0
                and f"{p}.scales" in weights
            )
        else:
            return (
                lambda _, m: hasattr(m, "to_quantized") and m.weight.shape[-1] % 64 == 0
            )
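
Its behavior with skip_vision=True can be illustrated with a stand-in module (a local copy of that branch, since the snippet above isn't importable here; FakeLinear only mimics the attributes the predicate touches):

```python
import numpy as np

# Local copy of the skip_vision branch of mlx_vlm's predicate.
def skip_vision_predicate(path, module):
    return hasattr(module, "to_quantized") and not (
        "vision_model" in path or "vision_tower" in path
    )

class FakeLinear:
    # Minimal stand-in for an MLX linear layer.
    def __init__(self, shape):
        self.weight = np.zeros(shape)
    def to_quantized(self):
        pass

lin = FakeLinear((128, 128))
print(skip_vision_predicate("language_model.layers.0.q_proj", lin))  # True
print(skip_vision_predicate("vision_tower.encoder.0.q_proj", lin))   # False
```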

davidkoski avatar Mar 14 '25 21:03 davidkoski

I think my latest commit might fix the quantization issue, but please check it. However, I still can't load the 12B model.

DePasqualeOrg avatar Mar 14 '25 21:03 DePasqualeOrg

Gemma 3 4B is working with images, although the output quality quickly degrades in a multi-turn conversation.

When I try to load Gemma 3 12B, I get this error: Mismatched parameter scales shape. Actual [2048, 60], expected [1024, 60]

It looks like the vision_tower layers are not quantized:

[screenshot of the loaded model's layers]

because of this:

https://huggingface.co/mlx-community/gemma-3-12b-it-4bit/blob/main/config.json#L43

Yes, most vision modules are either small, have shapes that require padding, or are sensitive to quantisation, so I introduced the skip-vision predicate in the Python version.

Blaizzy avatar Mar 16 '25 08:03 Blaizzy