mlx-swift-examples
Fix idefics3 do-image-split
Description
This PR matches the fix for Idefics 2 and 3 in mlx-vlm.
Source: https://github.com/Blaizzy/mlx-vlm/pull/191
@davidkoski @awni I'm not a Swift expert and would love to start contributing. This is my first attempt.
Although I couldn't manage to run this example successfully, this change should allow the model to process images at multiple resolutions, which boosts OCR and finer-grained detail. For this, Idefics 3 models have a preprocessor argument called do_image_splitting, which is True by default. I had previously set it to false on all MLX models, but I have now pinpointed the bug and made a fix.
Input pixel values:
- (1, 3, 512, 512) without do_image_splitting
- (13, 3, 512, 512) with do_image_splitting
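For context, this is my reading of the transformers Idefics3 image processor rather than code from this PR: the extra leading dimension comes from splitting the resized image into a grid of max_image_size tiles plus one downscaled global image. A rough Swift sketch, with a hypothetical helper name:

```swift
// Hypothetical helper (not part of this PR): estimates how many images reach the
// vision tower when do_image_splitting is enabled, assuming the same grid rule as
// the transformers Idefics3 image processor (tiles of max_image_size plus one
// downscaled global image).
func expectedImageCount(height: Int, width: Int, maxEdge: Int = 512, doImageSplitting: Bool = true) -> Int {
    guard doImageSplitting else { return 1 }
    let rows = (height + maxEdge - 1) / maxEdge   // ceil(height / maxEdge)
    let cols = (width + maxEdge - 1) / maxEdge    // ceil(width / maxEdge)
    return rows * cols + 1                        // tiles + global image
}

// Example (hypothetical input size): a 2048x1536 image gives 4 * 3 + 1 = 13,
// matching the (13, 3, 512, 512) shape above.
```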
Python MLX-VLM example
Input Image:
Before
In this image we can see an airplane flying in the air. In the background we can see the land with water.<end_of_utterance>
==========
Prompt: 873 tokens, 1766.771 tokens-per-sec
Generation: 26 tokens, 200.883 tokens-per-sec
Peak memory: 2.118 GB
After (matches transformers ✅)
The image depicts a small, sleek, and narrow aircraft, likely a private jet, flying at a high altitude. The aircraft is oriented with its tail and wings pointing towards the right side of the image, suggesting it is moving from left to right. The aircraft's body is primarily white with a black and gray tail fin and a black and gray propeller. The tail fin has the text "G700" printed on it in black.
The aircraft's wings are extended, and the landing gear is retracted, indicating that it is either preparing to land or is taxiing. The aircraft's windows are visible, and there are several small, circular openings along the sides, which are likely for the cockpit windows.
==========
Prompt: 873 tokens, 1801.514 tokens-per-sec
Generation: 250 tokens, 200.506 tokens-per-sec
Peak memory: 2.118 GB
I also think it will probably fix the broadcast error reported here.
I tried adding print statements to Idefics3.swift so I could see the tensors, but I can't see them in the CLI when I run it using mlx-run.
Here is the execution command I tried.
./mlx-run llm-tool --model /Users/prince_canuma/SmolVLM-500M-Instruct-6bit --prompt "Describe this image" --image "/Users/prince_canuma/Downloads/78000D05-4283-4575-BB67-952702717C99.JPEG" --max-tokens 100
--- xcodebuild: WARNING: Using the first of multiple matching destinations:
{ platform:macOS, arch:arm64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, arch:x86_64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, name:Any Mac }
Model loaded -> directory(file:///Users/prince_canuma/SmolVLM-500M-Instruct-6bit)
Starting generation ...
Describe this image MLX error: Shapes (1,576,768) and (1,1024,768) cannot be broadcast. at /Users/prince_canuma/Library/Developer/Xcode/DerivedData/mlx-swift-examples-bbrnuectwzmzmfbvavmcvmahnvzs/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:31
Btw, the vision patch-embed sanitization is not working well, though the code checks out to me.
./mlx-run llm-tool --model HuggingFaceTB/SmolVLM-500M-Instruct --prompt "Describe this image" --image "/Users/prince_canuma/Downloads/78000D05-4283-4575-BB67-952702717C99.JPEG" --max-tokens 100
--- xcodebuild: WARNING: Using the first of multiple matching destinations:
{ platform:macOS, arch:arm64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, arch:x86_64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, name:Any Mac }
Error: Mismatched parameter weight shape. Actual [768, 3, 16, 16], expected [768, 16, 16, 3]
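For what it's worth, the mismatch reads like the conv weight is still in PyTorch's (out, in, kH, kW) layout and never gets moved to MLX's (out, kH, kW, in) layout when loading an unquantized checkpoint. A hedged sketch of the kind of sanitization step that would fix it (the function name and key check are illustrative, not the repo's actual code):

```swift
import MLX

// Sketch only: transpose the vision patch-embedding conv weight from PyTorch's
// (out, in, kH, kW) layout to MLX's (out, kH, kW, in) layout, e.g.
// [768, 3, 16, 16] -> [768, 16, 16, 3].
func sanitizePatchEmbedding(_ weights: [String: MLXArray]) -> [String: MLXArray] {
    var sanitized = weights
    for (key, value) in weights where key.contains("patch_embedding.weight") {
        // Heuristic: only transpose when the channel axis is still in position 1.
        if value.ndim == 4 && value.dim(1) < value.dim(3) {
            sanitized[key] = value.transposed(0, 2, 3, 1)
        }
    }
    return sanitized
}
```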
I also think it will probably fix the broadcast error reported in https://github.com/Blaizzy/mlx-vlm/issues/194.
@Blaizzy Unfortunately this PR didn't solve that issue. It crashes here before entering prepareInputsForMultimodal().
- The actual crash happens in ops.cpp, but since this is a shape bug, I don't think mlx-c is the issue.
Full stack trace before the crash
Thread 4 Queue : com.apple.root.user-initiated-qos.cooperative (concurrent)
#0 0x0000000106c58e18 in ::mlx_add(mlx_array *, mlx_array, mlx_array, mlx_stream) at /Users/user/Library/Developer/Xcode/DerivedData/mlx-swift-examples-djkwqspwvzjyvodqeyxeahqxpsse/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:27
#1 0x0000000106b5e520 in static MLXArray.+ infix(_:_:) at /Users/user/Library/Developer/Xcode/DerivedData/mlx-swift-examples-djkwqspwvzjyvodqeyxeahqxpsse/SourcePackages/checkouts/mlx-swift/Source/MLX/MLXArray+Ops.swift:45
#2 0x0000000106ad7f2c in Vision.VisionEmbeddings.callAsFunction(_:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:553
#3 0x0000000106ad8d10 in Vision.VisionModel.callAsFunction(_:outputHiddenStates:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:579
#4 0x0000000106ada8c0 in Idefics3.getInputEmbeddings(inputIds:pixelValues:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:649
#5 0x0000000106adb178 in Idefics3.prepare(_:cache:windowSize:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:703
#6 0x0000000106adbf5c in protocol witness for LanguageModel.prepare(_:cache:windowSize:) in conformance Idefics3 ()
#7 0x000000010791d868 in TokenIterator.prepare(input:windowSize:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:328
#8 0x000000010791de08 in TokenIterator.init(input:model:cache:parameters:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:299
#9 0x000000010791ff00 in generate(input:parameters:context:didGenerate:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:491
#10 0x0000000106ab4938 in closure #1 in VLMEvaluator.generate(prompt:image:) at /Users/user/practice/mlx-swift-examples/Applications/VLMEval/ContentView.swift:339
#11 0x0000000106ab7d4c in partial apply for closure #1 in VLMEvaluator.generate(prompt:image:) ()
#12 0x0000000107935374 in ModelContainer.perform<τ_0_0>(_:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/ModelContainer.swift:66
#13 0x0000000106ab3b10 in VLMEvaluator.generate(prompt:image:) at /Users/user/practice/mlx-swift-examples/Applications/VLMEval/ContentView.swift:333
#14 0x0000000106aaf374 in closure #1 in ContentView.generate() at /Users/user/practice/mlx-swift-examples/Applications/VLMEval/ContentView.swift:230
#15 0x0000000106ab734c in partial apply for closure #1 in ContentView.generate() ()
#16 0x0000000106ab9834 in thunk for @escaping @isolated(any) @callee_guaranteed @async () -> (@out A) ()
#17 0x0000000106aba48c in partial apply for thunk for @escaping @isolated(any) @callee_guaranteed @async () -> (@out A) ()
Looking at pixel_values coming into getInputEmbeddings in Python, I see:
pixel_values.shape=(1, 13, 3, 512, 512)
vs
(lldb) po pixelValues!.shape
▿ 4 elements
- 0 : 1
- 1 : 384
- 2 : 384
- 3 : 3
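Assuming a 16-pixel patch size and hidden size 768 (the SigLIP-style defaults these shapes suggest), the patch arithmetic lines up with the broadcast error above:

```swift
// Back-of-the-envelope check (assumed patch size 16, hidden size 768):
let patchSize = 16
let swiftPatches = (384 / patchSize) * (384 / patchSize)    // 576  -> (1, 576, 768)
let pythonPatches = (512 / patchSize) * (512 / patchSize)   // 1024 -> (1, 1024, 768)
// The position embedding is sized for 512x512 crops (1024 positions), but the Swift
// processor hands the vision tower a single 384x384 image (576 patches), hence
// "Shapes (1,576,768) and (1,1024,768) cannot be broadcast".
```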
It looks like:
- we should not have fixedImageSize but get that from Idefics3ProcessorConfiguration (need the property)
"max_image_size": {
"longest_edge": 512
},
- I think the code from transformers (Python) idefics3/image_processor needs to be ported (a rough sketch follows after this list):
if do_image_splitting:
# We first resize both height and width of each image to the nearest max_image_size multiple, disregarding the aspect ratio
# for size=(10, max_image_size) -> rescaled_size=(max_image_size, max_image_size)
# for size=(11, max_image_size+1) -> rescaled_size=(max_image_size, max_image_size*2)
...
- and we need the config to trigger this
As far as I can tell we are passing a non-split image in where a split image is expected. I don't know if that is the entirety of it, but that seems to be the difference between the swift and python code before it hits the assertions.
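As a starting point only, here is a rough Swift sketch of the resize rule quoted in that snippet; the names are hypothetical and this is not a complete port of the transformers processor:

```swift
import Foundation

// Sketch of the pre-split resize rule: each edge is rounded up to the nearest
// multiple of max_image_size (e.g. 10 -> 512 and 513 -> 1024 when maxEdge is 512),
// ignoring the aspect ratio, before the image is cut into maxEdge x maxEdge tiles.
func sizeForSplitting(height: Int, width: Int, maxEdge: Int) -> (height: Int, width: Int) {
    func roundUp(_ edge: Int) -> Int {
        Int(ceil(Double(edge) / Double(maxEdge))) * maxEdge
    }
    return (roundUp(height), roundUp(width))
}

// The resized image would then be split into a grid of maxEdge x maxEdge crops, with a
// downscaled global image appended, which is what produces the (N, 3, 512, 512) batch
// seen on the Python side.
```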
@Blaizzy
I tried adding print statements to Idefics3.swift so I could see the tensors, but I can't see them in the CLI when I run it using mlx-run.
I typically use the debugger in Xcode for this. You can press cmd-opt-r to set the arguments to the program. You may want to set the Run Configuration to Debug:
That said, I am able to get it to print shapes:
./mlx-run --debug llm-tool --model mlx-community/SmolVLM-500M-Instruct-6bit --image ~/Downloads/408605771-277c33f2-502c-4ee1-a6be-d94c086aaed7.png --prompt "what is this"
2025-02-03 09:00:03.344 xcodebuild[92546:18373549] DVTDeviceOperation: Encountered a build number "" that is incompatible with DVTBuildVersion.
2025-02-03 09:00:03.345 xcodebuild[92546:18373540] [MT] DVTDeviceOperation: Encountered a build number "" that is incompatible with DVTBuildVersion.
--- xcodebuild: WARNING: Using the first of multiple matching destinations:
{ platform:macOS, arch:arm64, id:00006021-001C18320C47401E, name:My Mac }
{ platform:macOS, arch:x86_64, id:00006021-001C18320C47401E, name:My Mac }
{ platform:macOS, name:Any Mac }
Model loaded -> id("mlx-community/SmolVLM-500M-Instruct-6bit")
Starting generation ...
what is this pixels: [1, 384, 384, 3]
MLX error: Shapes (1,576,768) and (1,1024,768) cannot be broadcast. at /Users/dkoski/Library/Developer/Xcode/DerivedData/mlx-swift-examples-eimbjcofifunwybkcvhnzjbqwyri/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:3
The --debug is to switch it to pick up the Debug Run Configuration. If you are using Release you can omit that.
I just added a print like this:
diff --git a/Libraries/MLXVLM/Models/Idefics3.swift b/Libraries/MLXVLM/Models/Idefics3.swift
index 634c05e..624241c 100644
--- a/Libraries/MLXVLM/Models/Idefics3.swift
+++ b/Libraries/MLXVLM/Models/Idefics3.swift
@@ -852,6 +852,8 @@ public class Idefics3Processor: UserInputProcessor {
{
pixels = pixels.transposed(0, 2, 3, 1)
}
+
+ print("pixels: \(pixels.shape)")
return LMInput(
text: .init(tokens: promptArray, mask: mask),
@Blaizzy Unfortunately this PR didn't solve that issue. It crashes here before entering prepareInputsForMultimodal().
The actual crash happens in ops.cpp, but since this is a shape bug, I don't think mlx-c is the issue.
I see, thanks a lot!
I figured as much but I got sick last week and only returned this week.
As far as I can tell we are passing a non-split image in where a split image is expected. I don't know if that is the entirety of it, but that seems to be the difference between the swift and python code before it hits the assertions.
In Python it works with do_image_splitting both True and False, because we flatten the first 3 of the 4 dimensions of the image_feature tensor.
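To illustrate that point with a hedged sketch (names are illustrative, this is not the mlx-vlm source): collapsing the leading crop dimension into the batch is what lets both split and unsplit inputs reach the vision tower in the same layout:

```swift
import MLX

// Illustrative only: fold an optional "number of crops" dimension into the batch, so
// both (1, 3, 512, 512) and (1, 13, 3, 512, 512) inputs become (N, C, H, W).
func flattenedPixelValues(_ pixelValues: MLXArray) -> MLXArray {
    guard pixelValues.ndim == 5 else { return pixelValues }   // already (N, C, H, W)
    let s = pixelValues.shape                                  // (B, N, C, H, W)
    return pixelValues.reshaped(s[0] * s[1], s[2], s[3], s[4])
}
```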
@davidkoski thanks a lot for the tips!
I typically use the debugger in Xcode for this. You can press cmd-opt-r to set the arguments to the program. You may want to set the Run Configuration to Debug:
My next question, to be able to move on this, is how to set everything up using Xcode? I struggled for 30 minutes and ended up doing it via the terminal. To reiterate, I'm just getting started with Swift and Xcode :)
My next question, to be able to move on this, is how to set everything up using Xcode? I struggled for 30 minutes and ended up doing it via the terminal. To reiterate, I'm just getting started with Swift and Xcode :)
Without knowing specifically what you are running into, it is hard to say. You might look here:
- https://developer.apple.com/xcode/
- https://developer.apple.com/documentation/xcode/
- https://developer.apple.com/videos/developer-tools/
Once you have it installed, you should be able to open mlx-swift-examples.xcodeproj, select a scheme (roughly a target), and build and run.
These are the schemes:
You pick one and a build/run destination; the simplest will be macOS. You can press cmd-opt-r to edit the "run" configuration for the scheme. In the first tab you can select between Debug/Release builds (Debug if you want to run in the debugger, Release to run at full speed). The Arguments tab is useful for llm-tool, to edit the arguments passed to it.
If you run into specific problems just ask!
Thanks!
I'm running macOS :)
Hi @Blaizzy, welcome to Swift-land :) Incidentally, I started working on this before seeing your PR. I have a preliminary question: did you manage to get any SmolVLM running, even before your patch? I tested the default model in this repo (mlx-community/SmolVLM-Instruct-4bit) and I'm getting gibberish (after changing the image token id to 49190).