Prince Canuma
I even copied the transformers GELU activation in numpy to compare, but I get results similar to the `precise` approximation in MLX.
I did exactly that. Here are the implementations I tried, all of which are identical to the ones used in transformers and JAX:

```python
class FastGELUActivation(nn.Module):
    """
    Applies GELU approximation...
```
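For context, a minimal numpy sketch comparing the erf-based ("precise") GELU against the tanh approximation that the fast variants use (function names here are illustrative, not from the actual implementation):

```python
import math
import numpy as np

def gelu_exact(x):
    # Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 1001).astype(np.float32)
max_diff = np.abs(gelu_exact(x) - gelu_tanh(x)).max()
print(max_diff)  # small but nonzero, so the choice of variant does show up in diffs
```

The point of the comparison: the two variants genuinely differ by a small amount per element, so a mismatch at this layer alone would be tiny, not the multi-unit total diffs seen downstream.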
Yet the sum of absolute differences is around 2.39 and 3.77 on the vision path, and the model still refuses a lot. From the start till the first MLP everything is...
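Worth noting: a raw sum of absolute differences grows with the number of elements, so a total of a few units can still mean a tiny per-element error. A sketch with made-up shapes (the real tensor sizes aren't stated in the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 1152)).astype(np.float32)  # illustrative shape
b = a + rng.standard_normal(a.shape).astype(np.float32) * 1e-5  # tiny perturbation

total = np.abs(a - b).sum()      # scales with element count
per_elem = np.abs(a - b).mean()  # size-independent, easier to interpret
print(total, per_elem)
```

Here a per-element error on the order of 1e-5 already produces a total in the low units, which is why a size-normalized metric is more informative.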
I'm not sure what I'm missing here. Let me go for a walk 🚶🏾♂️...
Not yet. Yesterday I tried using the Hugging Face VLM class in my implementation, but that didn't change the results. Let me check the relative distance and let you know.
@awni here are the results:

Language Model (embedding output):
```
Relative Distance (using norms): 0.0
Max Absolute Relative Difference: 0.0
Are Matrices Close (np.allclose): True
```

Vision Model (`patch_embedding` output):...
> What are the formulas for these?

```python
def relative_diff(x1, x2):
    assert x1.shape == x2.shape, "Matrices must have the same dimensions"
    if x1.ndim > 2 or x2.ndim > 2:
        x1...
```
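Since the function above is truncated, here is one common norm-based definition of these metrics as a self-contained sketch (an assumption; the thread's exact formulas are cut off):

```python
import numpy as np

def relative_distance(x1, x2):
    # Norm-based relative distance: ||x1 - x2||_F / ||x1||_F
    return np.linalg.norm(x1 - x2) / np.linalg.norm(x1)

def max_abs_relative_diff(x1, x2, eps=1e-12):
    # Largest elementwise |x1 - x2| / (|x1| + eps); eps avoids division by zero
    return np.max(np.abs(x1 - x2) / (np.abs(x1) + eps))

a = np.ones((4, 4), dtype=np.float32)
print(relative_distance(a, a), max_abs_relative_diff(a, a))  # 0.0 0.0 for identical inputs
```

Both metrics are 0.0 exactly when the two tensors match, which is consistent with the embedding-output numbers reported above.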
Ok, after some deeper debugging, I think the issue is in the multimodal feature merging and/or masking. I'll update you once I have it working.
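The merging step in question typically scatters vision features into the text embedding sequence at the positions of image placeholder tokens. A minimal sketch of that pattern (hypothetical helper, not the actual implementation):

```python
import numpy as np

def merge_multimodal(input_embeds, image_features, image_token_mask):
    """Scatter image features into the text embedding sequence.

    Positions where image_token_mask is True are overwritten, in order,
    with the flattened image features. A subtle off-by-one in the mask
    or a wrong feature ordering here silently corrupts the prompt.
    """
    merged = input_embeds.copy()
    merged[image_token_mask] = image_features.reshape(-1, image_features.shape[-1])
    return merged

seq_len, hidden = 8, 4
embeds = np.zeros((seq_len, hidden), dtype=np.float32)
feats = np.ones((3, hidden), dtype=np.float32)            # 3 image patches
mask = np.array([False, True, True, True] + [False] * 4)  # image token positions
out = merge_multimodal(embeds, feats, mask)
print(out[mask].sum())  # 12.0: the three masked slots now hold the features
```

Comparing the merged embeddings (rather than the raw vision features) against the reference implementation is a quick way to localize a bug in exactly this step.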
@awni @lucasb-eyer I did everything by the book, but the model still doesn't behave properly. It seems to behave better only when using multimodal features from the transformers model....
@awni this weird behaviour also happened with `Idefics2` in the past. The only thing these models have in common is that they use F32 precision.
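To illustrate why the F32 requirement could matter: at half precision, values lose low-order bits very quickly, so a model trained and evaluated in F32 can drift noticeably when run in F16. A tiny demo of the rounding:

```python
import numpy as np

# float16 has a 10-bit mantissa, so at magnitude 2048 the spacing between
# representable values is 2.0 and adding 1.0 is lost to round-to-even.
big, small = np.float16(2048.0), np.float16(1.0)
print(big + small)                          # 2048.0 in half precision
print(np.float32(big) + np.float32(small))  # 2049.0 in single precision
```

If the weights or activations of these models rely on that extra precision, casting them down would explain small per-layer diffs compounding into bad generations.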