Inference error:
TypeError: sum() received an invalid combination of arguments - got (bool, dim=int), but expected one of:
- (Tensor input, *, torch.dtype dtype = None)
- (Tensor input, tuple of ints dim, bool keepdim = False, *, torch.dtype dtype = None, Tensor out = None)
- (Tensor input, tuple of names dim, bool keepdim = False, *, torch.dtype dtype = None, Tensor out = None)
Entire stack trace:
/anaconda/envs/py312llm/lib/python3.12/site-packages/transformers/generation/configuration_utils.py:628: UserWarning: do_sample is set to False. However, temperature is set to 0.0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
warnings.warn(
/anaconda/envs/py312llm/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:838: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
/anaconda/envs/py312llm/lib/python3.12/site-packages/torch/utils/checkpoint.py:86: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
TypeError Traceback (most recent call last) Cell In[6], line 2 1 with torch.inference_mode(): ----> 2 generate_ids = model.generate(**inputs, **generation_args)
File /anaconda/envs/py312llm/lib/python3.12/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
File /anaconda/envs/py312llm/lib/python3.12/site-packages/transformers/generation/utils.py:2255, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
2247 input_ids, model_kwargs = self._expand_inputs_for_generation(
2248 input_ids=input_ids,
2249 expand_size=generation_config.num_return_sequences,
2250 is_encoder_decoder=self.config.is_encoder_decoder,
2251 **model_kwargs,
2252 )
2254 # 12. run sample (it degenerates to greedy search when generation_config.do_sample=False)
-> 2255 result = self._sample(
2256 input_ids,
2257 logits_processor=prepared_logits_processor,
2258 stopping_criteria=prepared_stopping_criteria,
2259 generation_config=generation_config,
2260 synced_gpus=synced_gpus,
2261 streamer=streamer,
2262 **model_kwargs,
2263 )
2265 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
2266 # 11. prepare beam search scorer
2267 beam_scorer = BeamSearchScorer(
2268 batch_size=batch_size,
2269 num_beams=generation_config.num_beams,
(...) 2274 max_length=generation_config.max_length,
2275 )
File /anaconda/envs/py312llm/lib/python3.12/site-packages/transformers/generation/utils.py:3254, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs) 3251 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) 3253 if is_prefill: -> 3254 outputs = self(**model_inputs, return_dict=True) 3255 is_prefill = False 3256 else:
File /anaconda/envs/py312llm/lib/python3.12/site-packages/torch/nn/modules/module.py:1751, in Module._wrapped_call_impl(self, *args, **kwargs) 1749 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1750 else: -> 1751 return self._call_impl(*args, **kwargs)
File /anaconda/envs/py312llm/lib/python3.12/site-packages/torch/nn/modules/module.py:1762, in Module._call_impl(self, *args, **kwargs) 1757 # If we don't have any hooks, we want to skip the rest of the logic in 1758 # this function, and just call forward. 1759 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1760 or _global_backward_pre_hooks or _global_backward_hooks 1761 or _global_forward_hooks or _global_forward_pre_hooks): -> 1762 return forward_call(*args, **kwargs) 1764 result = None 1765 called_always_called_hooks = set()
File ~/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/44464069db9354fe76e98b2c0080b0325f38b20b/modeling_magma.py:674, in MagmaForCausalLM.forward(self, input_ids, pixel_values, image_sizes, attention_mask, position_ids, past_key_values, inputs_embeds, vision_feature_layer, vision_feature_select_strategy, labels, use_cache, output_attentions, output_hidden_states, return_dict) 671 feature_lens = torch.tensor(feature_lens, dtype=torch.long, device=image_features.device) 673 # inputs_embeds = inputs_embeds.to(image_features.dtype) --> 674 inputs_embeds, attention_mask, position_ids, labels = self._merge_input_ids_with_image_features( 675 image_features, 676 feature_lens, 677 inputs_embeds, 678 input_ids, 679 attention_mask, 680 position_ids, 681 labels=labels, 682 ) 684 # pixel_values is not None but is empty ---> text only cases 685 elif pixel_values is not None and input_ids.shape[1] != 1 and pixel_values.size(0) == 0: 686 # there are no images
File ~/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/44464069db9354fe76e98b2c0080b0325f38b20b/modeling_magma.py:448, in MagmaForCausalLM._merge_input_ids_with_image_features(self, image_features, feature_lens, inputs_embeds, input_ids, attention_mask, position_ids, labels, image_token_index, ignore_index) 446 special_image_token_mask = input_ids == image_token_index 447 # special_image_token_mask: [bsz, seqlen] --> 448 num_special_image_tokens = torch.sum(special_image_token_mask, dim=-1) 449 # num_special_image_tokens: [bsz] 450 # Reserve for padding of num_images 451 total_num_special_image_tokens = torch.sum(special_image_token_mask)
TypeError: sum() received an invalid combination of arguments - got (bool, dim=int), but expected one of:
- (Tensor input, *, torch.dtype dtype = None)
- (Tensor input, tuple of ints dim, bool keepdim = False, *, torch.dtype dtype = None, Tensor out = None)
- (Tensor input, tuple of names dim, bool keepdim = False, *, torch.dtype dtype = None, Tensor out = None)
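For context, this error signature means that the value reaching torch.sum is a plain Python bool rather than a boolean tensor mask. A minimal sketch (not the Magma code itself; the names and the token id are illustrative) that reproduces the same message:

import torch

# If the comparison that should build a boolean mask degenerates to a plain
# Python bool (for example because the operands are not both tensors),
# torch.sum receives a bool and raises exactly the TypeError shown above.
input_ids = [1, 2, 3]                                        # a list, not a torch.Tensor
image_token_index = 32000                                    # illustrative token id
special_image_token_mask = input_ids == image_token_index    # -> False (a Python bool)
torch.sum(special_image_token_mask, dim=-1)                  # TypeError: sum() ... got (bool, dim=int)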
Code: https://huggingface.co/microsoft/Magma-8B
import torch
from PIL import Image
from io import BytesIO
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the model and processor
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")
# Inference
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(BytesIO(requests.get(url, stream=True).content))
image = image.convert("RGB")
convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image_end>\nWhat is in this image?"},
]
generation_args = {
    "max_new_tokens": 128,
    "temperature": 0.0,
    "do_sample": False,
    "use_cache": True,
    "num_beams": 1,
}
with torch.inference_mode():
    generate_ids = model.generate(**inputs, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1]:]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)
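As an aside, the do_sample/temperature UserWarning in the logs above is unrelated to the crash; one way to silence it (an assumption, not part of the original report) is to drop temperature, since do_sample=False already gives greedy decoding:

generation_args = {
    "max_new_tokens": 128,
    "do_sample": False,   # greedy decoding, so no temperature is needed
    "use_cache": True,
    "num_beams": 1,
}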
Hi @balakreshnan, thanks for raising this issue. It seems that your inference conversation does not include the <image_start> and <image_end> tokens:
convs = [ {"role": "system", "content": "You are agent that can see, talk and act."}, {"role": "user", "content": "<image_start><image_end>\nWhat is in this image?"}, ]
Please add the token as below:
convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image_end>\nWhat is in this image?"},
]
After the above changes I get this:
/anaconda/envs/py312llm/lib/python3.12/site-packages/transformers/generation/configuration_utils.py:628: UserWarning: do_sample is set to False. However, temperature is set to 0.0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
warnings.warn(
/anaconda/envs/py312llm/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:838: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
/anaconda/envs/py312llm/lib/python3.12/site-packages/torch/utils/checkpoint.py:86: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
ValueError Traceback (most recent call last) Cell In[7], line 2 1 with torch.inference_mode(): ----> 2 generate_ids = model.generate(**inputs, **generation_args)
File /anaconda/envs/py312llm/lib/python3.12/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
File /anaconda/envs/py312llm/lib/python3.12/site-packages/transformers/generation/utils.py:2255, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
2247 input_ids, model_kwargs = self._expand_inputs_for_generation(
2248 input_ids=input_ids,
2249 expand_size=generation_config.num_return_sequences,
2250 is_encoder_decoder=self.config.is_encoder_decoder,
2251 **model_kwargs,
2252 )
2254 # 12. run sample (it degenerates to greedy search when generation_config.do_sample=False)
-> 2255 result = self._sample(
2256 input_ids,
2257 logits_processor=prepared_logits_processor,
2258 stopping_criteria=prepared_stopping_criteria,
2259 generation_config=generation_config,
2260 synced_gpus=synced_gpus,
2261 streamer=streamer,
2262 **model_kwargs,
2263 )
2265 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
2266 # 11. prepare beam search scorer
2267 beam_scorer = BeamSearchScorer(
2268 batch_size=batch_size,
2269 num_beams=generation_config.num_beams,
(...) 2274 max_length=generation_config.max_length,
2275 )
File /anaconda/envs/py312llm/lib/python3.12/site-packages/transformers/generation/utils.py:3254, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs) 3251 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) 3253 if is_prefill: -> 3254 outputs = self(**model_inputs, return_dict=True) 3255 is_prefill = False 3256 else:
File /anaconda/envs/py312llm/lib/python3.12/site-packages/torch/nn/modules/module.py:1751, in Module._wrapped_call_impl(self, *args, **kwargs) 1749 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1750 else: -> 1751 return self._call_impl(*args, **kwargs)
File /anaconda/envs/py312llm/lib/python3.12/site-packages/torch/nn/modules/module.py:1762, in Module._call_impl(self, *args, **kwargs) 1757 # If we don't have any hooks, we want to skip the rest of the logic in 1758 # this function, and just call forward. 1759 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1760 or _global_backward_pre_hooks or _global_backward_hooks 1761 or _global_forward_hooks or _global_forward_pre_hooks): -> 1762 return forward_call(*args, **kwargs) 1764 result = None 1765 called_always_called_hooks = set()
File ~/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/b33355b3cffebdf9d8e60207f30a2cb1193b55c0/modeling_magma.py:674, in MagmaForCausalLM.forward(self, input_ids, pixel_values, image_sizes, attention_mask, position_ids, past_key_values, inputs_embeds, vision_feature_layer, vision_feature_select_strategy, labels, use_cache, output_attentions, output_hidden_states, return_dict) 671 feature_lens = torch.tensor(feature_lens, dtype=torch.long, device=image_features.device) 673 # inputs_embeds = inputs_embeds.to(image_features.dtype) --> 674 inputs_embeds, attention_mask, position_ids, labels = self._merge_input_ids_with_image_features( 675 image_features, 676 feature_lens, 677 inputs_embeds, 678 input_ids, 679 attention_mask, 680 position_ids, 681 labels=labels, 682 ) 684 # pixel_values is not None but is empty ---> text only cases 685 elif pixel_values is not None and input_ids.shape[1] != 1 and pixel_values.size(0) == 0: 686 # there are no images
File ~/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/b33355b3cffebdf9d8e60207f30a2cb1193b55c0/modeling_magma.py:453, in MagmaForCausalLM._merge_input_ids_with_image_features(self, image_features, feature_lens, inputs_embeds, input_ids, attention_mask, position_ids, labels, image_token_index, ignore_index) 451 total_num_special_image_tokens = torch.sum(special_image_token_mask) 452 if total_num_special_image_tokens != num_images: --> 453 raise ValueError( 454 f"Number of image tokens in input_ids ({total_num_special_image_tokens}) different from num_images ({num_images})." 455 ) 456 # Compute the maximum embed dimension 457 # max_image_feature_lens is max_feature_lens per batch 458 feature_lens_batch = feature_lens.split(num_special_image_tokens.tolist(), dim=0)
ValueError: Number of image tokens in input_ids (0) different from num_images (1).
I was getting a similar error while running the inference code with bitsandbytes. I am attaching a portion of the traceback here:
File "/home/io452/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/ee95aa930708b9991562153ec419b64a25e33024/modeling_magma.py", line 674, in forward
inputs_embeds, attention_mask, position_ids, labels = self._merge_input_ids_with_image_features(
File "/home/io452/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/ee95aa930708b9991562153ec419b64a25e33024/modeling_magma.py", line 448, in _merge_input_ids_with_image_features
num_special_image_tokens = torch.sum(special_image_token_mask, dim=-1)
TypeError: sum() received an invalid combination of arguments - got (bool, dim=int), but expected one of:
* (Tensor input, *, torch.dtype dtype)
* (Tensor input, tuple of ints dim, bool keepdim, *, torch.dtype dtype, Tensor out)
* (Tensor input, tuple of names dim, bool keepdim, *, torch.dtype dtype, Tensor out)
I saw that the error was actually coming from cached files that were downloaded from Hugging Face. I first deleted these cached files and then executed the code, but I still got the same error. However, when I restarted my system and executed the code, it ran fine. The warnings are still there, but the error goes away. I do not understand why this happened. I am using a Windows system and running Ubuntu through WSL.
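For reference, the remote code that appears in the tracebacks lives in the transformers_modules cache; a small sketch (path taken from the traces above, adjust for your setup) of removing it so the files are re-fetched on the next run:

import os
import shutil

# Path as it appears in the tracebacks above; transformers will re-download
# the modeling/config files the next time from_pretrained runs.
magma_code_cache = os.path.expanduser("~/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B")
shutil.rmtree(magma_code_cache, ignore_errors=True)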
Hi @balakreshnan, @srvmishra, thanks again for raising this question. I made some changes to the HF model config in the last few days to adapt to the official Transformers library. When you run the inference, you have to set force_download=True to re-download the model and config files once. Sorry for the confusion.
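A minimal sketch of that re-download, reusing the loading code from the post above:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# force_download=True makes from_pretrained ignore the cached copies once
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Magma-8B", trust_remote_code=True, torch_dtype=torch.bfloat16, force_download=True
)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, force_download=True)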
@jwyang I deleted the old model this morning and re-downloaded it, and the above error came from that. Should I still install this as well: pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2?
Hi @jwyang, I did a brand new setup with a new VM and environment and still get the same error:
A new version of the following files was downloaded from https://huggingface.co/microsoft/Magma-8B:
- configuration_magma.py
Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Magma-8B:
- image_tower_magma.py
Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Magma-8B:
- modeling_magma.py
- image_tower_magma.py
Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Fetching 4 files: 100%|██████████| 4/4 [00:47<00:00, 11.80s/it]
INFO:transformers_modules.microsoft.Magma-8B.b33355b3cffebdf9d8e60207f30a2cb1193b55c0.image_tower_magma:Loaded hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg model config.
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 20.92it/s]
Some weights of MagmaForCausalLM were not initialized from the model checkpoint at microsoft/Magma-8B and are newly initialized: ['vision_tower.clip_vision_model.head.proj.weight', 'vision_tower.clip_vision_model.trunk.head.norm.bias', 'vision_tower.clip_vision_model.trunk.head.norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Magma-8B:
- processing_magma.py
Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
/anaconda/envs/magmaenv/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:604: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
MagmaForCausalLM(
  (vision_tower): MagmaImageTower(
    (clip_vision_model): TimmModel(
      (trunk): ConvNeXt( ... )
    )
  )
  ...
)
[full model printout (~31 kB) truncated]
/anaconda/envs/magmaenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:631: UserWarning: do_sample is set to False. However, temperature is set to 0.0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
warnings.warn(
/anaconda/envs/magmaenv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:838: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
/anaconda/envs/magmaenv/lib/python3.10/site-packages/torch/utils/checkpoint.py:86: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
ValueError Traceback (most recent call last) Cell In[9], line 2 1 with torch.inference_mode(): ----> 2 generate_ids = model.generate(**inputs, **generation_args)
File /anaconda/envs/magmaenv/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
File /anaconda/envs/magmaenv/lib/python3.10/site-packages/transformers/generation/utils.py:2465, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, use_model_defaults, **kwargs)
2457 input_ids, model_kwargs = self._expand_inputs_for_generation(
2458 input_ids=input_ids,
2459 expand_size=generation_config.num_return_sequences,
2460 is_encoder_decoder=self.config.is_encoder_decoder,
2461 **model_kwargs,
2462 )
2464 # 12. run sample (it degenerates to greedy search when generation_config.do_sample=False)
-> 2465 result = self._sample(
2466 input_ids,
2467 logits_processor=prepared_logits_processor,
2468 stopping_criteria=prepared_stopping_criteria,
2469 generation_config=generation_config,
2470 synced_gpus=synced_gpus,
2471 streamer=streamer,
2472 **model_kwargs,
2473 )
2475 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
2476 # 11. interleave input_ids with num_beams additional sequences per batch
2477 input_ids, model_kwargs = self._expand_inputs_for_generation(
2478 input_ids=input_ids,
2479 expand_size=generation_config.num_beams,
2480 is_encoder_decoder=self.config.is_encoder_decoder,
2481 **model_kwargs,
2482 )
File /anaconda/envs/magmaenv/lib/python3.10/site-packages/transformers/generation/utils.py:3431, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs) 3428 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) 3430 if is_prefill: -> 3431 outputs = self(**model_inputs, return_dict=True) 3432 is_prefill = False 3433 else:
File /anaconda/envs/magmaenv/lib/python3.10/site-packages/torch/nn/modules/module.py:1751, in Module._wrapped_call_impl(self, *args, **kwargs) 1749 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1750 else: -> 1751 return self._call_impl(*args, **kwargs)
File /anaconda/envs/magmaenv/lib/python3.10/site-packages/torch/nn/modules/module.py:1762, in Module._call_impl(self, *args, **kwargs) 1757 # If we don't have any hooks, we want to skip the rest of the logic in 1758 # this function, and just call forward. 1759 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1760 or _global_backward_pre_hooks or _global_backward_hooks 1761 or _global_forward_hooks or _global_forward_pre_hooks): -> 1762 return forward_call(*args, **kwargs) 1764 result = None 1765 called_always_called_hooks = set()
File ~/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/b33355b3cffebdf9d8e60207f30a2cb1193b55c0/modeling_magma.py:674, in MagmaForCausalLM.forward(self, input_ids, pixel_values, image_sizes, attention_mask, position_ids, past_key_values, inputs_embeds, vision_feature_layer, vision_feature_select_strategy, labels, use_cache, output_attentions, output_hidden_states, return_dict) 671 feature_lens = torch.tensor(feature_lens, dtype=torch.long, device=image_features.device) 673 # inputs_embeds = inputs_embeds.to(image_features.dtype) --> 674 inputs_embeds, attention_mask, position_ids, labels = self._merge_input_ids_with_image_features( 675 image_features, 676 feature_lens, 677 inputs_embeds, 678 input_ids, 679 attention_mask, 680 position_ids, 681 labels=labels, 682 ) 684 # pixel_values is not None but is empty ---> text only cases 685 elif pixel_values is not None and input_ids.shape[1] != 1 and pixel_values.size(0) == 0: 686 # there are no images
File ~/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/b33355b3cffebdf9d8e60207f30a2cb1193b55c0/modeling_magma.py:453, in MagmaForCausalLM._merge_input_ids_with_image_features(self, image_features, feature_lens, inputs_embeds, input_ids, attention_mask, position_ids, labels, image_token_index, ignore_index) 451 total_num_special_image_tokens = torch.sum(special_image_token_mask) 452 if total_num_special_image_tokens != num_images: --> 453 raise ValueError( 454 f"Number of image tokens in input_ids ({total_num_special_image_tokens}) different from num_images ({num_images})." 455 ) 456 # Compute the maximum embed dimension 457 # max_image_feature_lens is max_feature_lens per batch 458 feature_lens_batch = feature_lens.split(num_special_image_tokens.tolist(), dim=0)
ValueError: Number of image tokens in input_ids (0) different from num_images (1).
This is weird; I reran the code and it works well. I tried both:
- official transformers-4.49.0
- https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2
Hello @jwyang. Thank you again for the amazing work.
I would like to refer to my earlier issue here: Evaluation and Finetuning Scripts. I am getting the same error as discussed above in this issue. I explain how I got it in the post below.
In the context of the earlier issue I referenced above, I was trying to get structured output from the magma model. Specifically, I wanted the output in the following JSON format:
{"ACTION": "One of the following UI actions - CLICK, TYPE, or SELECT",
"MARK": "A numeric id, e.g., 5, - this refers to the id of the SoM marker for the UI element on which action is to be taken",
"VALUE": "A string for the value of the action if it is a TYPE action, else None",
"COORDINATES": "location of the UI element on which action is to be taken, normalized by the image dimensions, e.g., (0.83, 0.41)"}
Initially, I tried changing the prompts by adding the output template given above to them. Even then, the Magma model gives only the coordinates and mark values as outputs; adding the output format to the prompts did not make any difference.
Next, I followed this blog, Structured Generation from Images or Documents Using Vision Language Models, to get structured output from the Magma model. The steps in this blog are not directly applicable to Magma, so we modified the code to suit it.
This is the main snippet from the blog:
def get_model_and_processor_class(model_name: str):
    model = AutoModelForImageTextToText.from_pretrained(model_name)
    processor = AutoProcessor.from_pretrained(model_name)
    classes = model.__class__, processor.__class__
    del model, processor
    return classes
model_class, processor_class = get_model_and_processor_class(model_name)
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
model = transformers_vision(
    model_name,
    model_class=model_class,
    device=device,
    model_kwargs={"torch_dtype": torch.bfloat16, "device_map": "auto"},
    processor_kwargs={"device": device},
    processor_class=processor_class,
)
This does not work with Magma, so after some experimentation I replaced it with the following:
from outlines.models.transformers_vision import transformers_vision, TransformersVision
model_name = "microsoft/Magma-8B"
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda:0")
if torch.cuda.is_available():
    device = "cuda:0"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
outlines_model = TransformersVision(model, processor.tokenizer, processor)
For generating the structured output, we created the following class and built the generator:
class Magma_Structured_Output(BaseModel):
    action: str = Field(..., description="One of the actions: CLICK, TYPE, SELECT")
    coordinates: List[float] = Field(..., description="Coordinates of the selected element of the screen")
    mark: int = Field(..., description="SoM marking")
    value: str = Field(..., description="Value for the action")
structured_generator = outlines.generate.json(outlines_model, Magma_Structured_Output)
Now for the output template, prompt and the final generation code:
output_template = {
    "ACTION": "One of the following UI actions - CLICK, TYPE, or SELECT",
    "MARK": "A numeric id, e.g., 5, - this refers to the id of the SoM marker for the UI element on which action is to be taken",
    "VALUE": "A string for the value of the action if it is a TYPE action, else None",
    "COORDINATES": "location of the UI element on which action is to be taken, normalized by the image dimensions, e.g., (0.83, 0.41)",
}
prompt = f"""
You are agent that can see, think and act. Imagine that you are imitating humans doing web navigation for a task step by step.
At each stage, you can see the webpage like humans by a screenshot and know the previous actions before the current step decided by yourself through recorded history.
You need to decide on the following action to take.
You can click an element with the mouse, select an option, or type text with the keyboard.
The output format should be a dictionary like: {output_template}
You are asked to complete the following task: Buy a $25 digital gift card for Tim Stebee, whose email address is [email protected]. Fill in sender name Jeerimiah Waton.
The previous actions you have taken:
[textbox] Recipient Name -> TYPE: Tim Stebee\n[textbox] Recipient Email -> TYPE: [email protected]
For your convenience, I have labeled the candidates with numeric marks and bounding boxes on the screenshot.
What is the next action you would take?
Return your response as a valid JSON object in the format {output_template}
""".strip()
messages = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": prompt}],
    },
]
formatted_prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = structured_generator(formatted_prompt, [image])
print("Result: ", result)
print("\n Code done")
For completeness, here is the initial part of the code, with the imports and the image that is loaded and passed into the model in the snippet above:
import json
import outlines
import outlines.generate
import outlines.generate.json
from outlines.models.transformers import transformers
from outlines.models.transformers_vision import transformers_vision, TransformersVision
from pydantic import BaseModel, Field
from typing import List, Optional
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')
hf_dataset_name = 'MagmaAI/Magma-Mind2Web-SoM'
path_to_store = "my_dataset_dir"
mind2Web_SoM = load_dataset(hf_dataset_name, cache_dir=path_to_store)
print("Dataset structure: ", mind2Web_SoM)
print("\nFirst sample ID: \n", mind2Web_SoM['train'][0]['id'])
print("\nFirst sample screenshot: \n", mind2Web_SoM['train'][0]['image'])
print("\nType of First sample screenshot: \n", type(mind2Web_SoM['train'][0]['image']))
print("\nFirst sample user conversation: \n", mind2Web_SoM['train'][0]['conversations'][0])
print("\nFirst sample assistant conversation: \n", mind2Web_SoM['train'][0]['conversations'][1])
image = mind2Web_SoM['train'][0]['image']
So we are using the first example from the SoM annotated Mind2Web Dataset above.
This code needs the outlines library to run. However, we now get some errors from the outlines library. The first error was:
TypeError: MagmaProcessor.__call__() got an unexpected keyword argument 'text'
Another related error is with the keyword argument 'image' from MagmaProcessor.
I fixed this by going into the outlines/models/transformers_vision.py file and doing the following in the generate method of the TransformersVision class:
# inputs = self.processor(
#     text=prompts, images=media, padding=True, return_tensors="pt"
# ).to(self.model.device)
inputs = self.processor(
    prompts, media, padding=True, return_tensors="pt"
).to(self.model.device)
So, basically, we comment out the original lines and replace them with the uncommented lines above. At this point, we get the following error message:
File "/home/fte5/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/b33355b3cffebdf9d8e60207f30a2cb1193b55c0/modeling_magma.py", line 619, in forward
image_num_patches = [(imsize[imsize.sum(1) > 0,0] * imsize[imsize.sum(1) > 0,1]).tolist() for imsize in image_sizes]
File "/home/fte5/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/b33355b3cffebdf9d8e60207f30a2cb1193b55c0/modeling_magma.py", line 619, in <listcomp>
image_num_patches = [(imsize[imsize.sum(1) > 0,0] * imsize[imsize.sum(1) > 0,1]).tolist() for imsize in image_sizes]
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
After going through the modeling_magma.py, processing_magma.py, and image_processing_magma.py files, I found that the above error occurs because we are running the Magma model on a single image/instance. To get around that, and to keep using the outlines library in a consistent manner, I added the following lines in the generate method of the TransformersVision class:
if len(inputs['pixel_values'].shape) == 4:
    inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
if len(inputs['image_sizes'].shape) == 2:
    inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
These lines are added just after the lines we added earlier to the same file as mentioned above.
Now the code runs, but there is a RuntimeError about a dtype mismatch:
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
To address this, I modified the processor call in the generate method of the TransformersVision class to also cast the inputs to the model dtype:
inputs = self.processor(prompts, media, padding=True, return_tensors="pt").to(self.model.device).to(self.model.dtype)
With this modification, the above error is resolved.
But here is where we run into the error being talked about in this issue:
File "/home/fte5/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/b33355b3cffebdf9d8e60207f30a2cb1193b55c0/modeling_magma.py", line 675, in forward
inputs_embeds, attention_mask, position_ids, labels = self._merge_input_ids_with_image_features(
File "/home/fte5/.cache/huggingface/modules/transformers_modules/microsoft/Magma-8B/b33355b3cffebdf9d8e60207f30a2cb1193b55c0/modeling_magma.py", line 453, in _merge_input_ids_with_image_features
raise ValueError(
ValueError: Number of image tokens in input_ids (0) different from num_images (1).
The line number is 675 instead of 674 because I added an additional print statement to the modeling_magma.py file.
At this point, I followed your suggestion in the comment above about the transformers version and changed the transformers library version in the magma environment. First I tried transformers==4.49.0 and then the version from https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2. In both cases, I get the above error. It is coming from the cached files of the Magma model.
Kindly help us resolve it. We would also appreciate it if you could suggest a way of getting structured output from the Magma model, using any format such as the one we used above. It would be even better if you could release the finetuning and evaluation scripts for the UI datasets Mind2Web and Omniact.
Thank You in advance.
Surprisingly enough, the inference code works fine with all the following transformers versions: the default one you get when first setting up the magma environment (4.51.3), and the versions mentioned in the comment above (4.49.0 and the custom 4.48.2 from https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2).
The inference code also works fine with the image from the SoM annotated Mind2Web Dataset used in the above comment. As mentioned, using the output template in the prompts still does not make any difference in the output.
I realized that I had not included separate system and user prompts in the earlier code. After that change, the above code also worked fine and generated structured outputs as required. However, there is some variability in the output now. I am fixing it by looking further into the outlines library.
Added np.random.seed(0) at the beginning, and sampler=outlines.samplers.greedy() to outlines.generate.json() to get deterministic output.
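A sketch of that change, assuming the outlines API as used earlier in this thread:

import numpy as np
import outlines

np.random.seed(0)  # fix the RNG used for sampling

# greedy sampling removes the remaining randomness in the structured generation
structured_generator = outlines.generate.json(
    outlines_model, Magma_Structured_Output, sampler=outlines.samplers.greedy()
)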
I have a few questions though. Kindly answer them:
- Since outlines changes the sampling technique, is it a good idea to use it during inference and finetuning? What approach did you use to get structured output from the Magma model?
- The coordinates are missing in the annotations of the SoM annotated Mind2Web Dataset. However, the model provides these outputs, which means the coordinates have been used in the finetuning process. How did you get these coordinates? Also, in this context, what parameters did you use in the SoM generation process, because this will determine the number of boxes, and hence the mapping between the mark and the coordinate fields. Again, how do you get the mapping between the mark and the type fields? How do you ensure that nearby boxes for a particular instance of a UI element (so the same type field) are combined into a single box? It would be better if you provide the procedure to map the unannotated Mind2Web dataset into the annotated version you have created, along with the coordinates field.
- How do you evaluate the Magma model on the Mind2Web dataset? In what format is the output taken? Since the coordinates field is not manually provided, there is a likelihood of error in it due to the SoM marking function. The same goes for the mark field. These relate to the SoM parameters in the previous point, or any other annotation method you have utilized. How do you take care of these points in the evaluation?
Thank You again for sharing magma with us!
this is great!
Hi, @srvmishra,
- We did not use anything special for generating structured outputs from the Magma model.
- Magma can output coordinates because the pretraining datasets, including SeeClick and Vision2UI, both contain coordinate information. We converted all box annotations into marks and overlaid them on the screenshots for our pretraining. Note that we do ask the Magma model to predict both the mark and the coordinates during pretraining. For inference, we generate the mark candidates based on OmniParser: given a raw image as input, we first apply OmniParser and then ask Magma to select the candidate mark.
- To evaluate on Mind2Web, we first finetune our pretrained Magma on the Mind2Web training set with SoM. Afterwards, we followed SeeClick to evaluate the finetuned Magma. As mentioned in our paper, we reuse the candidates generated by a pretrained language model in the Mind2Web paper, overlay marks for these candidates, and then ask Magma to select the right one for UI navigation.
Please let me know if these answer your questions.
Hi @jwyang, please see the issue at https://github.com/microsoft/Magma/issues/74#issuecomment-2888962153. Thanks @srvmishra for all the details above.