[workaround provided] tuning microsoft/Phi-4-multimodal-instruct exception
I encountered the following issue while trying to quantize Phi-4-MM using auto-round. I would really appreciate your help in solving the quantization problem for Phi-4-multimodal.
File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/site-packages/auto_round/autoround.py", line 521, in quantize all_inputs = self.try_cache_inter_data_gpucpu(all_first_block_names, self.nsamples, layer_names=layer_names) File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/site-packages/auto_round/autoround.py", line 854, in try_cache_inter_data_gpucpu all_inputs = self.cache_inter_data( File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/site-packages/auto_round/autoround.py", line 913, in cache_inter_data self.calib(nsamples, calib_bs) File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/site-packages/auto_round/autoround.py", line 809, in calib raise error File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/site-packages/auto_round/autoround.py", line 800, in calib self.model(**data_new) File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/home/xueshu/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 2101, in forward input_mode = InputMode(input_mode) File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/enum.py", line 385, in call return cls.new(cls, value) File "/anaconda/envs/azureml_py38_PT_and_TF/lib/python3.10/enum.py", line 710, in new raise ve_exc ValueError: None is not a valid InputMode
Thank you for reporting this issue. Please note that it may take a bit longer to fix, as we currently have limited bandwidth available.
After https://github.com/intel/auto-round/pull/509 is merged, you could try the following workaround without algorithm tuning (iters=0); it produces a ~5.1 GB model:
~~~python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from auto_round import AutoRoundMLLM

model_name = "/models/Phi-4-multimodal-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

group_size = 32
autoround = AutoRoundMLLM(model, tokenizer=tokenizer, processor=processor, iters=0, group_size=group_size)
autoround.quantize_and_save("/data5/wenhuach/phi-int4", format="auto_gptq")
~~~
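For completeness, the exported folder can in principle be loaded back through transformers like any other checkpoint with custom code. A minimal sketch, assuming the output path from the snippet above (the flags are assumptions and this has not been verified end to end for this model):

~~~python
# Sketch only: loading the quantized export back with transformers.
# The path and flags are assumptions taken from the snippet above.
from transformers import AutoModelForCausalLM, AutoProcessor

quantized_path = "/data5/wenhuach/phi-int4"
model = AutoModelForCausalLM.from_pretrained(
    quantized_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(quantized_path, trust_remote_code=True)
~~~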
I tried quantizing the model with the workaround provided above, with no luck; I tried every permitted format, and none of them helped.
The error below is from a run with format='auto_round':
~~~
2025-05-04 18:34:49,862 INFO auto_quantizer.py L319: We suggest you to set torch_dtype=torch.float16 for better efficiency on CUDA
2025-05-04 18:34:49,866 INFO modeling_phi4mm.py L82: create image tower None
2025-05-04 18:34:49,889 INFO modeling_phi4mm.py L115: freeze_img_processor = False
2025-05-04 18:34:49,889 INFO modeling_phi4mm.py L136: learnable separator enabled for hd transform, hd_transform_order = sub_glb
2025-05-04 18:34:49,890 INFO modeling_phi4mm.py L482: create audio processor {'config': {'activation': 'swish', 'activation_checkpointing': {'interval': 1, 'module': 'transformer', 'offload': False}, 'attention_dim': 1024, 'attention_heads': 16, 'batch_norm': False, 'bias_in_glu': True, 'causal': True, 'chunk_size': -1, 'cnn_layer_norm': True, 'conv_activation': 'swish', 'conv_glu_type': 'swish', 'depthwise_multiplier': 1, 'depthwise_seperable_out_channel': 1024, 'dropout_rate': 0.0, 'encoder_embedding_config': {'input_size': 80}, 'ext_pw_kernel_size': 1, 'ext_pw_out_channel': 1024, 'input_layer': 'nemo_conv', 'input_size': 80, 'kernel_size': 3, 'left_chunk': 18, 'linear_units': 1536, 'nemo_conv_settings': {'conv_channels': 1024}, 'num_blocks': 24, 'relative_attention_bias_args': {'t5_bias_max_distance': 500, 'type': 't5'}, 'time_reduction': 8}, 'name': 'cascades'}
.cache\huggingface\modules\transformers_modules\quantized\speech_conformer_encoder.py:2774: FutureWarning: Please specify CheckpointImpl.NO_REENTRANT as CheckpointImpl.REENTRANT will soon be removed as the default and eventually deprecated.
  lambda i: encoder_checkpoint_wrapper(
2025-05-04 18:34:49,967 INFO modeling_phi4mm.py L505: freeze_audio_processor = False
2025-05-04 18:34:49,967 INFO modeling_phi4mm.py L512: gradient checkpointing enabled for audio processor
2025-05-04 18:34:50,069 INFO config.py L54: PyTorch version 2.6.0+cu124 available.
2025-05-04 18:34:51,165 WARNING convert_model.py L564: better backend is found, please install all the following requirements to enable it, pip install -v 'gptqmodel>=2.0' --no-build-isolation
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.24s/it]
--- AUDIO PROCESSING ---
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
['<|system|>You are an intelligent AI service.<|end|><|user|><|audio_1|>Provide a text transcription of the spoken content.<|end|><|assistant|>']
AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\utils\checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Traceback (most recent call last):
  File "AppData\Local\Programs\Python\Python312\Lib\runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "AppData\Local\Programs\Python\Python312\Lib\runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File ".vscode\extensions\ms-python.debugpy-2025.6.0-win32-x64\bundled\libs\debugpy\launcher/../..\debugpy\__main__.py", line 71, in <module>
~~~
I also had to change preprocessor_config.json after quantization, because otherwise it throws errors complaining that audio_compression_rate and a few other arguments are missing. The altered file is attached.
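Roughly, the alteration amounts to carrying the processor fields that the quantized export dropped back over from the original checkpoint. A sketch of doing that programmatically (the paths are placeholders for my local folders, and which keys end up missing may differ on your side):

~~~python
import json
from pathlib import Path

# Placeholder paths: original checkpoint and the quantized output folder.
src = Path("Phi-4-multimodal-instruct/preprocessor_config.json")
dst = Path("quantized/preprocessor_config.json")

original_cfg = json.loads(src.read_text())
quantized_cfg = json.loads(dst.read_text())

# Copy back any fields (e.g. audio_compression_rate) that the export dropped.
for key, value in original_cfg.items():
    quantized_cfg.setdefault(key, value)

dst.write_text(json.dumps(quantized_cfg, indent=2))
~~~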
Some help would be appreciated. TIA.
You probably need to modify modeling_phi4mm.py. The root cause is that for quantized models we typically replace the original layers with quantized versions, and the weights are usually renamed to qweight. This is a common practice in the quantization community, and we rely on their CUDA kernels, so we prefer to keep it as is for now. That said, we have run into similar issues with several models, for example DeepSeekV2 in the latest release, and Qwen2VL when quantizing non-text modules.
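To make the renaming concrete, here is a rough sketch (not the actual auto-round or GPTQ kernel code) of what the replacement looks like: the quantized module carries packed integer buffers named qweight, qzeros and scales instead of a float weight parameter, so any code in modeling_phi4mm.py that reaches for module.weight directly will no longer find it.

~~~python
import torch
import torch.nn as nn

class QuantLinearSketch(nn.Module):
    """Rough stand-in for a GPTQ-style 4-bit linear layer (illustration only)."""
    def __init__(self, in_features, out_features, bits=4, group_size=32):
        super().__init__()
        pack = 32 // bits  # int4 values packed into int32 words
        # Packed integer buffers replace the float `weight` parameter.
        self.register_buffer("qweight", torch.zeros(in_features // pack, out_features, dtype=torch.int32))
        self.register_buffer("qzeros", torch.zeros(in_features // group_size, out_features // pack, dtype=torch.int32))
        self.register_buffer("scales", torch.zeros(in_features // group_size, out_features, dtype=torch.float16))

original = nn.Linear(3072, 3072)
quantized = QuantLinearSketch(3072, 3072)
print(hasattr(original, "weight"))   # True
print(hasattr(quantized, "weight"))  # False -> model code that assumes .weight needs adapting
~~~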