
RuntimeError When Saving Phi 3.5 Vision Due to Shared Tensors

Open jjbuck opened this issue 1 year ago • 6 comments

I’m trying to fine-tune Phi 3.5 Vision using transformers. However, I’m running into an issue trying to save the model during or after training. See below for a minimal reproducible example.

My example below seems to be essentially what's happening in the official "cookbook" example: https://github.com/microsoft/Phi-3CookBook/blob/main/code/04.Finetuning/vision_finetuning/finetune_hf_trainer_docvqa.py#L482-L485.

However, I also see from this other example (https://github.com/microsoft/Phi-3CookBook/blob/6566572c38d53f384801a09dabdd26ad4f7bf76a/code/04.Finetuning/Phi-3-vision-Trainingscript.py#L256) that safe_serialization=False is used. Is that strictly required? The example from finetune_hf_trainer_docvqa.py doesn't seem to use it, and it's not clear to me how that one saves successfully.

Does anyone have any pointers? This issue has been reported in a few other locations, but I haven't come across any solutions - see below.

  1. https://github.com/huggingface/transformers/issues/32354
  2. https://discuss.huggingface.co/t/using-trainer-to-save-a-bartforsequenceclassification-model/81606
  3. https://discuss.huggingface.co/t/runtimeerror-when-saving-phi-3-5-vision-due-to-shared-tensors/116457/1 (My own post on the HF forums earlier today)

The error suggests “saving using safe_serialization=False”…but I’m not sure what the implications of that are.

Minimal Reproducible Example

from transformers import AutoModelForCausalLM
model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto"
)
model.save_pretrained("out")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/AWSBedrockScienceModelDistillationTraining/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2958, in save_pretrained
    raise RuntimeError(
RuntimeError: The weights trying to be saved contained shared tensors [{'model.embed_tokens.weight', 'model.vision_embed_tokens.wte.weight'}] that are mismatching the transformers base configuration. Try saving using `safe_serialization=False` or remove this tensor sharing.
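
Presumably the fallback that the error message suggests avoids the safetensors shared-tensor check by writing the legacy pickle-based pytorch_model.bin instead of a .safetensors file, something like:

# Falls back to torch.save-style serialization, which tolerates shared tensors.
model.save_pretrained("out", safe_serialization=False)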

jjbuck avatar Nov 09 '24 02:11 jjbuck

Same issue here. I also get an error when saving the processor: 'Phi3VProcessor' object has no attribute 'chat_template'.

vjagannath786 avatar Nov 11 '24 03:11 vjagannath786

@ChenRocks any ideas on this?

leestott avatar Nov 11 '24 19:11 leestott

For visibility, a contributor to a forked version of the Phi 3 Vision cookbook suggested the following solution, stating "You need to remove the wte weight. It's okay because when the model is loaded from the checkpoint, it will automatically copy the weight from the embedding weight."

state_dict = model.state_dict()
state_dict = {k:v for k, v in state_dict.items() if "wte" not in k}
model.save_pretrained(args.save_model_path, state_dict=state_dict, safe_serialization=True)
processor.save_pretrained(args.save_model_path)

This does indeed seem to work. However, it doesn't exactly fit into a use case that relies on the Trainer abstraction from the transformers library. The call to the Trainer class's save_model() method doesn't accommodate a state dictionary argument (see https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3719-L3768).
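
In the meantime, one stopgap is to skip trainer.save_model() entirely and call save_pretrained() on the underlying model yourself once training finishes, passing the filtered state dict. A rough sketch (untested; it assumes an existing trainer and processor, with output_dir standing in for wherever you want the checkpoint written):

trainer.train()

# Drop the duplicated vision embedding table ("wte"); per the suggestion above,
# it is recreated from model.embed_tokens when the checkpoint is loaded again.
state_dict = trainer.model.state_dict()
state_dict = {k: v for k, v in state_dict.items() if "wte" not in k}

# With the shared tensor removed, safetensors serialization goes through.
trainer.model.save_pretrained(output_dir, state_dict=state_dict, safe_serialization=True)
processor.save_pretrained(output_dir)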

I opened a feature request against the transformers library to see if there's any opportunity to resolve this on their end. I'm admittedly not familiar enough with the inner workings of transformers or Phi 3/3.5 Vision, but it seems to me that one of three solutions is potentially needed.

  1. Accept a state_dict argument in the Trainer class's save_model() method (https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3719-L3768). This state_dict parameter should then be passed down to the call to the private _save() method (https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3842), which does accept a state_dict argument.
  2. Rather than accepting state_dict as an argument to save_model(), determine an appropriate heuristic such that Phi 3.5 Vision and other architecturally similar models can be saved successfully.
  3. Some change to the way transformers handles shared tensors...?

Can you folks here (@leestott @ChenRocks) comment on idea #2 above? Is there a general abstraction inherent to the architecture of Phi 3/3.5 Vision around which we could develop a heuristic (i.e., to avoid a naive conditional like if model is microsoft/Phi-3.5-vision-instruct: <do something with state_dict>)?

jjbuck avatar Nov 11 '24 19:11 jjbuck

@jjbuck @vjagannath786 @leestott

You could check the code I made. The model can be saved by removing the wte weight in the Trainer class too: inherit from the Trainer class in transformers and override the _save method the way I did.

https://github.com/2U1/Phi3-Vision-Finetune/blob/406eafbbf8c6d84d2a3cc0878376db0a86c39af2/src/training/trainer.py#L205-L210 These are the exact lines of code.
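
For anyone who doesn't want to click through, the idea is roughly the following. This is a simplified sketch rather than the exact code from that repo (the Phi3VTrainer name is just illustrative), and it assumes the stock Trainer._save(output_dir, state_dict) signature:

import os
from transformers import Trainer

class Phi3VTrainer(Trainer):
    def _save(self, output_dir=None, state_dict=None):
        # Default to the configured output directory, as the stock Trainer does.
        output_dir = output_dir if output_dir is not None else self.args.output_dir
        os.makedirs(output_dir, exist_ok=True)

        # Drop the duplicated "wte" vision embedding weights so safetensors
        # no longer sees mismatching shared tensors when writing the checkpoint.
        if state_dict is None:
            state_dict = self.model.state_dict()
        state_dict = {k: v for k, v in state_dict.items() if "wte" not in k}

        self.model.save_pretrained(output_dir, state_dict=state_dict, safe_serialization=True)

        # Save the processor/tokenizer alongside the model, whichever attribute
        # this transformers version exposes.
        tok = getattr(self, "processing_class", None) or getattr(self, "tokenizer", None)
        if tok is not None:
            tok.save_pretrained(output_dir)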

2U1 avatar Nov 12 '24 01:11 2U1

@vjagannath786 You need to copy the chat_template from the tokenizer to the processor itself. The latest transformers release changed the processor classes, which is why this error occurs.

You could do this before saving the processor

processor.chat_template = processor.tokenizer.chat_template
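
If you only want to fill the template in when it is actually missing (so the same call stays safe on versions where the processor already carries one), a small variation of the same fix:

# Copy the tokenizer's chat template onto the processor only if it is absent,
# then save the processor next to the model checkpoint.
if getattr(processor, "chat_template", None) is None:
    processor.chat_template = processor.tokenizer.chat_template
processor.save_pretrained("out")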

2U1 avatar Dec 06 '24 00:12 2U1

I am facing the same problem with the current version of transformers (4.50.0.dev0). Is there another version with these fixes? The following lines need to be added at line # 3997:

state_dict = self.model.state_dict()
state_dict = {k: v for k, v in state_dict.items() if "wte" not in k}

and also, in the next if block:

if self.processing_class.chat_template is None:
    self.processing_class.chat_template = self.processing_class.tokenizer.chat_template

aeroabir avatar Mar 16 '25 18:03 aeroabir

Quick Fix Options:

  • Downgrade to 4.49.0: This version is confirmed to work without the NoneType error and doesn’t require the manual patching you mentioned.
  • Upgrade to Latest Stable: As of now, v4.53.1 is the latest patch release and includes several bug fixes that may resolve your issue — including fixes for multimodal processors and key mapping for VLMs.

Why This Happens
Recent versions introduced changes to how state dicts and chat templates are loaded, especially for multimodal models like Qwen2.5-VL. Your workaround of filtering "wte" keys and setting chat_template manually is a solid patch for models that don't initialize these correctly in newer builds.

If you want to play it safe and avoid future breakage, [this compatibility guide](https://markaicode.com/transformers-version-compatibility-guide/) offers a great overview of version boundaries and migration strategies.

leestott avatar Jul 09 '25 23:07 leestott