Issue while finetuning the model with the provided finetuning script
Hi Team,
As per the provided README, we were able to set up the required environment. But while running the quick-start finetuning script, we are facing the error below.
Command run: torchrun finetune_hf_trainer_nlvr2.py
Error:
2025-03-12 15:51:33.460633: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-12 15:51:33.598414: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1741774893.658306 161928 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741774893.673490 161928 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-12 15:51:33.806237: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
/home/sahil/.local/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:594: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████| 2/2 [00:04<00:00, 2.05s/it]
Resolving data files: 100%|█████████████████████████████████████████████| 17/17 [00:00<00:00, 18.17it/s]
Loading dataset shards: 100%|██████████████████████████████████████████| 22/22 [00:00<00:00, 818.85it/s]
Resolving data files: 100%|█████████████████████████████████████████████| 17/17 [00:00<00:00, 49.33it/s]
training on 1 GPUs
[2025-03-12 15:51:55,166] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-12 15:51:55,942] [INFO] [comm.py:658:init_distributed] cdb=None
0%| | 0/500 [00:00<?, ?it/s]The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
expanded_attn_mask shape: torch.Size([1, 1, 5089, 5089])
causal_4d_mask shape: torch.Size([1, 1, 5089, 5089])
You are not running the flash-attention implementation, expect numerical differences.
expanded_attn_mask shape: torch.Size([1, 1, 1, 5089])
causal_4d_mask shape: torch.Size([1, 1, 1, 5090])
0%| | 0/500 [00:05<?, ?it/s]
Traceback (most recent call last):
File "/home/sahil/personal_rnd/pratiksha/vision_training/PhiCookBook/code/03.Finetuning/vision_finetuning/finetune_hf_trainer_nlvr2.py", line 505, in <module>
main()
File "/home/sahil/personal_rnd/pratiksha/vision_training/PhiCookBook/code/03.Finetuning/vision_finetuning/finetune_hf_trainer_nlvr2.py", line 418, in main
acc = evaluate(
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sahil/personal_rnd/pratiksha/vision_training/PhiCookBook/code/03.Finetuning/vision_finetuning/finetune_hf_trainer_nlvr2.py", line 251, in evaluate
generated_ids = model.generate(
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sahil/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2223, in generate
result = self._sample(
File "/home/sahil/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 3214, in _sample
outputs = model_forward(**model_inputs, return_dict=True)
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sahil/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3.5-vision-instruct/4a0d683eba9f1d0cbfb6151705d1ee73c25a80ca/modeling_phi3_v.py", line 1603, in forward
outputs = self.model(
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sahil/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3.5-vision-instruct/4a0d683eba9f1d0cbfb6151705d1ee73c25a80ca/modeling_phi3_v.py", line 1449, in forward
attention_mask = _prepare_4d_causal_attention_mask(
File "/home/sahil/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 339, in _prepare_4d_causal_attention_mask
attention_mask = attn_mask_converter.to_4d(
File "/home/sahil/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 146, in to_4d
expanded_attn_mask = causal_4d_mask.masked_fill(expanded_attn_mask.bool(), torch.finfo(dtype).min)
RuntimeError: The size of tensor a (5089) must match the size of tensor b (5090) at non-singleton dimension 3
[2025-03-12 15:52:10,329] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 161928) of binary: /home/sahil/anaconda3/envs/phi3v/bin/python
Traceback (most recent call last):
File "/home/sahil/anaconda3/envs/phi3v/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune_hf_trainer_nlvr2.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-03-12_15:52:10
host : sahil
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 161928)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Is this a known issue currently, or is there anything I am missing before running the script?
As an aside, the "AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'" messages are because Google deprecated MessageFactory some time back, but the deprecation warning never seemed to be emitted. As of protobuf==5.29.3, ../.venv/Lib/site-packages/google/protobuf/message_factory.py still had a stub for GetPrototype(). As of protobuf==6.30.2, that stub has been deleted, hence the AttributeError.
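If pinning `protobuf<6` isn't an option, something along these lines should quiet those messages. This is only a workaround sketch, assuming the module-level `GetMessageClass` replacement that newer protobuf releases expose; it just restores the removed stub and has nothing to do with the tensor-size RuntimeError above:

```python
# Hypothetical compatibility shim: restore MessageFactory.GetPrototype on
# protobuf >= 6.x, where the deprecated stub was removed. Import this before
# anything that still calls GetPrototype. Pinning "protobuf<6" is simpler.
from google.protobuf import message_factory

if not hasattr(message_factory.MessageFactory, "GetPrototype"):
    def _get_prototype(self, descriptor):
        # GetMessageClass is the module-level replacement for the old
        # MessageFactory.GetPrototype method.
        return message_factory.GetMessageClass(descriptor)

    message_factory.MessageFactory.GetPrototype = _get_prototype
```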
I've been getting this on Arch too, with no resolution so far.
Same here, I've been getting this non-stop since updating versions recently.
Quick Fix Options:
- Downgrade to 4.49.0: this version is confirmed to work without the `NoneType` error and doesn't require the manual patching you mentioned.
- Upgrade to the latest stable: as of now, v4.53.1 is the latest patch release and includes several bug fixes that may resolve your issue, including fixes for multimodal processors and key mapping for VLMs.
Why This Happens:
Recent versions introduced changes to how state dicts and templates are loaded.
If you want to play it safe and avoid future breakage, [this compatibility guide](https://markaicode.com/transformers-version-compatibility-guide/) offers a great overview of version boundaries and migration strategies.
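Before rerunning, it may also help to confirm which transformers install the script is actually importing (the traceback shows it resolving from ~/.local/lib/... while torch comes from the conda env). A minimal check sketch, assuming you then pin one of the versions suggested above:

```python
# Hypothetical version-check sketch: run with the same interpreter that
# torchrun uses, to see which transformers install is picked up.
import torch
import transformers

print("transformers", transformers.__version__, "from", transformers.__file__)
print("torch       ", torch.__version__)

# Then pin one of the versions suggested above, e.g.:
#   pip install "transformers==4.49.0"   # known-good downgrade
#   pip install "transformers==4.53.1"   # latest patch release mentioned above
```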