Issue while fine-tuning the model with the provided fine-tuning script

Open sahilbandar opened this issue 9 months ago • 3 comments

Hi Team,

As per the provided README, I was able to set up the required environment, but while running the quick-start fine-tuning script I'm facing the error below.

Command run: torchrun finetune_hf_trainer_nlvr2.py

Error:


2025-03-12 15:51:33.460633: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-12 15:51:33.598414: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1741774893.658306  161928 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741774893.673490  161928 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-12 15:51:33.806237: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
/home/sahil/.local/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:594: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████| 2/2 [00:04<00:00,  2.05s/it]
Resolving data files: 100%|█████████████████████████████████████████████| 17/17 [00:00<00:00, 18.17it/s]
Loading dataset shards: 100%|██████████████████████████████████████████| 22/22 [00:00<00:00, 818.85it/s]
Resolving data files: 100%|█████████████████████████████████████████████| 17/17 [00:00<00:00, 49.33it/s]
training on 1 GPUs
[2025-03-12 15:51:55,166] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-12 15:51:55,942] [INFO] [comm.py:658:init_distributed] cdb=None
  0%|                                                                           | 0/500 [00:00<?, ?it/s]The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
expanded_attn_mask shape: torch.Size([1, 1, 5089, 5089])
causal_4d_mask shape: torch.Size([1, 1, 5089, 5089])
You are not running the flash-attention implementation, expect numerical differences.
expanded_attn_mask shape: torch.Size([1, 1, 1, 5089])
causal_4d_mask shape: torch.Size([1, 1, 1, 5090])
  0%|                                                                           | 0/500 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/home/sahil/personal_rnd/pratiksha/vision_training/PhiCookBook/code/03.Finetuning/vision_finetuning/finetune_hf_trainer_nlvr2.py", line 505, in <module>
    main()
  File "/home/sahil/personal_rnd/pratiksha/vision_training/PhiCookBook/code/03.Finetuning/vision_finetuning/finetune_hf_trainer_nlvr2.py", line 418, in main
    acc = evaluate(
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sahil/personal_rnd/pratiksha/vision_training/PhiCookBook/code/03.Finetuning/vision_finetuning/finetune_hf_trainer_nlvr2.py", line 251, in evaluate
    generated_ids = model.generate(
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sahil/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2223, in generate
    result = self._sample(
  File "/home/sahil/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 3214, in _sample
    outputs = model_forward(**model_inputs, return_dict=True)
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sahil/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3.5-vision-instruct/4a0d683eba9f1d0cbfb6151705d1ee73c25a80ca/modeling_phi3_v.py", line 1603, in forward
    outputs = self.model(
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sahil/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3.5-vision-instruct/4a0d683eba9f1d0cbfb6151705d1ee73c25a80ca/modeling_phi3_v.py", line 1449, in forward
    attention_mask = _prepare_4d_causal_attention_mask(
  File "/home/sahil/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 339, in _prepare_4d_causal_attention_mask
    attention_mask = attn_mask_converter.to_4d(
  File "/home/sahil/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 146, in to_4d
    expanded_attn_mask = causal_4d_mask.masked_fill(expanded_attn_mask.bool(), torch.finfo(dtype).min)
RuntimeError: The size of tensor a (5089) must match the size of tensor b (5090) at non-singleton dimension 3
[2025-03-12 15:52:10,329] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 161928) of binary: /home/sahil/anaconda3/envs/phi3v/bin/python
Traceback (most recent call last):
  File "/home/sahil/anaconda3/envs/phi3v/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sahil/anaconda3/envs/phi3v/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune_hf_trainer_nlvr2.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-12_15:52:10
  host      : sahil
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 161928)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Is this a known issue currently, or is there anything I'm missing before running the script?

sahilbandar avatar Mar 12 '25 10:03 sahilbandar

As an aside, the "AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'" messages appear because Google deprecated MessageFactory some time back, but the deprecation warning doesn't seem to have been surfaced. As of protobuf==5.29.3, ../.venv/Lib/site-packages/google/protobuf/message_factory.py still had a stub for GetPrototype(); as of protobuf==6.30.2, that stub has been deleted, hence the AttributeError.
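
For anyone who wants to verify this locally, here's a minimal sketch (illustrative only, not part of the PhiCookBook scripts) that reports whether the installed protobuf still exposes the old GetPrototype stub. The assumption is that pinning protobuf below 6.x keeps the old behaviour until callers migrate to message_factory.GetMessageClass():

```python
# Illustrative check for the protobuf regression described above.
# Assumption: pinning protobuf below 6.x (e.g. `pip install "protobuf<6"`)
# restores the deprecated GetPrototype stub; the longer-term fix is for
# callers to move to message_factory.GetMessageClass().
import google.protobuf
from google.protobuf import message_factory

print("protobuf version:", google.protobuf.__version__)

factory = message_factory.MessageFactory()
if hasattr(factory, "GetPrototype"):
    print("GetPrototype is still present; the AttributeError should not appear.")
else:
    print("GetPrototype was removed in this protobuf release; "
          "pin protobuf<6 or migrate callers to message_factory.GetMessageClass().")
```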

rickvalstar avatar Apr 12 '25 17:04 rickvalstar

I've been getting this on Arch too, with no resolution.

hockeymikey avatar Apr 15 '25 01:04 hockeymikey

Same here, I've been getting this non-stop since updating versions recently.

Ilai-Dabush avatar Apr 29 '25 21:04 Ilai-Dabush

Quick Fix Options (see the version-check sketch below):

  • Downgrade to 4.49.0: this transformers version is confirmed to work without the NoneType error and doesn't require the manual patching you mentioned.
  • Upgrade to the latest stable release: as of now, v4.53.1 is the latest patch release and includes several bug fixes that may resolve your issue, including fixes for multimodal processors and key mapping for VLMs.
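
A minimal sketch of that check, assuming the package in question is transformers and that 4.49.0 and >= 4.53.1 are the boundaries mentioned above (adjust to whatever range actually works in your environment):

```python
# Hypothetical helper: warn if the installed transformers version falls in
# the range suspected above. The "known good" pins are taken from this
# comment (4.49.0 or >= 4.53.1), not from an official compatibility matrix.
from packaging import version  # shipped as a transformers dependency
import transformers

installed = version.parse(transformers.__version__)

if installed == version.parse("4.49.0") or installed >= version.parse("4.53.1"):
    print(f"transformers {installed} should be fine.")
else:
    print(f"transformers {installed} may hit this issue; "
          "try `pip install transformers==4.49.0` or upgrade to >= 4.53.1.")
```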

Why This Happens
Recent transformers versions introduced changes to how state dicts and templates are loaded, which can break fine-tuning scripts written against older releases.

If you want to play it safe and avoid future breakage, [this compatibility guide](https://markaicode.com/transformers-version-compatibility-guide/) offers a great overview of version boundaries and migration strategies.

leestott avatar Jul 09 '25 23:07 leestott