LLaVA-NeXT

Error when using locally downloaded weights of google/siglip-so400m-patch14-384 from Hugging Face

Open 08D20088 opened this issue 4 months ago • 1 comment

@Luodian

TL;DR

I'm trying to fine-tune LLaVA-NeXT OneVision, and when I use locally downloaded weights of google/siglip-so400m-patch14-384, I get the following shape mismatch error:

RuntimeError: size mismatch for vision_model.embeddings.patch_embedding.weight...

This does not happen when I load the weights from the Hugging Face Hub (using the "google/..." identifier), so I suspect an issue with how I'm loading the weights locally or with my config settings. Could anyone advise on the correct way to load local weights, or on the required config changes?
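
As a sanity check, the snapshot itself should load with the plain SigLIP classes if the files are intact; a minimal sketch (assuming transformers >= 4.37, which added SigLIP, and my local path):

    # Sanity check (hypothetical, not something I ran in the failing job):
    # load the local snapshot directly with the SigLIP classes, bypassing
    # LLaVA-NeXT's builder entirely.
    from transformers import SiglipImageProcessor, SiglipVisionModel

    local_path = "checkpoints/google/siglip-so400m-patch14-384"

    model = SiglipVisionModel.from_pretrained(local_path)
    processor = SiglipImageProcessor.from_pretrained(local_path)

    # so400m-patch14-384 should report hidden_size=1152, patch_size=14,
    # image_size=384 -- the shapes the checkpoint tensors actually have.
    print(model.config.hidden_size, model.config.patch_size, model.config.image_size)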

I want to fine-tune LLaVA-NeXT OneVision using a locally downloaded copy of google/siglip-so400m-patch14-384.

  • To do this, I modified ./llava/model/multimodal_encoder/builder.py based on LLaVA-NeXT GitHub Issue #458.

  • I also changed the "mm_vision_tower" field in checkpoints/llava-onevision-qwen2-7b-ov/config.json from:

    "mm_vision_tower": "google/siglip-so400m-patch14-384"
    

    to:

    "mm_vision_tower": "checkpoints/google/siglip-so400m-patch14-384"
    

    because I want to use the locally saved weights instead of downloading them from Hugging Face.

  • Then, I ran the training script:

    bash scripts/train/finetune_ov.sh
    

Training runs successfully when "mm_vision_tower" is set to "google/siglip-so400m-patch14-384", but it fails when I change it to the local path ("checkpoints/google/siglip-so400m-patch14-384").
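
My suspicion, based on the traceback below, is that builder.py routes any vision-tower path that exists on disk to CLIPVisionTower before the "siglip" name check is ever reached, so the local directory is loaded with CLIPVisionModel. A rough paraphrase of the dispatch as I understand it (not verbatim; the exact conditions may differ in your checkout):

    # Paraphrase of llava/model/multimodal_encoder/builder.py (not verbatim).
    import os

    from .clip_encoder import CLIPVisionTower
    from .siglip_encoder import SigLipVisionTower

    def build_vision_tower(vision_tower_cfg, **kwargs):
        vision_tower = getattr(vision_tower_cfg, "mm_vision_tower",
                               getattr(vision_tower_cfg, "vision_tower", None))
        # Any existing local path matches this first branch, so a local SigLIP
        # directory is handed to CLIPVisionTower before "siglip" is checked.
        if os.path.exists(vision_tower) or vision_tower.startswith("openai"):
            return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
        elif "siglip" in vision_tower:
            return SigLipVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, **kwargs)
        raise ValueError(f"Unknown vision tower: {vision_tower}")

With the Hub identifier, os.path.exists(...) is False and the "siglip" branch is taken instead, which would explain why only the local path fails.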

I am encountering the following error:

RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:

Full log:

BASE_RUN_NAME: llavanext-google_siglip-so400m-patch14-384-Qwen_Qwen2-7B-Instruct-mlp2x_gelu-pretrain_blip558k_plain
PREV_STAGE_CHECKPOINT: checkpoints/llava-onevision-qwen2-7b-ov
MID_RUN_NAME: llava-onevision-checkpoints_google_siglip-so400m-patch14-384-checkpoints_Qwen2-7B-Instruct-ov_stage_am9
/LLaVA-NeXT/llava/model/llava_arch.py:215: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if slower_img_feat is not 0:
/LLaVA-NeXT/llava/model/llava_arch.py:215: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if slower_img_feat is not 0:
[2025-07-30 09:05:45,646] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-07-30 09:05:45,742] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
・・・
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
/LLaVA-NeXT/llava/model/llava_arch.py:215: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if slower_img_feat is not 0:
・・・ (the same SyntaxWarning is repeated on the remaining ranks)
/opt/conda/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:945: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/opt/conda/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:945: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2025-07-30 09:05:48,582] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-07-30 09:05:48,591] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-07-30 09:05:48,591] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:103: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
Rank 0:  Overwriting config with {'use_pos_skipping': False, 'pos_skipping_range': 4096, 'mm_spatial_pool_mode': 'bilinear'}
/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
[2025-07-30 09:05:48,798] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
・・・
[2025-07-30 09:05:48,896] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
・・・
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:103: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
128fa4de0f78:9075:9075 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
128fa4de0f78:9075:9075 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.10<0>
128fa4de0f78:9075:9075 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
128fa4de0f78:9075:9075 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
128fa4de0f78:9075:9075 [0] NCCL INFO cudaDriverVersion 12050
NCCL version 2.18.1+cuda12.1
/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
128fa4de0f78:9076:9076 [1] NCCL INFO cudaDriverVersion 12050
128fa4de0f78:9076:9076 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
128fa4de0f78:9076:9076 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.10<0>
128fa4de0f78:9076:9076 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
128fa4de0f78:9076:9076 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
/opt/conda/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:945: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
・・・ (the same FutureWarning is repeated on the remaining ranks)
[2025-07-30 09:05:51,957] [INFO] [comm.py:637:init_distributed] cdb=None
・・・
[2025-07-30 09:05:52,005] [INFO] [comm.py:637:init_distributed] cdb=None
128fa4de0f78:9075:9571 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
128fa4de0f78:9075:9571 [0] NCCL INFO Failed to open libibverbs.so[.1]
128fa4de0f78:9075:9571 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
128fa4de0f78:9075:9571 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.10<0>
128fa4de0f78:9075:9571 [0] NCCL INFO Using network Socket
128fa4de0f78:9076:9627 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
128fa4de0f78:9076:9627 [1] NCCL INFO Failed to open libibverbs.so[.1]
128fa4de0f78:9076:9627 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
128fa4de0f78:9076:9627 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.10<0>
128fa4de0f78:9076:9627 [1] NCCL INFO Using network Socket
/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:103: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
・・・ (the same two FutureWarnings from configuration_llava.py are repeated on the remaining ranks)
・・・

128fa4de0f78:9082:9722 [7] NCCL INFO comm 0x30502590 rank 7 nranks 8 cudaDev 7 busId e1000 commId 0xb52205d78b0b952f - Init COMPLETE
128fa4de0f78:9075:9571 [0] NCCL INFO comm 0x1ec3f770 rank 0 nranks 8 cudaDev 0 busId 1000 commId 0xb52205d78b0b952f - Init COMPLETE
You are using a model of type siglip to instantiate a model of type clip_vision_model. This is not supported for all configurations of models and can yield errors.
・・・ (the same message is repeated on each rank)
Rank 0:  Loading vision tower: checkpoints/google/siglip-so400m-patch14-384
Some weights of CLIPVisionModel were not initialized from the model checkpoint at checkpoints/google/siglip-so400m-patch14-384 and are newly initialized: ['vision_model.embeddings.class_embedding', 'vision_model.pre_layrnorm.bias', 'vision_model.pre_layrnorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
・・・ (the same warning is repeated on the remaining ranks)
[2025-07-30 09:06:07,606] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 876, num_elems = 14.77B
Traceback (most recent call last):
  File "/home/user/mnt/workspace/LLaVA-NeXT/llava/train/train_mem.py", line 4, in <module>
    train()
  File "/LLaVA-NeXT/llava/train/train.py", line 1486, in train
    model = get_model(model_args, training_args, bnb_model_from_pretrained_args)
  File "/LLaVA-NeXT/llava/train/train.py", line 1418, in get_model
    model = LlavaQwenForCausalLM.from_pretrained(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3404, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/LLaVA-NeXT/llava/model/language_model/llava_qwen.py", line 55, in __init__
    self.model = LlavaQwenModel(config)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/LLaVA-NeXT/llava/model/language_model/llava_qwen.py", line 43, in __init__
    super(LlavaQwenModel, self).__init__(config)
  File "/LLaVA-NeXT/llava/model/llava_arch.py", line 41, in __init__
    self.vision_tower = build_vision_tower(config, delay_load=delay_load)
  File "/LLaVA-NeXT/llava/model/multimodal_encoder/builder.py", line 21, in build_vision_tower
    return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/LLaVA-NeXT/llava/model/multimodal_encoder/clip_encoder.py", line 24, in __init__
    self.load_model()
  File "/LLaVA-NeXT/llava/model/multimodal_encoder/clip_encoder.py", line 41, in load_model
    self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3531, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4009, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
        size mismatch for vision_model.embeddings.patch_embedding.weight: copying a param with shape torch.Size([1152, 3, 14, 14]) from checkpoint, the shape in current model is torch.Size([768, 3, 32, 32]).
        size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([729, 1152]) from checkpoint, the shape in current model is torch.Size([50, 768]).
        size mismatch for vision_model.encoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1152, 1152]) from checkpoint, the shape in current model is torch.Size([768, 768]).
・・・
        size mismatch for vision_model.post_layernorm.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([768]).
        You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
[2025-07-30 09:06:12,306] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 9076 closing signal SIGTERM
・・・
[2025-07-30 09:06:12,308] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 9082 closing signal SIGTERM
[2025-07-30 09:06:14,995] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 9075) of binary: /opt/conda/envs/llava/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/llava/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
llava/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-30_09:06:12
  host      : 128fa4de0f78
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9075)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
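
The mismatched shapes are consistent with this reading: [1152, 3, 14, 14] and [729, 1152] are the SigLIP-so400m dimensions (hidden size 1152, patch size 14, (384 // 14)² = 27² = 729 positions), while [768, 3, 32, 32] and [50, 768] are the CLIPVisionConfig defaults (hidden size 768, patch size 32, (224 / 32)² + 1 = 50 positions) that the model falls back to when the SigLIP config cannot be mapped onto it. If that is right, the failure should be reproducible outside the trainer with transformers alone (a sketch, using my local snapshot path):

    # Hypothetical standalone repro: load the SigLIP snapshot the way the
    # failing code path does, i.e. through the CLIP classes.
    from transformers import CLIPVisionConfig, CLIPVisionModel

    local_path = "checkpoints/google/siglip-so400m-patch14-384"

    # The siglip config cannot be mapped onto CLIPVisionConfig, so the
    # defaults (hidden_size=768, patch_size=32, image_size=224) remain.
    cfg = CLIPVisionConfig.from_pretrained(local_path)
    print(cfg.hidden_size, cfg.patch_size, cfg.image_size)  # expect 768 32 224

    # This should raise the same "size mismatch" RuntimeError as the trainer.
    CLIPVisionModel.from_pretrained(local_path)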

08D20088 • Jul 30 '25 09:07