multi gpu - transformers/modeling_utils.py - Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.

Open · manishiitg opened this issue on Feb 1, 2024 · 3 comments

Please check that this issue hasn't been reported before.

  • [X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training should start normally: the base model should load across the GPUs with accelerate + DeepSpeed ZeRO-3, and QLoRA fine-tuning should begin.

Current behaviour

Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 55, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 80, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 624, in load_model
    raise err
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 585, in load_model
    model = getattr(transformers, model_type).from_pretrained(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3924, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 310, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
[2024-02-01 07:03:05,572] [ERROR] [axolotl.load_model:623] [PID:77] [RANK:0] Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
Traceback (most recent call last):
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 585, in load_model
    model = getattr(transformers, model_type).from_pretrained(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3924, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 310, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
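
For context, the ValueError is raised by a plain shape comparison inside accelerate's set_module_tensor_to_device. The sketch below is only an illustration of that guard, under the assumption (consistent with ZeRO-3 partitioning) that the destination parameter's local shard has already been emptied to torch.Size([0]) before the full embedding weight from the checkpoint shard is copied in; it is not axolotl's or accelerate's actual loader code.

# Minimal sketch (assumption: mirrors accelerate's shape guard, not a verbatim copy).
# Under DeepSpeed ZeRO-3, partitioned parameters are left with an empty local shard
# (shape torch.Size([0])), so copying the full [32000, 4096] embedding weight from a
# checkpoint shard into one trips the guard below.
import torch
import torch.nn as nn

module = nn.Embedding(32000, 4096)
module.weight.data = torch.empty(0)   # emulate a ZeRO-3-partitioned (empty) parameter
new_value = torch.empty(32000, 4096)  # tensor arriving from the checkpoint shard

old_value = module.weight
if old_value.shape != new_value.shape:
    raise ValueError(
        f'Trying to set a tensor of shape {new_value.shape} in "weight" '
        f"(which has shape {old_value.shape}), this look incorrect."
    )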

Steps to reproduce

!docker run --gpus all \
  -v /root/.cache:/root/.cache \
  -v /home/gcpuser/sky_workdir:/sky_workdir \
  winglian/axolotl:main-py3.10-cu118-2.0.1 \
  accelerate launch -m axolotl.cli.train /sky_workdir/hi-qlora-hi-2.yaml --deepspeed /sky_workdir/zero3_bf16.json
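
The contents of /sky_workdir/zero3_bf16.json are not included in the report. The dict below is purely a hypothetical example of what a ZeRO-3 + bf16 DeepSpeed config of that name typically looks like (loosely modeled on the zero3_bf16.json style shipped with axolotl), written out in Python for illustration; the actual file may differ.

# Hypothetical only: the real /sky_workdir/zero3_bf16.json is not part of this report.
import json

zero3_bf16 = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "wall_clock_breakdown": False,
}
print(json.dumps(zero3_bf16, indent=2))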

Config yaml

base_model: teknium/OpenHermes-2.5-Mistral-7B
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

chat_template: chatml
datasets:
  - path: manishiitg/chat-instruct-hi-v4
    type: completion

hub_model_id: manishiitg/open-aditi-chat-hi-1.5
hf_use_auth_token: true

wandb_project: open-aditi-chat-hi-1.5

dataset_prepared_path: manishiitg
push_dataset_to_hub: manishiitg
val_set_size: 0
output_dir: /sky-notebook/manishiitg/open-aditi-chat-hi-1.5

adapter: qlora
lora_model_dir:
save_safetensors: true

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

lora_modules_to_save:
  - embed_tokens
  - lm_head

wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 9
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

adam_beta2: 0.95
adam_epsilon: 0.00001
max_grad_norm: 1.0

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true  ## manage check point resume from here
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 20  ## increase based on your dataset
save_strategy: steps
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: ""
  eos_token: ""
  unk_token: ""
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

No response

Which Operating Systems are you using?

  • [X] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this bug has not been reported yet.
  • [X] I am using the latest version of axolotl.
  • [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

manishiitg · Feb 01 '24 07:02