Cannot resume from checkpoint when 8-bit LoRA fine-tuning with deepspeed zero2_offload
Describe the issue
Issue:
The error below occurs when trying to resume 8-bit LoRA fine-tuning with deepspeed zero2_offload. Note that the checkpoint at ./checkpoints/output_dir/checkpoint-10000 exists when this command is run. Any help resolving this would be really appreciated. Thank you!
Command:
HF_DATASETS_OFFLINE=0 TRANSFORMERS_OFFLINE=0 deepspeed llava/train/train.py \
--bits 8 \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed ./scripts/zero2_offload.json \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--version v1 \
--data_path ./path/to/our/data.json \
--image_folder ./path/to/image/folder \
--vision_tower openai/clip-vit-large-patch14-336 \
--pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b-pretrain/mm_projector.bin \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir ./checkpoints/output_dir \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 10000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 48 \
--lazy_preprocess True \
--report_to wandb
Log:
Traceback (most recent call last):
File "/home/username/LLaVA/llava/train/train.py", line 1165, in <module>
train()
File "/home/username/LLaVA/llava/train/train.py", line 1141, in train
trainer.train(resume_from_checkpoint=True)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/transformers/trainer.py", line 1676, in _inner_training_loop
deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/transformers/deepspeed.py", line 383, in deepspeed_load_checkpoint
load_path, _ = deepspeed_engine.load_checkpoint(
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2752, in load_checkpoint
load_path, client_states = self._load_checkpoint(load_dir,
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2837, in _load_checkpoint
self.load_module_state_dict(checkpoint=checkpoint,
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2615, in load_module_state_dict
self.module.load_state_dict(
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
load(self, state_dict)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2127, in load
load(child, child_state_dict, child_prefix)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2127, in load
load(child, child_state_dict, child_prefix)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2127, in load
load(child, child_state_dict, child_prefix)
[Previous line repeated 5 more times]
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2121, in load
module._load_from_state_dict(
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 406, in _load_from_state_dict
super()._load_from_state_dict(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys,
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1991, in _load_from_state_dict
hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 72, in __call__
return self.hook(*args, **kwargs)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 356, in maybe_rearrange_weight
tile_indices = get_tile_inds(weight_format, weight.device)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 247, in get_tile_inds
return get_inverse_transform_indices(transform, _get_tile_size(format)).to(device)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 79, in get_inverse_transform_indices
permuted_tile_i = transform_tile(sample_tile_i)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 245, in <lambda>
transform = lambda x: F.transform(x.to(device), from_order="row", to_order=format)[0].to(x.device)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/bitsandbytes/functional.py", line 2080, in transform
prev_device = pre_call(A.device)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/bitsandbytes/functional.py", line 416, in pre_call
torch.cuda.set_device(device)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/torch/cuda/__init__.py", line 406, in set_device
device = _get_device_index(device)
File "/home/username/miniconda3/envs/llava/lib/python3.9/site-packages/torch/cuda/_utils.py", line 34, in _get_device_index
raise ValueError(f"Expected a cuda device, but got: {device}")
ValueError: Expected a cuda device, but got: cpu
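For context, the bottom of the trace can be reproduced without LLaVA or DeepSpeed at all: bitsandbytes' load hook (maybe_rearrange_weight) ends up calling torch.cuda.set_device with the device the Int8 weight currently lives on, and under zero2_offload that weight is still on CPU when the module state dict is restored. A minimal sketch of just that final call (a plain CPU tensor standing in for the offloaded weight) raises the same ValueError:

import torch

# A CPU tensor standing in for an 8-bit weight that is still offloaded
# to CPU when DeepSpeed restores the module state dict.
weight = torch.zeros(4, 4)

try:
    # Same call that bitsandbytes' pre_call() makes with the weight's device;
    # torch rejects anything that is not a CUDA device.
    torch.cuda.set_device(weight.device)
except ValueError as e:
    print(e)  # Expected a cuda device, but got: cpu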
Try using zero2?
Same error with this unfortunately!
Did anyone find a solution? I've been stuck on this for the past few days!
deepspeed llava/train/train_mem.py \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed ./scripts/zero3.json \
--model_name_or_path /cache/hub/llava-v1.5-7b \
--version llava_llama_2 \
--data_path /sample.json \
--image_folder /images \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir /zero3-llava-lora-sample \
--num_train_epochs 10 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 8 \
--evaluation_strategy 'no' \
--save_strategy 'epoch' \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 10 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
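Not a fix, but when resume fails it can be worth confirming that the checkpoint directory actually contains the DeepSpeed engine state that load_checkpoint() expects. A quick sketch, assuming the usual HF Trainer + DeepSpeed layout (a trainer_state.json, a "latest" tag file, and a global_step*/ folder with the engine/optimizer shards); the path below is the one from the original post, so adjust it to your own output_dir:

import os

# Path taken from the original post; point this at your own checkpoint directory.
ckpt = "./checkpoints/output_dir/checkpoint-10000"

print(sorted(os.listdir(ckpt)))

# DeepSpeed's default tag lives in a file called "latest"; its value should
# match a global_step*/ subfolder holding the engine/optimizer state.
latest = os.path.join(ckpt, "latest")
if os.path.isfile(latest):
    with open(latest) as f:
        print("DeepSpeed checkpoint tag:", f.read().strip())
else:
    print("No 'latest' tag file found; possibly only model/adapter weights were saved, "
          "which is not enough for DeepSpeed to resume.")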