DeepSpeed
[BUG] is_zero_init_model is always False when I'm using zero_init!
Describe the bug When I fine-tune llama2 with DeepSpeed ZeRO-3, I set "zero3_init_flag: true" in my accelerate config, and "is_deepspeed_zero3_enabled()" in transformers/integrations/deepspeed.py evaluates to True. However, "is_zero_init_model" evaluates to False in _configure_distributed_model of deepspeed/runtime/engine.py. Is this abnormal?
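For context, this is the kind of check I would run on the model right after from_pretrained to see whether zero.Init actually partitioned it (just a rough sketch; report_zero_init_state is an ad-hoc helper of mine, and it assumes zero.Init tags partitioned parameters with ds_id attributes and leaves their local storage empty):

import torch
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

def report_zero_init_state(model: torch.nn.Module) -> None:
    # True when transformers detected a ZeRO-3 config (e.g. via zero3_init_flag).
    print("is_deepspeed_zero3_enabled:", is_deepspeed_zero3_enabled())
    params = list(model.parameters())
    # zero.Init attaches ds_* attributes to every parameter it partitions.
    partitioned = sum(hasattr(p, "ds_id") for p in params)
    print(f"parameters with ds_id: {partitioned}/{len(params)}")
    # Partitioned parameters have empty local storage until they are gathered.
    print("first parameter local numel:", params[0].numel())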
To Reproduce Here is my code:
import os
from datasets import load_dataset
import torch
from transformers import LlamaForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
import bitsandbytes as bnb
from peft import LoraConfig
from trl import SFTTrainer

base_model_name = "/home/yangtong/data/llama2-hf/llama2-13b-chat_hf"
dataset = load_dataset("json", data_files="Belle_open_source_0.5M_changed.json", split="train")
result_dir = "tmp"

training_args = TrainingArguments(
    report_to="none",
    output_dir=result_dir,
    # per_device_train_batch_size * gradient_accumulation_steps = batch_size
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    logging_steps=10,
    # max_steps=520,
    num_train_epochs=0.016,
    save_steps=500,
    bf16=True,  # set bf16 to True with an A100
    # optim='paged_adamw_32bit',
    gradient_checkpointing=True
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = LlamaForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

models = find_all_linear_names(base_model)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=models
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
tokenizer.pad_token = tokenizer.eos_token
max_seq_length = 512

trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args
)
trainer.train()

output_dir = os.path.join(result_dir, "final_checkpoint")
trainer.model.save_pretrained(output_dir)
Here is my accelerate config:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /home/yangtong/ft_dis/ds_config/3.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: 'c10d'
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Here is my deepspeed config:
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-4,
      "betas": [0.9, 0.999],
      "eps": "auto",
      "weight_decay": "auto",
      "adam_w_mode": true,
      "torch_adam": true
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto",
      "total_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "sub_group_size": 1e9,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16,
  "wall_clock_breakdown": false
}
Expected behavior Parameters should be partitioned first and then loaded onto the GPUs.
System info (please complete the following information):
- OS: Ubuntu 22.04.4 LTS (Linux 5.15.0-106-generic)
- GPU count and types: 2 x Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
- Python version: 3.10.13
- PyTorch version: 2.2.2
- CUDA version: 11.8.0
- bitsandbytes==0.43.0
- huggingface_hub==0.23.2
- accelerate==0.30.1
- transformers==4.41.1
- peft==0.9.0
- deepspeed==0.14.0
Launcher context
accelerate launch \
--config_file "config/z3_3.yaml" \
--num_processes 1 \
ft_acc.py
Here is the relevant section of engine.py (screenshot not included):
I would truly appreciate it if anyone can help me solve this! @loadams @tjruwase @deepcharm
Maybe this link will help, https://huggingface.co/docs/transformers/main/en/deepspeed?models=pretrained+model#non-trainer-deepspeed-integration
@Taiinguyenn139 Thanks for your reply! I have tried it, but it didn't work. Here is my code; maybe it is not correct:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'
from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
import bitsandbytes as bnb
from peft import LoraConfig
from trl import SFTTrainer
from accelerate import Accelerator
accelerator = Accelerator()
from transformers.integrations import HfDeepSpeedConfig
import deepspeed
ds_config = "ds_config/3.json"
dschf = HfDeepSpeedConfig(ds_config)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model_name = "/home/yangtong/data/llama2-hf/llama2-13b-chat_hf"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1
# engine = deepspeed.initialize(model=base_model, config_params=ds_config)
dataset = load_dataset("json", data_files="Belle_open_source_0.5M_changed.json", split="train")
result_dir = "tmp"
training_args = TrainingArguments(
    report_to="wandb",
    output_dir=result_dir,
    # per_device_train_batch_size * gradient_accumulation_steps = batch_size
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    logging_steps=10,
    # max_steps=520,
    num_train_epochs=0.037,
    save_steps=500,  # 65
    bf16=True,
    # optim='paged_adamw_32bit',
    gradient_checkpointing=True,
    # group_by_length=True,
    # remove_unused_columns=False,
    # warmup_ratio=0.03,
    # lr_scheduler_type='constant',
    # max_grad_norm=0.3
)
models = ['v_proj', 'gate_proj', 'down_proj', 'k_proj', 'q_proj', 'o_proj', 'up_proj']
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=models
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
tokenizer.pad_token = tokenizer.eos_token
max_seq_length = 512
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args
)
trainer.train()
output_dir = os.path.join(result_dir, "final_checkpoint")
trainer.model.save_pretrained(output_dir)
# trainer.save_model(output_dir) # Stage-3
I have some questions about the approach you provided: (1) The situation in this link is "Non-Trainer DeepSpeed integration". I'm using SFTTrainer in my code; doesn't that count as the Trainer integration? (2) I'm using accelerate, and in my original code I set TrainingArguments before from_pretrained, following https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/deepspeed#constructing-massive-models:~:text=If%20you%20want%20to%20use%20a,is%20how%20example%20scripts%20are%20written.. Is it necessary to set HfDeepSpeedConfig?
(1) In my experience, you can run ZeRO-3 with SFTTrainer or Trainer. (2) I don't use accelerate; I use the deepspeed command like this:
deepspeed train.py
You don't need to set HfDeepSpeedConfig.
(3) To be more clear: ZeRO stage 3 won't shard your params because you are using QLoRA, as discussed in this post: https://www.reddit.com/r/LocalLLaMA/comments/1ai5mv3/thoughts_on_qlora_with_fsdp/ It just offloads your params to the CPU. So is_zero_init_model always being False may be expected behavior.
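If you want to double-check whether the weights were actually sharded, you can compare what each rank holds locally against the model's logical size, roughly like this (just a sketch; count_local_elements is an ad-hoc helper, and it assumes zero.Init leaves partitioned params with empty local storage and ds_numel/ds_id attributes):

import torch.distributed as dist

def count_local_elements(model):
    # Under ZeRO-3 zero.Init, a partitioned parameter has empty local storage
    # (numel() == 0) and carries ds_numel with its full, unpartitioned size.
    local = sum(p.numel() for p in model.parameters())
    logical = sum(getattr(p, "ds_numel", p.numel()) for p in model.parameters())
    partitioned = sum(hasattr(p, "ds_id") for p in model.parameters())
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank {rank}: partitioned={partitioned}, local={local:,}, logical={logical:,}")
    # If the model was built under zero.Init, local should be close to zero;
    # with a 4-bit QLoRA load you will likely see local == logical and partitioned == 0.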
@Taiinguyenn139 Thank you for your help! I'm using SFTTrainer, so I think I don't need HfDeepSpeedConfig. Also, running my original code with "accelerate launch --config_file "config/z3_3.yaml" --num_processes 1 ft_acc.py" should be equivalent to "deepspeed ft_acc.py" with deepspeed="config_path" added to TrainingArguments. Based on the link you provided, I tried ZeRO-3 + LoRA instead of ZeRO-3 + QLoRA (just removing bnb_config = BitsAndBytesConfig(...)), and it worked! The parameters are sharded first and then loaded onto each GPU. It looks like zero3_init doesn't support QLoRA, but apart from that link I couldn't find any information about this. Maybe I'll open another issue to ask about it. I would truly appreciate any other info that could help me!
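For reference, the working ZeRO-3 + LoRA load is just the original from_pretrained call without the quantization config (a rough sketch of the change; everything else in the script above stays the same):

# With zero3_init_flag: true and no BitsAndBytesConfig, from_pretrained runs under
# deepspeed.zero.Init, so the bf16 weights are partitioned across ranks as they load.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16
)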
@Taiinguyenn139, thanks for helping to resolve this issue.
Closing this issue.