[BUG] The training process gets stuck without any error report
Describe the bug
When training llama2-7B, training gets stuck at the end of the first epoch without any error message. ZeRO-1, ZeRO-3, fp16, and bf16 all run into this problem. The hardware environment is 8x A100 40GB. The framework used is transformers, launched with DeepSpeed.
Screenshots
The loss print:
```
{'loss': 0.0138, 'learning_rate': 6.989640463132237e-05, 'epoch': 0.97}
{'loss': 0.0135, 'learning_rate': 6.95917123705058e-05, 'epoch': 0.98}
{'loss': 0.015, 'learning_rate': 6.928702010968921e-05, 'epoch': 0.98}
{'loss': 0.0117, 'learning_rate': 6.898232784887264e-05, 'epoch': 0.99}
```
The tqdm print (the total number of epochs is 3):
```
33%|███▎ | 1126/3384 [05:27<10:53, 3.46it/s]
33%|███▎ | 1127/3384 [05:27<10:53, 3.45it/s]
33%|███▎ | 1128/3384 [05:28<10:53, 3.45it/s]
```
It gets stuck at these outputs, at the end of the first epoch of training.
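A minimal sketch, assuming the hang is reproducible, of how to see where each rank is blocked; the signal choice and the timeout are arbitrary, and none of this is in the training script shown below:

```python
import faulthandler
import signal
import sys

# Sending `kill -USR1 <pid>` to a stuck rank dumps the Python traceback of
# every thread to stderr without killing the process, which usually shows
# the collective call the rank is blocked in.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Also dump all tracebacks every 30 minutes as a heartbeat; a hung run keeps
# printing the same blocked frame.
faulthandler.dump_traceback_later(timeout=1800, repeat=True, file=sys.stderr)
```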
### Tasks
@648652443 can you please provide a script to reproduce this error?
Just the script, right?
ds_config.json:
```json
{
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"total_num_steps": "auto",
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 500,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
```
and the trainer arguments and settings are:
```python
training_args = Seq2SeqTrainingArguments(
save_dir,
do_train=True if args.evaluate_dir is None else False,
do_eval=False,
warmup_ratio=args.warmup_ratio,
evaluation_strategy="no",
logging_strategy="steps",
logging_dir="./output_log",
logging_steps=10,
save_strategy="epoch",
save_total_limit = 2,
learning_rate= args.lr,
eval_accumulation_steps=args.eval_acc,
per_device_train_batch_size=args.bs,
per_device_eval_batch_size=args.eval_bs,
weight_decay=0.01,
num_train_epochs=args.epoch,
predict_with_generate=args.use_generate,
generation_max_length=args.output_len,
report_to="none",
local_rank=args.local_rank
)
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=train_set,
eval_dataset=None,
data_collator=data_collator,
tokenizer=tokenizer,
# compute_metrics = compute_metrics_rouge
)
```
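The snippet above does not show where ds_config.json is attached to the trainer. A minimal sketch of the usual wiring, assuming it goes through the `deepspeed` field of the training arguments (the concrete values only mirror the launch command below and are otherwise assumptions); passing the config this way is also what lets the Trainer resolve the "auto" entries in ds_config.json:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: attach the DeepSpeed config via the `deepspeed` field so the
# Trainer fills every "auto" entry (batch size, lr, scheduler steps, ...)
# from these arguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="experiments/llama",    # assumption: mirrors --output_dir
    deepspeed="ds_config.json",
    per_device_train_batch_size=2,     # mirrors --bs 2
    learning_rate=1e-4,                # mirrors --lr 1e-4
    num_train_epochs=3,                # mirrors --epoch 3
    bf16=True,                         # assumption: matches the bf16 config block
    report_to="none",
)
```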
and run with the DeepSpeed launcher:
```bash
deepspeed --num_gpus=8 main_central_llama_acclerator.py \
--data_root gpt \
--model ./pretrained-models/llama2-7b \
--epoch 3 --lr 1e-4 \
--user_msg llama2_all0.1 \
--bs 2 --eval_bs 1 --input_len 1536 --output_len 1664 \
--transform_axis --warmup_ratio 0.03 \
--all_data 0.01 \
--eval_subset dataset/general_gpt \
--output_dir experiments/llama \
--deepspeed ds_config.json
```
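A minimal sketch, purely as an assumption about how one might debug this, of extra logging that often narrows down where a multi-GPU hang happens; these variables have to be set before torch.distributed / DeepSpeed create their NCCL communicators, e.g. at the very top of the training script or exported before `deepspeed` is invoked:

```python
import os

# Per-rank NCCL activity (init, collectives, transport selection).
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Extra consistency checks and logging for torch.distributed collectives.
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
```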
Same issue here; it gets stuck at the beginning of training:
```
Loading extension module cpu_adam...
Time to load cpu_adam op: 37.60238528251648 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 37.519699811935425 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 37.619828939437866 seconds
Parameter Offload: Total persistent parameters: 11800576 in 417 params
0%| | 0/6 [00:00<?, ?it/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
0%| | 0/6 [00:02<?, ?it/s]
0%| | 0/12 [00:00<?, ?it/s]
```
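A minimal sketch (the print and its placement are assumptions) of checking which rank stalls; if one rank's line never appears right before trainer.train(), that rank is the one the others are waiting for:

```python
import torch.distributed as dist

# Print from every rank just before entering the training loop.
if dist.is_available() and dist.is_initialized():
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} entering training loop",
          flush=True)
```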
Any solution please! @mrwyattii

Same question.