[BUG] The training process gets stuck without any error report
Describe the bug
When training llama2-7B, training gets stuck at the end of the first epoch without any error message. ZeRO-1, ZeRO-3, fp16, and bf16 all run into this problem. The hardware environment is 8x A100 40GB. The framework used is transformers, launched with DeepSpeed.
Screenshots
The loss print:
```
{'loss': 0.0138, 'learning_rate': 6.989640463132237e-05, 'epoch': 0.97}
{'loss': 0.0135, 'learning_rate': 6.95917123705058e-05, 'epoch': 0.98}
{'loss': 0.015, 'learning_rate': 6.928702010968921e-05, 'epoch': 0.98}
{'loss': 0.0117, 'learning_rate': 6.898232784887264e-05, 'epoch': 0.99}
```
The tqdm print (the total number of epochs is 3):
```
33%|███▎ | 1126/3384 [05:27<10:53, 3.46it/s]
33%|███▎ | 1127/3384 [05:27<10:53, 3.45it/s]
33%|███▎ | 1128/3384 [05:28<10:53, 3.45it/s]
```
It gets stuck at these outputs, at the end of the first epoch of training.
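A minimal sketch, assuming the hang is reproducible, of how to see where each rank is blocked; the signal choice and the timeout are arbitrary, and none of this is in the training script shown below:

```python
import faulthandler
import signal
import sys

# Sending `kill -USR1 <pid>` to a stuck rank dumps the Python traceback of
# every thread to stderr without killing the process, which usually shows
# the collective call the rank is blocked in.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Also dump all tracebacks every 30 minutes as a heartbeat; a hung run keeps
# printing the same blocked frame.
faulthandler.dump_traceback_later(timeout=1800, repeat=True, file=sys.stderr)
```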
### Tasks
@648652443 can you please provide a script to reproduce this error?
Just the script, right?
ds_config.json:
```json
{
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"total_num_steps": "auto",
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 500,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
```
and the trainer arguments and settings are:
```python
training_args = Seq2SeqTrainingArguments(
save_dir,
do_train=True if args.evaluate_dir is None else False,
do_eval=False,
warmup_ratio=args.warmup_ratio,
evaluation_strategy="no",
logging_strategy="steps",
logging_dir="./output_log",
logging_steps=10,
save_strategy="epoch",
save_total_limit = 2,
learning_rate= args.lr,
eval_accumulation_steps=args.eval_acc,
per_device_train_batch_size=args.bs,
per_device_eval_batch_size=args.eval_bs,
weight_decay=0.01,
num_train_epochs=args.epoch,
predict_with_generate=args.use_generate,
generation_max_length=args.output_len,
report_to="none",
local_rank=args.local_rank
)
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=train_set,
eval_dataset=None,
data_collator=data_collator,
tokenizer=tokenizer,
# compute_metrics = compute_metrics_rouge
)
```
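The snippet above does not show where ds_config.json is attached to the trainer. A minimal sketch of the usual wiring, assuming it goes through the `deepspeed` field of the training arguments (the concrete values only mirror the launch command below and are otherwise assumptions); passing the config this way is also what lets the Trainer resolve the "auto" entries in ds_config.json:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: attach the DeepSpeed config via the `deepspeed` field so the
# Trainer fills every "auto" entry (batch size, lr, scheduler steps, ...)
# from these arguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="experiments/llama",    # assumption: mirrors --output_dir
    deepspeed="ds_config.json",
    per_device_train_batch_size=2,     # mirrors --bs 2
    learning_rate=1e-4,                # mirrors --lr 1e-4
    num_train_epochs=3,                # mirrors --epoch 3
    bf16=True,                         # assumption: matches the bf16 config block
    report_to="none",
)
```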
and run with the DeepSpeed launcher:
```bash
deepspeed --num_gpus=8 main_central_llama_acclerator.py \
--data_root gpt \
--model ./pretrained-models/llama2-7b \
--epoch 3 --lr 1e-4 \
--user_msg llama2_all0.1 \
--bs 2 --eval_bs 1 --input_len 1536 --output_len 1664 \
--transform_axis --warmup_ratio 0.03 \
--all_data 0.01 \
--eval_subset dataset/general_gpt \
--output_dir experiments/llama \
--deepspeed ds_config.json
```
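A minimal sketch, purely as an assumption about how one might debug this, of extra logging that often narrows down where a multi-GPU hang happens; these variables have to be set before torch.distributed / DeepSpeed create their NCCL communicators, e.g. at the very top of the training script or exported before `deepspeed` is invoked:

```python
import os

# Per-rank NCCL activity (init, collectives, transport selection).
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Extra consistency checks and logging for torch.distributed collectives.
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
```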
Same issue here; it gets stuck at the beginning of training:
```
Loading extension module cpu_adam...
Time to load cpu_adam op: 37.60238528251648 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 37.519699811935425 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 37.619828939437866 seconds
Parameter Offload: Total persistent parameters: 11800576 in 417 params
0%| | 0/6 [00:00<?, ?it/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
0%| | 0/6 [00:02<?, ?it/s]
0%| | 0/12 [00:00<?, ?it/s]
```
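A minimal sketch (the print and its placement are assumptions) of checking which rank stalls; if one rank's line never appears right before trainer.train(), that rank is the one the others are waiting for:

```python
import torch.distributed as dist

# Print from every rank just before entering the training loop.
if dist.is_available() and dist.is_initialized():
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} entering training loop",
          flush=True)
```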
Any solution please! @mrwyattii

Same question.