stanford_alpaca
Training bug for 13b, 30b, and 65b
Has anyone been able to finetune any of the models larger than 7b successfully? I'm training on 8 A100s with 80GB of memory each, which should be more than enough.
The problem I'm running into is that the first logged loss is massive (~1e5) and every loss after that is 0. Not sure how to fix this or what is causing it, since the 7b model trains fine. I'm launching training with the DeepSpeed launcher.
Here's an example of the output when training the 65b model:
0%| | 0/82 [00:00<?, ?it/s]
1%| | 1/82 [02:38<3:34:18, 158.74s/it]
2%|▏ | 2/82 [04:39<3:01:49, 136.37s/it]
4%|▎ | 3/82 [06:40<2:50:13, 129.28s/it]
5%|▍ | 4/82 [08:39<2:43:07, 125.48s/it]
6%|▌ | 5/82 [10:39<2:38:05, 123.19s/it]
7%|▋ | 6/82 [12:39<2:34:54, 122.30s/it]
9%|▊ | 7/82 [14:39<2:31:49, 121.46s/it]
10%|▉ | 8/82 [16:38<2:28:55, 120.75s/it]
11%|█ | 9/82 [18:38<2:26:39, 120.54s/it]
12%|█▏ | 10/82 [20:38<2:24:23, 120.33s/it]
{'loss': 121486.8, 'learning_rate': 0.0, 'epoch': 0.02}
12%|█▏ | 10/82 [20:38<2:24:23, 120.33s/it]
13%|█▎ | 11/82 [22:38<2:22:06, 120.10s/it]
15%|█▍ | 12/82 [24:38<2:20:08, 120.12s/it]
16%|█▌ | 13/82 [26:38<2:18:11, 120.17s/it]
17%|█▋ | 14/82 [28:38<2:16:09, 120.15s/it]
18%|█▊ | 15/82 [30:39<2:14:13, 120.21s/it]
20%|█▉ | 16/82 [32:39<2:12:09, 120.15s/it]
21%|██ | 17/82 [34:38<2:09:52, 119.89s/it]
22%|██▏ | 18/82 [36:37<2:07:41, 119.71s/it]
23%|██▎ | 19/82 [38:36<2:05:26, 119.47s/it]
24%|██▍ | 20/82 [40:36<2:03:32, 119.55s/it]
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}
24%|██▍ | 20/82 [40:36<2:03:32, 119.55s/it]
26%|██▌ | 21/82 [42:42<2:03:38, 121.61s/it]
27%|██▋ | 22/82 [44:43<2:01:21, 121.36s/it]
28%|██▊ | 23/82 [46:42<1:58:48, 120.81s/it]
29%|██▉ | 24/82 [48:42<1:56:19, 120.34s/it]
30%|███ | 25/82 [50:41<1:54:01, 120.03s/it]
32%|███▏ | 26/82 [52:39<1:51:33, 119.53s/it]
33%|███▎ | 27/82 [54:39<1:49:29, 119.44s/it]
34%|███▍ | 28/82 [56:38<1:47:29, 119.43s/it]
35%|███▌ | 29/82 [58:37<1:45:17, 119.20s/it]
37%|███▋ | 30/82 [1:00:37<1:43:29, 119.42s/it]
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.07}
My finetuning arguments:
--model_name_or_path /home/ashaw8/compute/$MODEL_DIR/$MODEL_NAME \
--data_path ./alpaca_data.json \
--run_name $RUN_NAME \
--bf16 True \
--output_dir $OUTPUT_DIR \
--logging_dir $LOGGING_DIR \
--num_train_epochs 0.2 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--evaluation_strategy no \
--save_strategy no \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--warmup_steps 50 \
--lr_scheduler_type linear \
--weight_decay 0.1 \
--deepspeed ./configs/default_offload_opt_param.json \
--tf32 True \
--logging_strategy steps \
--logging_steps 10 \
--report_to wandb
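In case it helps with reproducing, here is a rough Python sketch of how I believe these flags map onto transformers.TrainingArguments. The string values below are placeholders standing in for my $VARS, and --model_name_or_path / --data_path are omitted because they go to the script's own argument classes rather than TrainingArguments:

# Sketch only: placeholder paths/names, assumes a transformers version that
# accepts these argument names (the ones the flags above correspond to).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",              # placeholder for $OUTPUT_DIR
    logging_dir="logs",               # placeholder for $LOGGING_DIR
    run_name="alpaca-65b",            # placeholder for $RUN_NAME
    bf16=True,                        # the flag under suspicion in this thread
    num_train_epochs=0.2,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    evaluation_strategy="no",
    save_strategy="no",
    save_steps=500,
    save_total_limit=1,
    learning_rate=1e-5,
    adam_beta1=0.9,
    adam_beta2=0.95,
    warmup_steps=50,
    lr_scheduler_type="linear",
    weight_decay=0.1,
    deepspeed="./configs/default_offload_opt_param.json",
    tf32=True,
    logging_strategy="steps",
    logging_steps=10,
    report_to="wandb",
)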
Same here. Could you share how you eventually solved this? Thanks!
Are you able to train with batch size 4, as in the README?
Haven't solved it yet, but switching from the Hugging Face Trainer to PyTorch Lightning might solve the issue. If I can get it to work, I'll post a link to a repo with everything set up.
Also, I switched to a different machine with V100s instead of A100s, and 13b works there. It could also be a version difference, because I can use Docker containers on the V100 machine but only venvs on the A100 machine (the admins are stingy about root access).
Also, yes, I'm able to train with a batch size of 4, but that doesn't make a difference.
It seems like this might be a related issue:
https://github.com/huggingface/transformers/issues/14531
I turned off bf16 and it fixed my issue with 13b and 30b. Without bf16 I can't fit 65b onto my GPUs, so I haven't tested that one yet.
Any idea why bf16 is causing this problem? I think it's preventing the optimizer from stepping, but I have no idea why.
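One way I've been trying to confirm the "optimizer isn't stepping" theory is a small callback that watches a parameter's norm across steps. This is just a sketch of my own, not anything from the Alpaca repo, and it assumes the full parameter is visible on the rank; under ZeRO-3 parameter partitioning you'd have to gather it first (e.g. with deepspeed.zero.GatheredParameters) for the check to mean anything:

# Diagnostic sketch: log whether a trainable parameter actually changes between
# optimizer steps. If its norm never moves, the optimizer is likely skipping
# updates (or the gradients are all zero/non-finite).
from transformers import TrainerCallback

class StepCheckCallback(TrainerCallback):
    def __init__(self):
        self.prev_norm = None

    def on_step_end(self, args, state, control, model=None, **kwargs):
        # Pick the first trainable parameter and compare its norm to last step.
        # NOTE: with ZeRO-3 this tensor may be an empty partition on this rank.
        param = next(p for p in model.parameters() if p.requires_grad)
        norm = param.detach().float().norm().item()
        if self.prev_norm is not None and norm == self.prev_norm:
            print(f"step {state.global_step}: parameter norm unchanged "
                  f"({norm:.6f}) -- optimizer may be skipping updates")
        self.prev_norm = norm
        return control

# Hypothetical usage: trainer.add_callback(StepCheckCallback())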