
Sharing training log of 7B model on A6000 x 4

Open SeungyounShin opened this issue 1 year ago • 10 comments

Attachment: Mar20_05-17-08_0c56f6779a08.csv

[Screenshot: Screen Shot 2023-03-21 at 12 10 42 PM]

Training command

torchrun --nproc_per_node=4 --master_port=34322 train.py \
    --model_name_or_path {your-hf-lamma-path} \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir {your-output-dir}  \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True
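
Note: newer transformers releases deprecate --fsdp_transformer_layer_cls_to_wrap in favour of --fsdp_config (see the FutureWarning in the log further down). A rough sketch of the equivalent, assuming a recent transformers version; the exact JSON key name varies between releases, so treat it as a starting point:

# hypothetical fsdp_config.json (key name differs across transformers versions)
cat > fsdp_config.json <<'EOF'
{
  "fsdp_transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"]
}
EOF

# then replace --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' with:
#   --fsdp_config fsdp_config.json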

SeungyounShin avatar Mar 21 '23 03:03 SeungyounShin

@SeungyounShin How long does it take? Can you also share the training logs?

I am blocked at this step:

root@5d83a2b86756:~/stanford_alpaca# torchrun --nproc_per_node=4 --master_port=3192 train.py     --model_name_or_path /root/models/llama_7B     --data_path ./alpaca_data.json     --bf16 True     --output_dir ./output     --num_train_epochs 1     --per_device_train_batch_size 4     --per_device_eval_batch_size 4     --gradient_accumulation_steps 8     --evaluation_strategy "no"     --save_strategy "steps"     --save_steps 2000     --save_total_limit 1     --learning_rate 2e-5     --weight_decay 0.     --warmup_ratio 0.03     --lr_scheduler_type "cosine"     --logging_steps 1     --fsdp "full_shard auto_wrap"     --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'     --tf32 True
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.30s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.36s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.39s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.40s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Loading data...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Formatting inputs...
WARNING:root:Loading data...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Tokenizing inputs... This may take some time...
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
  0%|                                                                                                                                                                                            | 0/406 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(

It seems the run asked me to choose a wandb option, and it got stuck after I typed 3. No progress.

nvidia-smi
Tue Mar 21 05:42:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   40C    P0    72W / 300W |  10391MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:23:00.0 Off |                    0 |
| N/A   41C    P0    75W / 300W |  10399MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   40C    P0    70W / 300W |  10399MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:E1:00.0 Off |                    0 |
| N/A   43C    P0    76W / 300W |  10359MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Jeffwan avatar Mar 21 '23 05:03 Jeffwan

I also posted the CSV log file: https://github.com/tatsu-lab/stanford_alpaca/files/11024692/Mar20_05-17-08_0c56f6779a08.csv

It takes approx. 24 hours (about a day).

This is strange. You are using way better GPUs than mine. As you mentioned, wandb could be the problem.
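
If wandb is the suspect, a quick way to rule it out (a sketch; exact behavior depends on your transformers and wandb versions) is to disable reporting before launching:

# disable W&B entirely for this run
export WANDB_MODE=disabled        # or: export WANDB_DISABLED=true on older transformers
# ...or add --report_to none to the train.py arguments so the Trainer skips wandb setup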

SeungyounShin avatar Mar 21 '23 05:03 SeungyounShin

I also posted the CSV log file: https://github.com/tatsu-lab/stanford_alpaca/files/11024692/Mar20_05-17-08_0c56f6779a08.csv

It takes approx. 24 hours (about a day).

This is strange. You are using way better GPUs than mine. As you mentioned, wandb could be the problem.

I am using 8 × V100 with the same training command as you, but it returns OOM. Why?

danwei1992 avatar Mar 21 '23 06:03 danwei1992

--bf16 True
--per_device_train_batch_size 1
--per_device_eval_batch_size 1
--gradient_accumulation_steps 1

@danwei1992

Make sure these 4 flags are set correctly.

I am using

I apologize for the mistake I made in writing the model number; I meant to write A6000 instead of V100. A V100 with 32 GB of memory will not be sufficient to run batch size 1, as it requires about 48 GB, which exceeds the capacity of the V100.
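
If you have to make it fit on smaller GPUs, here is a hedged sketch of flags that may reduce per-GPU memory (availability and behavior depend on your transformers/PyTorch versions, and full fine-tuning of 7B may still not fit in 32 GB):

# additions to the torchrun command above; not a guaranteed fix
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--gradient_checkpointing True \
--fsdp "full_shard auto_wrap offload"
# "offload" moves the sharded parameters/gradients to CPU RAM: slower, but lower GPU memory

Separately, --bf16 and --tf32 need Ampere-class GPUs; on V100 the Trainer will likely refuse them or fall back to fp32, which uses even more memory.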

SeungyounShin avatar Mar 21 '23 07:03 SeungyounShin

Hello bro,

can you help me a little bit? I don't know why this happens: https://github.com/tatsu-lab/stanford_alpaca/issues/116

ZeyuTeng96 avatar Mar 21 '23 13:03 ZeyuTeng96

Where can I find the pre-generated model?

eric0fw avatar Mar 23 '23 23:03 eric0fw

I wonder if the code with the same config can be run directly on 2 × A6000?

chaojiewang94 avatar Mar 30 '23 03:03 chaojiewang94

I also posted the CSV log file: https://github.com/tatsu-lab/stanford_alpaca/files/11024692/Mar20_05-17-08_0c56f6779a08.csv It takes approx. 24 hours (about a day). This is strange. You are using way better GPUs than mine. As you mentioned, wandb could be the problem.

I am using 8 × V100 with the same training command as you, but it returns OOM. Why?

Hey, bro. I used the same GPUs as yours (4 × A6000), and I wonder why fine-tuning 7B gives OOM when I use only 2 A6000s. 7B seems small enough compared with the memory of an A6000, so I don't know what causes this OOM problem.
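
A rough back-of-the-envelope for why a "small" 7B model can still OOM when fully fine-tuned with AdamW in mixed precision (approximate and implementation-dependent):

bf16 weights                      ~2 bytes/param
bf16 gradients                    ~2 bytes/param
fp32 master weights + Adam m, v   ~12 bytes/param
total                             ~16 bytes/param ≈ 112 GB for 7B, before activations

FSDP full_shard splits this state across GPUs: on 4 × 48 GB A6000 that is roughly 28 GB per GPU plus activations and all-gather buffers, which fits; on 2 × A6000 it is roughly 56 GB per GPU, which already exceeds 48 GB before any activations.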

luoxindi avatar Mar 31 '23 08:03 luoxindi

I also posted the CSV log file: https://github.com/tatsu-lab/stanford_alpaca/files/11024692/Mar20_05-17-08_0c56f6779a08.csv It takes approx. 24 hours (about a day). This is strange. You are using way better GPUs than mine. As you mentioned, wandb could be the problem.

I am using 8 × V100 with the same training command as you, but it returns OOM. Why?

Has this been solved? I have the same problem as you.

qwjaskzxl avatar Apr 27 '23 15:04 qwjaskzxl

I used the same command on 4 × 48 GB A6000 and got an OOM error.

yyhycx avatar Jul 05 '23 08:07 yyhycx