stanford_alpaca
Sharing training log of 7B model on A6000 x 4
Mar20_05-17-08_0c56f6779a08.csv

Training command
torchrun --nproc_per_node=4 --master_port=34322 train.py \
--model_name_or_path {your-hf-llama-path} \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir {your-output-dir} \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True
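For reference, the effective batch size of the command above is nproc_per_node x per_device_train_batch_size x gradient_accumulation_steps = 4 x 1 x 1 = 4 samples per optimizer step (the command quoted further down in this thread uses 4 x 4 x 8 = 128). A hedged way to raise it without using more GPU memory is to increase only the accumulation steps, e.g.:
--per_device_train_batch_size 1
--gradient_accumulation_steps 32    # 4 GPUs x 1 x 32 = 128 samples per optimizer step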
@SeungyounShin How long does it take? Can you also share the training logs?
I am blocked at this step:
root@5d83a2b86756:~/stanford_alpaca# torchrun --nproc_per_node=4 --master_port=3192 train.py --model_name_or_path /root/models/llama_7B --data_path ./alpaca_data.json --bf16 True --output_dir ./output --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' --tf32 True
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.30s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.36s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.39s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.40s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Loading data...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Formatting inputs...
WARNING:root:Loading data...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Tokenizing inputs... This may take some time...
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
0%| | 0/406 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
It seems it asked me to choose a wandb option, and it's stuck after I typed 3. No progress.
nvidia-smi
Tue Mar 21 05:42:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:01:00.0 Off | 0 |
| N/A 40C P0 72W / 300W | 10391MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:23:00.0 Off | 0 |
| N/A 41C P0 75W / 300W | 10399MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80G... On | 00000000:41:00.0 Off | 0 |
| N/A 40C P0 70W / 300W | 10399MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80G... On | 00000000:E1:00.0 Off | 0 |
| N/A 43C P0 76W / 300W | 10359MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
I also posted the csv log file: https://github.com/tatsu-lab/stanford_alpaca/files/11024692/Mar20_05-17-08_0c56f6779a08.csv
It takes approx. 24 hours (about a day).
This is strange. You are using way better GPUs than mine. As you mentioned, wandb could be the problem.
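If the run hangs at the interactive wandb prompt, one hedged workaround (not part of the original recipe, just standard wandb/transformers knobs) is to disable or pre-configure wandb before launching:
# Pick one; each avoids the interactive prompt (sketch, not from the original command).
export WANDB_MODE=offline      # log locally; sync later with `wandb sync` if wanted
# or
export WANDB_DISABLED=true     # turn off the wandb integration in transformers entirely
# or append `--report_to none` to the torchrun command so the Trainer skips wandb.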
I am using 8 * V100 with the same training command as you, but it returns OOM. Why?
--bf16 True
--per_device_train_batch_size 1
--per_device_eval_batch_size 1
--gradient_accumulation_steps 1
@danwei1992
Make sure these 4 arguments are correct.
I am using
- docker image: hf:latest
- CUDA Version: 11.7
I apologize for the mistake I made in writing the model number; I meant to write A6000 instead of V100. A V100 with 32 GB of memory will not be sufficient to run batch size 1, as it requires 48 GB of memory, which exceeds the capacity of the V100.
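For a rough sense of scale (a back-of-the-envelope estimate, not a measurement from this thread): full fine-tuning a 7B model with AdamW keeps weights, gradients, and optimizer state amounting to very roughly 14-18 bytes per parameter, i.e. on the order of 100-130 GB of state before activations. FSDP full_shard splits that across the GPUs, so 4 x 48 GB A6000s can fit batch size 1, while fewer or smaller cards get tight. Also note that --bf16 and --tf32 need Ampere-class GPUs (A100/A6000) and are not supported on V100. If memory is still the blocker, flags along these lines may help (a sketch; exact availability depends on your transformers version):
--gradient_checkpointing True                # recompute activations instead of storing them
--fsdp "full_shard auto_wrap offload"        # "offload" moves sharded params/optimizer state to CPU (slower, but lighter on GPU memory)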
Hello bro,
can you help me a little bit? I do not know why: https://github.com/tatsu-lab/stanford_alpaca/issues/116
Where can I find the pre-generated model?
I wonder if the code with the same config can be run directly on 2 * A6000?
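A hedged sketch of the 2 x A6000 case (untested here; with half the GPUs, each card holds half instead of a quarter of the FSDP-sharded state, so it may still OOM unless you add offloading or gradient checkpointing; doubling gradient_accumulation_steps keeps the effective batch size at 4):
torchrun --nproc_per_node=2 --master_port=34322 train.py \
--model_name_or_path {your-hf-llama-path} \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir {your-output-dir} \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--gradient_checkpointing True \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap offload" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True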
I also posted the csv log file (https://github.com/tatsu-lab/stanford_alpaca/files/11024692/Mar20_05-17-08_0c56f6779a08.csv)! It takes approx. 24 hours (about a day). This is strange. You are using way better GPUs than mine. As you mentioned, wandb could be the problem.
I am using 8 * V100 with the same training command as you, but it returns OOM. Why?
Hey, bro. I used the same GPUs as yours (4 * A6000). I wonder why fine-tuning 7B causes OOM on only 2 A6000s? 7B should be small enough compared with the memory of an A6000, so I don't know what caused this OOM problem.
I also posted the csv log file (https://github.com/tatsu-lab/stanford_alpaca/files/11024692/Mar20_05-17-08_0c56f6779a08.csv)! It takes approx. 24 hours (about a day). This is strange. You are using way better GPUs than mine. As you mentioned, wandb could be the problem.
I am using 8 * V100 with the same training command as you, but it returns OOM. Why?
Has this been solved? I have the same problem as you.
I used the same command on 4 x 48 GB A6000 and got an OOM error.