stanford_alpaca
Blocked after typing the wandb options
I am blocked at this step. It asked me to choose a wandb option, and it's stuck after I type 3. No progress. Is this expected?
root@5d83a2b86756:~/stanford_alpaca# torchrun --nproc_per_node=4 --master_port=3192 train.py --model_name_or_path /root/models/llama_7B --data_path ./alpaca_data.json --bf16 True --output_dir ./output --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' --tf32 True
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.30s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.36s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.39s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.40s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Loading data...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Formatting inputs...
WARNING:root:Loading data...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Tokenizing inputs... This may take some time...
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
0%| | 0/406 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
nvidia-smi
Tue Mar 21 05:42:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:01:00.0 Off | 0 |
| N/A 40C P0 72W / 300W | 10391MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:23:00.0 Off | 0 |
| N/A 41C P0 75W / 300W | 10399MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80G... On | 00000000:41:00.0 Off | 0 |
| N/A 40C P0 70W / 300W | 10399MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80G... On | 00000000:E1:00.0 Off | 0 |
| N/A 43C P0 76W / 300W | 10359MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Have you solved this problem?
@danwei1992 Not yet. Did you encounter the same problem?
No, I also chose 3, but I got an OOM error with 8x V100 32G.
Hi, did you solve this problem? I met the same problem. I'm using a machine with 8x P100 16G, and it did not hit OOM.
OOM error with 8*A100 40G... :(
@VVNMA Not yet. I will give it another try tomorrow.
@VVNMA and other users who have the exact same issue as me: here's an update.
Note: the OOM errors may be a separate issue; let's discuss them in new threads.
It seems it was a wandb problem. I created an account, logged in, and set the wandb project env var, and the training script now works as expected!
echo 'export WANDB_PROJECT=alpaca' >> ~/.bashrc
source ~/.bashrc
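If you'd rather not create a W&B account at all, it should also be possible to skip the interactive prompt entirely. I haven't re-run my job this way, so treat it as an untested alternative rather than a verified fix:

export WANDB_MODE=disabled   # make wandb a no-op in every process

or pass --report_to none to train.py so the HF Trainer does not initialize any logging integration in the first place.

With an account (after wandb login), the same command now trains normally: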
root@c35059eec1c7:~/stanford_alpaca# torchrun --nproc_per_node=4 --master_port=3192 train.py \
> --model_name_or_path /root/llama-7b-hf \
> --data_path ./alpaca_data.json \
> --bf16 True \
> --output_dir ./output \
> --num_train_epochs 1 \
> --per_device_train_batch_size 4 \
> --per_device_eval_batch_size 4 \
> --gradient_accumulation_steps 8 \
> --evaluation_strategy "no" \
> --save_strategy "steps" \
> --save_steps 2000 \
> --save_total_limit 1 \
> --learning_rate 2e-5 \
> --weight_decay 0. \
> --warmup_ratio 0.03 \
> --lr_scheduler_type "cosine" \
> --logging_steps 1 \
> --fsdp "full_shard auto_wrap" \
> --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
> --tf32 True
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.04s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.06s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.24s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:22<00:00, 11.29s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
wandb: Currently logged in as: jeffwan. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.14.0
wandb: Run data is saved locally in /root/stanford_alpaca/wandb/run-20230328_185005-ia7vez3w
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run vocal-flower-2
wandb: ⭐️ View project at https://wandb.ai/jeffwan/alpaca
wandb: 🚀 View run at https://wandb.ai/jeffwan/alpaca/runs/ia7vez3w
0%| | 0/406 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
{'loss': 1.5039, 'learning_rate': 1.5384615384615387e-06, 'epoch': 0.0}
{'loss': 1.5785, 'learning_rate': 3.0769230769230774e-06, 'epoch': 0.0}
{'loss': 1.2977, 'learning_rate': 4.615384615384616e-06, 'epoch': 0.01}
{'loss': 1.1524, 'learning_rate': 6.153846153846155e-06, 'epoch': 0.01}
{'loss': 1.2287, 'learning_rate': 7.692307692307694e-06, 'epoch': 0.01}
1%|██▎ | 5/406 [01:48<2:16:05, 20.36s/it]
For reference:
I met the same problem. It turned out not to be a wandb issue but an NCCL communication issue.
export NCCL_P2P_DISABLE=1
solved my problem.
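In case it helps anyone debug further: a quick way to check whether the P2P transport is really the culprit (standard NCCL environment variables only; not verified on the exact machines above) is to enable NCCL's own logging alongside the P2P switch and then re-run the same torchrun command:

export NCCL_DEBUG=INFO       # print NCCL topology and transport decisions to stderr
export NCCL_P2P_DISABLE=1    # force NCCL to fall back to shared-memory/PCIe copies

If the run only gets past the first all-gather when P2P is disabled, the hang is in the GPU-to-GPU path (common in containers or on hosts where peer access is broken) rather than in wandb.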