stanford_alpaca

Blocked after typing the wandb options

Open · Jeffwan opened this issue 1 year ago • 6 comments

I am blocked at this step. It seems it asked me to choose a wandb option, and it's stuck after I type 3. No progress. Is this expected?

root@5d83a2b86756:~/stanford_alpaca# torchrun --nproc_per_node=4 --master_port=3192 train.py     --model_name_or_path /root/models/llama_7B     --data_path ./alpaca_data.json     --bf16 True     --output_dir ./output     --num_train_epochs 1     --per_device_train_batch_size 4     --per_device_eval_batch_size 4     --gradient_accumulation_steps 8     --evaluation_strategy "no"     --save_strategy "steps"     --save_steps 2000     --save_total_limit 1     --learning_rate 2e-5     --weight_decay 0.     --warmup_ratio 0.03     --lr_scheduler_type "cosine"     --logging_steps 1     --fsdp "full_shard auto_wrap"     --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'     --tf32 True
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
/root/transformers/src/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.30s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.36s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.39s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.40s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Loading data...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Formatting inputs...
WARNING:root:Loading data...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Tokenizing inputs... This may take some time...
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
  0%|                                           | 0/406 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(

nvidia-smi
Tue Mar 21 05:42:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   40C    P0    72W / 300W |  10391MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:23:00.0 Off |                    0 |
| N/A   41C    P0    75W / 300W |  10399MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   40C    P0    70W / 300W |  10399MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:E1:00.0 Off |                    0 |
| N/A   43C    P0    76W / 300W |  10359MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Jeffwan · Mar 21 '23

Have you solved this problem?

danwei1992 · Mar 21 '23

@danwei1992 Not yet. Did you encounter the same problem?

Jeffwan · Mar 21 '23

> @danwei1992 Not yet. Did you encounter the same problem?

No, I also chose 3, but I got an OOM problem with 8*V100 32G.

danwei1992 · Mar 21 '23

Hi, did you solve this problem? I met the same problem as you. I use a machine with 8*P100 16G, and it did not cause OOM.

VVNMA · Mar 21 '23

OOM error with 8*A100 40G... :(

goldmermaid · Mar 23 '23

@VVNMA Not yet. I will give it another try tomorrow.

Jeffwan · Mar 23 '23

@VVNMA and other users who have the exact same issue as me: here's the update.

Note: the OOM errors could be a separate issue; let's discuss them in new threads.

It seems to be a wandb problem. I created an account, set the wandb environment variables, and the training script now works as expected!

echo 'export WANDB_PROJECT=alpaca' >> ~/.bashrc 
source ~/.bashrc
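
Note: besides the project name, wandb also needs credentials so it stops prompting. The log below shows `wandb: Currently logged in as: jeffwan`, i.e. a one-time login already happened before this run; roughly (the API key placeholder is just illustrative):

wandb login                      # paste your API key when prompted
# or non-interactively:
export WANDB_API_KEY=<your-api-key>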
root@c35059eec1c7:~/stanford_alpaca# torchrun --nproc_per_node=4 --master_port=3192 train.py \
>     --model_name_or_path /root/llama-7b-hf \
>     --data_path ./alpaca_data.json \
>     --bf16 True \
>     --output_dir ./output \
>     --num_train_epochs 1 \
>     --per_device_train_batch_size 4 \
>     --per_device_eval_batch_size 4 \
>     --gradient_accumulation_steps 8 \
>     --evaluation_strategy "no" \
>     --save_strategy "steps" \
>     --save_steps 2000 \
>     --save_total_limit 1 \
>     --learning_rate 2e-5 \
>     --weight_decay 0. \
>     --warmup_ratio 0.03 \
>     --lr_scheduler_type "cosine" \
>     --logging_steps 1 \
>     --fsdp "full_shard auto_wrap" \
>     --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
>     --tf32 True
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************

/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.04s/it]
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.06s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.24s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:22<00:00, 11.29s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
wandb: Currently logged in as: jeffwan. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.14.0
wandb: Run data is saved locally in /root/stanford_alpaca/wandb/run-20230328_185005-ia7vez3w
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run vocal-flower-2
wandb: ⭐️ View project at https://wandb.ai/jeffwan/alpaca
wandb: 🚀 View run at https://wandb.ai/jeffwan/alpaca/runs/ia7vez3w
  0%|                                                                                                                                                                                                       | 0/406 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
{'loss': 1.5039, 'learning_rate': 1.5384615384615387e-06, 'epoch': 0.0}
{'loss': 1.5785, 'learning_rate': 3.0769230769230774e-06, 'epoch': 0.0}
{'loss': 1.2977, 'learning_rate': 4.615384615384616e-06, 'epoch': 0.01}
{'loss': 1.1524, 'learning_rate': 6.153846153846155e-06, 'epoch': 0.01}
{'loss': 1.2287, 'learning_rate': 7.692307692307694e-06, 'epoch': 0.01}
  1%|██▎                                                                                                                                                                                          | 5/406 [01:48<2:16:05, 20.36s/it]
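
Side note: if you'd rather not create a W&B account at all, it should also be possible to skip the interactive prompt by disabling reporting before launch. A minimal sketch, not verified with this script:

export WANDB_MODE=disabled   # tell wandb not to track anything
# alternatively, add --report_to none to the Trainer arguments in the torchrun command above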

Jeffwan · Mar 28 '23

For reference: I met the same problem. It turned out not to be a wandb issue but an NCCL communication issue. Setting export NCCL_P2P_DISABLE=1 solved my problem.
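
In case it helps others chasing the same hang, a rough recipe (values are environment-specific, not a verified one-size-fits-all fix):

export NCCL_DEBUG=INFO        # print NCCL setup logs to see where communication stalls
export NCCL_P2P_DISABLE=1     # disable peer-to-peer GPU transfers, which were the culprit for me
torchrun --nproc_per_node=4 ... train.py ...   # same launch command as above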

BinWang28 · Apr 20 '23