DeepSpeedExamples
Single-node multi-GPU training failed
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
---=== Running Step 1 ===---
Running:
bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_node/run_13b.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b
GPU usage rate:
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ nvidia-smi
Sat Apr 15 15:02:38 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:4F:00.0 Off |                    0 |
| N/A   35C    P0    71W / 300W |   1015MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:52:00.0 Off |                    0 |
| N/A   36C    P0    69W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  Off  | 00000000:56:00.0 Off |                    0 |
| N/A   35C    P0    67W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  Off  | 00000000:57:00.0 Off |                    0 |
| N/A   37C    P0    69W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80G...  Off  | 00000000:CE:00.0 Off |                    0 |
| N/A   36C    P0    69W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100 80G...  Off  | 00000000:D1:00.0 Off |                    0 |
| N/A   37C    P0    70W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100 80G...  Off  | 00000000:D5:00.0 Off |                    0 |
| N/A   38C    P0    70W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100 80G...  Off  | 00000000:D6:00.0 Off |                    0 |
| N/A   40C    P0    74W / 300W |    999MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
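To watch whether the GPUs stay in this state (100% utilization with roughly 1 GiB in use on each card), the same fields can be polled in machine-readable form. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and every queried field comes back as a plain number; the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration.

```python
import subprocess

# nvidia-smi's CSV query mode: one row per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_csv(text):
    """Parse 'index, util, used, total' CSV rows into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        # int() tolerates the space nvidia-smi puts after each comma.
        index, util, used, total = (int(field) for field in line.split(","))
        rows.append({"index": index, "util_pct": util,
                     "used_mib": used, "total_mib": total})
    return rows

def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(out.stdout)
```

Calling `query_gpus()` every few seconds while `training.log` is being tailed would show whether memory use ever grows beyond the initial allocation.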
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ tail -f output/actor-models/13b/training.log
[2023-04-15 14:56:08,436] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-15 14:56:08,601] [INFO] [runner.py:540:main] cmd = /home/menkeyi/.conda/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP --data_split 2,4,4 --model_name_or_path facebook/opt-13b --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --lora_dim 128 --lora_module_name decoder.layers. --deepspeed --output_dir /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b
[2023-04-15 14:56:13,294] [INFO] [launch.py:222:main] 0 NCCL_P2P_LEVEL=SYS
[2023-04-15 14:56:13,294] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-04-15 14:56:13,294] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-04-15 14:56:13,294] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-04-15 14:56:13,294] [INFO] [launch.py:247:main] dist_world_size=8
[2023-04-15 14:56:13,294] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-04-15 14:56:28,848] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
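The `--world_info` value in the launcher command above appears to be base64-encoded JSON, so decoding it is a quick way to confirm which GPUs the launcher was told to use. This is a minimal sketch using only the standard library; the function name `decode_world_info` is illustrative, and the choice of the URL-safe decoder is an assumption (the standard alphabet gives the same result for this string).

```python
import base64
import json

# Value copied from the --world_info argument in the log above.
WORLD_INFO = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"

def decode_world_info(encoded):
    """Decode a base64-encoded JSON mapping of host name -> GPU indices."""
    return json.loads(base64.urlsafe_b64decode(encoded))

world_info = decode_world_info(WORLD_INFO)
print(world_info)  # {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
```

The decoded mapping matches the `WORLD INFO DICT` line that `launch.py` logs, so all eight local GPUs were selected as expected.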
gpu1:5025:5384 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gpu1:5022:5377 [1] NCCL INFO comm 0x4438f540 rank 1 nranks 8 cudaDev 1 busId 52000 - Init COMPLETE
gpu1:5024:5388 [3] NCCL INFO comm 0x4495bb80 rank 3 nranks 8 cudaDev 3 busId 57000 - Init COMPLETE
gpu1:5021:5376 [0] NCCL INFO comm 0x453b9880 rank 0 nranks 8 cudaDev 0 busId 4f000 - Init COMPLETE
gpu1:5023:5381 [2] NCCL INFO comm 0x44c39ee0 rank 2 nranks 8 cudaDev 2 busId 56000 - Init COMPLETE
gpu1:5025:5384 [4] NCCL INFO comm 0x43b40770 rank 4 nranks 8 cudaDev 4 busId ce000 - Init COMPLETE
gpu1:5027:5380 [6] NCCL INFO comm 0x44103030 rank 6 nranks 8 cudaDev 6 busId d5000 - Init COMPLETE
gpu1:5026:5379 [5] NCCL INFO comm 0x452d84a0 rank 5 nranks 8 cudaDev 5 busId d1000 - Init COMPLETE
gpu1:5028:5378 [7] NCCL INFO comm 0x443814b0 rank 7 nranks 8 cudaDev 7 busId d6000 - Init COMPLETE
gpu1:5021:5386 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:5023:5387 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:5022:5382 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:5024:5391 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:5026:5383 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:5028:5385 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:5025:5390 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:5027:5389 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
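To check whether the warning comes from every rank or only some of them, the log can be grouped by process ID. This is a minimal sketch, assuming the NCCL log prefix has the form `host:pid:tid [device]`; the function name `pids_with_warning` is illustrative.

```python
import re

# Assumed NCCL prefix: host:pid:tid [device] <source location> NCCL WARN <message>
WARN_RE = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):(?P<tid>\d+) \[\d+\] .*NCCL WARN (?P<msg>.*)$"
)

def pids_with_warning(log_text, needle="NET/IB : Got async event"):
    """Return the sorted process IDs that logged a warning containing `needle`."""
    pids = set()
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match and needle in match.group("msg"):
            pids.add(int(match.group("pid")))
    return sorted(pids)

# Two of the warning lines from the log above, used as sample input.
sample = """\
gpu1:5021:5386 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:5023:5387 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
"""
print(pids_with_warning(sample))  # [5021, 5023]
```

Run over the eight warning lines above, this yields PIDs 5021 through 5028, the same eight processes that reported `Init COMPLETE`.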