
Does not proceed after accelerator.prepare

[Open] zzoliman opened this issue 2 years ago · 1 comment

Hi, I am trying to train a language model with the run_mlm_no_trainer.py script on multiple GPUs. However, the script gets stuck in the accelerator.prepare call (below is the line where it hangs).

```python
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
```

Actually, it got stuck in accelerator.wait_for_everyone() as well; after I removed that call, it now gets stuck in prepare() instead.

I ran with `NCCL_DEBUG=INFO` to get debugging information; below is the NCCL output produced when the prepare() line runs. Could you help me with this problem?

```
scc-f05:9002:9002 [0] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9002:9002 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9002:9002 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.14.3+cuda11.7
scc-f05:9007:9007 [5] NCCL INFO cudaDriverVersion 12000
scc-f05:9008:9008 [6] NCCL INFO cudaDriverVersion 12000
scc-f05:9003:9003 [1] NCCL INFO cudaDriverVersion 12000
scc-f05:9009:9009 [7] NCCL INFO cudaDriverVersion 12000
scc-f05:9006:9006 [4] NCCL INFO cudaDriverVersion 12000
scc-f05:9008:9008 [6] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9008:9008 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9007:9007 [5] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9003:9003 [1] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9007:9007 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9003:9003 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9009:9009 [7] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9009:9009 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9006:9006 [4] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9006:9006 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9002:9690 [0] NCCL INFO NET/IB : No device found.
scc-f05:9002:9690 [0] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9002:9690 [0] NCCL INFO Using network Socket
scc-f05:9008:9691 [6] NCCL INFO NET/IB : No device found.
scc-f05:9008:9691 [6] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9008:9691 [6] NCCL INFO Using network Socket
scc-f05:9007:9692 [5] NCCL INFO NET/IB : No device found.
scc-f05:9003:9693 [1] NCCL INFO NET/IB : No device found.
scc-f05:9003:9693 [1] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9007:9692 [5] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9003:9693 [1] NCCL INFO Using network Socket
scc-f05:9007:9692 [5] NCCL INFO Using network Socket
scc-f05:9006:9695 [4] NCCL INFO NET/IB : No device found.
scc-f05:9009:9694 [7] NCCL INFO NET/IB : No device found.
scc-f05:9006:9695 [4] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9006:9695 [4] NCCL INFO Using network Socket
scc-f05:9009:9694 [7] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9009:9694 [7] NCCL INFO Using network Socket
scc-f05:9005:9005 [3] NCCL INFO cudaDriverVersion 12000
scc-f05:9005:9005 [3] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9005:9005 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9005:9696 [3] NCCL INFO NET/IB : No device found.
scc-f05:9005:9696 [3] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9005:9696 [3] NCCL INFO Using network Socket
scc-f05:9004:9004 [2] NCCL INFO cudaDriverVersion 12000
scc-f05:9004:9004 [2] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9004:9004 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9004:9697 [2] NCCL INFO NET/IB : No device found.
scc-f05:9004:9697 [2] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9004:9697 [2] NCCL INFO Using network Socket
scc-f05:9005:9696 [3] NCCL INFO Setting affinity for GPU 3 to ffff
scc-f05:9004:9697 [2] NCCL INFO Setting affinity for GPU 2 to ffff
scc-f05:9009:9694 [7] NCCL INFO Setting affinity for GPU 7 to ffff0000
scc-f05:9006:9695 [4] NCCL INFO Setting affinity for GPU 4 to ffff0000
scc-f05:9002:9690 [0] NCCL INFO Setting affinity for GPU 0 to ffff
scc-f05:9003:9693 [1] NCCL INFO Setting affinity for GPU 1 to ffff
scc-f05:9007:9692 [5] NCCL INFO Setting affinity for GPU 5 to ffff0000
scc-f05:9008:9691 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000
scc-f05:9008:9691 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
scc-f05:9005:9696 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
scc-f05:9006:9695 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
scc-f05:9004:9697 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
scc-f05:9007:9692 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
scc-f05:9002:9690 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
scc-f05:9003:9693 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
scc-f05:9009:9694 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
scc-f05:9002:9690 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7
scc-f05:9002:9690 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
scc-f05:9004:9697 [2] NCCL INFO Channel 00/0 : 2[56000] -> 3[57000] via P2P/IPC
scc-f05:9008:9691 [6] NCCL INFO Channel 00/0 : 6[d5000] -> 7[d6000] via P2P/IPC
scc-f05:9003:9693 [1] NCCL INFO Channel 00/0 : 1[52000] -> 2[56000] via P2P/IPC
scc-f05:9004:9697 [2] NCCL INFO Channel 01/0 : 2[56000] -> 3[57000] via P2P/IPC
scc-f05:9008:9691 [6] NCCL INFO Channel 01/0 : 6[d5000] -> 7[d6000] via P2P/IPC
scc-f05:9005:9696 [3] NCCL INFO Channel 00 : 3[57000] -> 4[ce000] via SHM/direct/direct
scc-f05:9009:9694 [7] NCCL INFO Channel 00 : 7[d6000] -> 0[4f000] via SHM/direct/direct
scc-f05:9005:9696 [3] NCCL INFO Channel 01 : 3[57000] -> 4[ce000] via SHM/direct/direct
scc-f05:9003:9693 [1] NCCL INFO Channel 01/0 : 1[52000] -> 2[56000] via P2P/IPC
scc-f05:9009:9694 [7] NCCL INFO Channel 01 : 7[d6000] -> 0[4f000] via SHM/direct/direct
scc-f05:9007:9692 [5] NCCL INFO Channel 00/0 : 5[d1000] -> 6[d5000] via P2P/IPC
scc-f05:9007:9692 [5] NCCL INFO Channel 01/0 : 5[d1000] -> 6[d5000] via P2P/IPC
scc-f05:9004:9697 [2] NCCL INFO Connected all rings
scc-f05:9006:9695 [4] NCCL INFO Channel 00/0 : 4[ce000] -> 5[d1000] via P2P/IPC
scc-f05:9008:9691 [6] NCCL INFO Connected all rings
scc-f05:9004:9697 [2] NCCL INFO Channel 00/0 : 2[56000] -> 1[52000] via P2P/IPC
scc-f05:9004:9697 [2] NCCL INFO Channel 01/0 : 2[56000] -> 1[52000] via P2P/IPC
scc-f05:9008:9691 [6] NCCL INFO Channel 00/0 : 6[d5000] -> 5[d1000] via P2P/IPC
scc-f05:9008:9691 [6] NCCL INFO Channel 01/0 : 6[d5000] -> 5[d1000] via P2P/IPC
scc-f05:9006:9695 [4] NCCL INFO Channel 01/0 : 4[ce000] -> 5[d1000] via P2P/IPC
scc-f05:9002:9690 [0] NCCL INFO Channel 00/0 : 0[4f000] -> 1[52000] via P2P/IPC
scc-f05:9006:9695 [4] NCCL INFO Connected all rings
scc-f05:9007:9692 [5] NCCL INFO Connected all rings
scc-f05:9006:9695 [4] NCCL INFO Channel 00 : 4[ce000] -> 3[57000] via SHM/direct/direct
scc-f05:9007:9692 [5] NCCL INFO Channel 00/0 : 5[d1000] -> 4[ce000] via P2P/IPC
scc-f05:9006:9695 [4] NCCL INFO Channel 01 : 4[ce000] -> 3[57000] via SHM/direct/direct
scc-f05:9007:9692 [5] NCCL INFO Channel 01/0 : 5[d1000] -> 4[ce000] via P2P/IPC
scc-f05:9005:9696 [3] NCCL INFO Connected all rings
scc-f05:9007:9692 [5] NCCL INFO Connected all trees
scc-f05:9007:9692 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9002:9690 [0] NCCL INFO Channel 01/0 : 0[4f000] -> 1[52000] via P2P/IPC
scc-f05:9007:9692 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9003:9693 [1] NCCL INFO Connected all rings
scc-f05:9002:9690 [0] NCCL INFO Connected all rings
scc-f05:9003:9693 [1] NCCL INFO Channel 00/0 : 1[52000] -> 0[4f000] via P2P/IPC
scc-f05:9009:9694 [7] NCCL INFO Connected all rings
scc-f05:9003:9693 [1] NCCL INFO Channel 01/0 : 1[52000] -> 0[4f000] via P2P/IPC
scc-f05:9009:9694 [7] NCCL INFO Channel 00/0 : 7[d6000] -> 6[d5000] via P2P/IPC
scc-f05:9009:9694 [7] NCCL INFO Channel 01/0 : 7[d6000] -> 6[d5000] via P2P/IPC
scc-f05:9002:9690 [0] NCCL INFO Connected all trees
scc-f05:9002:9690 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9002:9690 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9009:9694 [7] NCCL INFO Connected all trees
scc-f05:9009:9694 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9003:9693 [1] NCCL INFO Connected all trees
scc-f05:9009:9694 [7] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9003:9693 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9003:9693 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9008:9691 [6] NCCL INFO Connected all trees
scc-f05:9008:9691 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9008:9691 [6] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9005:9696 [3] NCCL INFO Channel 00/0 : 3[57000] -> 2[56000] via P2P/IPC
scc-f05:9005:9696 [3] NCCL INFO Channel 01/0 : 3[57000] -> 2[56000] via P2P/IPC
scc-f05:9004:9697 [2] NCCL INFO Connected all trees
scc-f05:9004:9697 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9004:9697 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9005:9696 [3] NCCL INFO Connected all trees
scc-f05:9005:9696 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9005:9696 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9006:9695 [4] NCCL INFO Connected all trees
scc-f05:9006:9695 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9006:9695 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9008:9691 [6] NCCL INFO comm 0x1ee50540 rank 6 nranks 8 cudaDev 6 busId d5000 - Init COMPLETE
scc-f05:9004:9697 [2] NCCL INFO comm 0x1aa83400 rank 2 nranks 8 cudaDev 2 busId 56000 - Init COMPLETE
scc-f05:9009:9694 [7] NCCL INFO comm 0x16d68140 rank 7 nranks 8 cudaDev 7 busId d6000 - Init COMPLETE
scc-f05:9002:9690 [0] NCCL INFO comm 0x19592ec0 rank 0 nranks 8 cudaDev 0 busId 4f000 - Init COMPLETE
scc-f05:9003:9693 [1] NCCL INFO comm 0x6da4d5c0 rank 1 nranks 8 cudaDev 1 busId 52000 - Init COMPLETE
scc-f05:9005:9696 [3] NCCL INFO comm 0x1971e100 rank 3 nranks 8 cudaDev 3 busId 57000 - Init COMPLETE
scc-f05:9006:9695 [4] NCCL INFO comm 0x193b8540 rank 4 nranks 8 cudaDev 4 busId ce000 - Init COMPLETE
scc-f05:9007:9692 [5] NCCL INFO comm 0x298d52c0 rank 5 nranks 8 cudaDev 5 busId d1000 - Init COMPLETE
```
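[Editor's note] The log above ends with `Init COMPLETE` on all eight ranks, so NCCL initialization itself succeeded and the hang likely occurs in the first collective operation issued by `prepare()`. For anyone debugging a similar hang, a minimal sanity check independent of accelerate can confirm this; the script below is a sketch (the function name and the `torchrun` launch command are illustrative, not from the issue) that runs one `all_reduce` on its own:

```python
# Minimal collective sanity check, independent of accelerate.
# Launch on the affected machine with e.g.:
#   torchrun --nproc_per_node=8 nccl_check.py
# If this also hangs at all_reduce, the problem is in NCCL / the GPU
# topology, not in accelerator.prepare itself.
import torch
import torch.distributed as dist

def sanity_check(backend=None):
    # Pick NCCL on GPU machines, Gloo otherwise (Gloo also allows a
    # single-process smoke test on CPU).
    if backend is None:
        backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # reads RANK/WORLD_SIZE etc. from env
    rank = dist.get_rank()
    if backend == "nccl":
        torch.cuda.set_device(rank)
    device = torch.device("cuda", rank) if backend == "nccl" else torch.device("cpu")
    t = torch.ones(1, device=device)
    dist.all_reduce(t)  # the hang, if any, happens here
    print(f"rank {rank}: all_reduce ok, sum = {t.item()}")
    dist.destroy_process_group()
    return t.item()

if __name__ == "__main__":
    sanity_check()
```

If this script completes on all ranks, the issue is more likely in how the dataloaders/model are prepared; if it hangs, it is an NCCL/interconnect problem.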

zzoliman avatar Jun 05 '23 19:06 zzoliman

We need much more info about your situation: the output of `accelerate env`, which GPU you're using, how you're creating your dataloaders, etc.

muellerzr avatar Jun 05 '23 20:06 muellerzr

I have the same problem, and my env is as follows:

  • Accelerate version: 0.20.3
  • Platform: Linux-4.14.0_1-0-0-44-x86_64-with-debian-stretch-sid
  • Python version: 3.7.0
  • Numpy version: 1.21.5
  • PyTorch version (GPU?): 1.12.1+cu102 (True)
  • PyTorch XPU available: False
  • System RAM: 503.19 GB
  • GPU type: Tesla V100-SXM2-32GB
  • Accelerate default config:
      - compute_environment: LOCAL_MACHINE
      - distributed_type: FSDP
      - mixed_precision: no
      - use_cpu: False
      - num_processes: 2
      - machine_rank: 0
      - num_machines: 1
      - rdzv_backend: static
      - same_network: True
      - main_training_function: main
      - fsdp_config: {'fsdp_auto_wrap_policy': 'SIZE_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_min_num_params': 100000000, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT'}
      - downcast_bf16: no
      - tpu_use_cluster: False
      - tpu_use_sudo: False
      - tpu_env: []
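[Editor's note] A frequent cause of first-collective hangs on a single node is broken GPU peer-to-peer transfers (e.g. ACS/IOMMU interference), which can be ruled out by disabling P2P before any process group is created. `NCCL_P2P_DISABLE` and `NCCL_DEBUG` are standard NCCL environment variables; the helper function below is a hypothetical sketch, not part of accelerate:

```python
# Hypothetical workaround sketch: set NCCL env vars before launching
# training (or export them in the shell before `accelerate launch`).
import os

def apply_nccl_workarounds(env=os.environ):
    # Fall back from P2P/IPC to SHM/socket transports; only takes effect
    # if set before the first NCCL communicator is created.
    env.setdefault("NCCL_P2P_DISABLE", "1")
    # Keep verbose logging so runs with/without the workaround can be compared.
    env.setdefault("NCCL_DEBUG", "INFO")
    return {k: env[k] for k in ("NCCL_P2P_DISABLE", "NCCL_DEBUG")}
```

Equivalently, `NCCL_P2P_DISABLE=1 accelerate launch ...` from the shell. If the hang disappears with P2P disabled, the root cause is the machine's P2P configuration rather than accelerate.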

lsabrinax avatar Jun 28 '23 07:06 lsabrinax

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 22 '23 15:07 github-actions[bot]