Does not proceed after accelerator.prepare
Hi, I am trying to train a language model with the run_mlm_no_trainer.py script on multiple GPUs. However, I get stuck in the accelerator.prepare() method (below is the line where it hangs).

model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
Actually, it also gets stuck in accelerator.wait_for_everyone(); after I removed that call, it now hangs in prepare() instead.
I ran with NCCL_DEBUG=INFO to get debugging information, and below is the NCCL output produced when prepare() runs. Could you help me with this problem?
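To see where each rank is actually stuck, a stdlib-only faulthandler watchdog can be armed around the hanging call. This is a generic sketch, not Accelerate-specific; the 120-second timeout is an arbitrary assumption.

```python
import faulthandler
import sys

# Arm a watchdog: if the process is still running after 120 seconds
# (pick something longer than a normal prepare() call takes), Python
# dumps every thread's traceback to stderr.
faulthandler.dump_traceback_later(120, file=sys.stderr)

# ... the call that hangs would go here, e.g.:
# model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = \
#     accelerator.prepare(model, optimizer, train_dataloader,
#                         eval_dataloader, lr_scheduler)

# Disarm the watchdog once the call returns normally.
faulthandler.cancel_dump_traceback_later()
```

Running this on every rank shows which call each process is blocked in, which helps distinguish a rendezvous hang from a stuck collective.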
scc-f05:9002:9002 [0] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9002:9002 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9002:9002 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.14.3+cuda11.7
scc-f05:9007:9007 [5] NCCL INFO cudaDriverVersion 12000
scc-f05:9008:9008 [6] NCCL INFO cudaDriverVersion 12000
scc-f05:9003:9003 [1] NCCL INFO cudaDriverVersion 12000
scc-f05:9009:9009 [7] NCCL INFO cudaDriverVersion 12000
scc-f05:9006:9006 [4] NCCL INFO cudaDriverVersion 12000
scc-f05:9008:9008 [6] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9008:9008 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9007:9007 [5] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9003:9003 [1] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9007:9007 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9003:9003 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9009:9009 [7] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9009:9009 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9006:9006 [4] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9006:9006 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9002:9690 [0] NCCL INFO NET/IB : No device found.
scc-f05:9002:9690 [0] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9002:9690 [0] NCCL INFO Using network Socket
scc-f05:9008:9691 [6] NCCL INFO NET/IB : No device found.
scc-f05:9008:9691 [6] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9008:9691 [6] NCCL INFO Using network Socket
scc-f05:9007:9692 [5] NCCL INFO NET/IB : No device found.
scc-f05:9003:9693 [1] NCCL INFO NET/IB : No device found.
scc-f05:9003:9693 [1] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9007:9692 [5] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9003:9693 [1] NCCL INFO Using network Socket
scc-f05:9007:9692 [5] NCCL INFO Using network Socket
scc-f05:9006:9695 [4] NCCL INFO NET/IB : No device found.
scc-f05:9009:9694 [7] NCCL INFO NET/IB : No device found.
scc-f05:9006:9695 [4] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9006:9695 [4] NCCL INFO Using network Socket
scc-f05:9009:9694 [7] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9009:9694 [7] NCCL INFO Using network Socket
scc-f05:9005:9005 [3] NCCL INFO cudaDriverVersion 12000
scc-f05:9005:9005 [3] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9005:9005 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9005:9696 [3] NCCL INFO NET/IB : No device found.
scc-f05:9005:9696 [3] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9005:9696 [3] NCCL INFO Using network Socket
scc-f05:9004:9004 [2] NCCL INFO cudaDriverVersion 12000
scc-f05:9004:9004 [2] NCCL INFO Bootstrap : Using ens106f0:192.168.17.192<0>
scc-f05:9004:9004 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
scc-f05:9004:9697 [2] NCCL INFO NET/IB : No device found.
scc-f05:9004:9697 [2] NCCL INFO NET/Socket : Using [0]ens106f0:192.168.17.192<0>
scc-f05:9004:9697 [2] NCCL INFO Using network Socket
scc-f05:9005:9696 [3] NCCL INFO Setting affinity for GPU 3 to ffff
scc-f05:9004:9697 [2] NCCL INFO Setting affinity for GPU 2 to ffff
scc-f05:9009:9694 [7] NCCL INFO Setting affinity for GPU 7 to ffff0000
scc-f05:9006:9695 [4] NCCL INFO Setting affinity for GPU 4 to ffff0000
scc-f05:9002:9690 [0] NCCL INFO Setting affinity for GPU 0 to ffff
scc-f05:9003:9693 [1] NCCL INFO Setting affinity for GPU 1 to ffff
scc-f05:9007:9692 [5] NCCL INFO Setting affinity for GPU 5 to ffff0000
scc-f05:9008:9691 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000
scc-f05:9008:9691 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
scc-f05:9005:9696 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
scc-f05:9006:9695 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
scc-f05:9004:9697 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
scc-f05:9007:9692 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
scc-f05:9002:9690 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
scc-f05:9003:9693 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
scc-f05:9009:9694 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
scc-f05:9002:9690 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7
scc-f05:9002:9690 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
scc-f05:9004:9697 [2] NCCL INFO Channel 00/0 : 2[56000] -> 3[57000] via P2P/IPC
scc-f05:9008:9691 [6] NCCL INFO Channel 00/0 : 6[d5000] -> 7[d6000] via P2P/IPC
scc-f05:9003:9693 [1] NCCL INFO Channel 00/0 : 1[52000] -> 2[56000] via P2P/IPC
scc-f05:9004:9697 [2] NCCL INFO Channel 01/0 : 2[56000] -> 3[57000] via P2P/IPC
scc-f05:9008:9691 [6] NCCL INFO Channel 01/0 : 6[d5000] -> 7[d6000] via P2P/IPC
scc-f05:9005:9696 [3] NCCL INFO Channel 00 : 3[57000] -> 4[ce000] via SHM/direct/direct
scc-f05:9009:9694 [7] NCCL INFO Channel 00 : 7[d6000] -> 0[4f000] via SHM/direct/direct
scc-f05:9005:9696 [3] NCCL INFO Channel 01 : 3[57000] -> 4[ce000] via SHM/direct/direct
scc-f05:9003:9693 [1] NCCL INFO Channel 01/0 : 1[52000] -> 2[56000] via P2P/IPC
scc-f05:9009:9694 [7] NCCL INFO Channel 01 : 7[d6000] -> 0[4f000] via SHM/direct/direct
scc-f05:9007:9692 [5] NCCL INFO Channel 00/0 : 5[d1000] -> 6[d5000] via P2P/IPC
scc-f05:9007:9692 [5] NCCL INFO Channel 01/0 : 5[d1000] -> 6[d5000] via P2P/IPC
scc-f05:9004:9697 [2] NCCL INFO Connected all rings
scc-f05:9006:9695 [4] NCCL INFO Channel 00/0 : 4[ce000] -> 5[d1000] via P2P/IPC
scc-f05:9008:9691 [6] NCCL INFO Connected all rings
scc-f05:9004:9697 [2] NCCL INFO Channel 00/0 : 2[56000] -> 1[52000] via P2P/IPC
scc-f05:9004:9697 [2] NCCL INFO Channel 01/0 : 2[56000] -> 1[52000] via P2P/IPC
scc-f05:9008:9691 [6] NCCL INFO Channel 00/0 : 6[d5000] -> 5[d1000] via P2P/IPC
scc-f05:9008:9691 [6] NCCL INFO Channel 01/0 : 6[d5000] -> 5[d1000] via P2P/IPC
scc-f05:9006:9695 [4] NCCL INFO Channel 01/0 : 4[ce000] -> 5[d1000] via P2P/IPC
scc-f05:9002:9690 [0] NCCL INFO Channel 00/0 : 0[4f000] -> 1[52000] via P2P/IPC
scc-f05:9006:9695 [4] NCCL INFO Connected all rings
scc-f05:9007:9692 [5] NCCL INFO Connected all rings
scc-f05:9006:9695 [4] NCCL INFO Channel 00 : 4[ce000] -> 3[57000] via SHM/direct/direct
scc-f05:9007:9692 [5] NCCL INFO Channel 00/0 : 5[d1000] -> 4[ce000] via P2P/IPC
scc-f05:9006:9695 [4] NCCL INFO Channel 01 : 4[ce000] -> 3[57000] via SHM/direct/direct
scc-f05:9007:9692 [5] NCCL INFO Channel 01/0 : 5[d1000] -> 4[ce000] via P2P/IPC
scc-f05:9005:9696 [3] NCCL INFO Connected all rings
scc-f05:9007:9692 [5] NCCL INFO Connected all trees
scc-f05:9007:9692 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9002:9690 [0] NCCL INFO Channel 01/0 : 0[4f000] -> 1[52000] via P2P/IPC
scc-f05:9007:9692 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9003:9693 [1] NCCL INFO Connected all rings
scc-f05:9002:9690 [0] NCCL INFO Connected all rings
scc-f05:9003:9693 [1] NCCL INFO Channel 00/0 : 1[52000] -> 0[4f000] via P2P/IPC
scc-f05:9009:9694 [7] NCCL INFO Connected all rings
scc-f05:9003:9693 [1] NCCL INFO Channel 01/0 : 1[52000] -> 0[4f000] via P2P/IPC
scc-f05:9009:9694 [7] NCCL INFO Channel 00/0 : 7[d6000] -> 6[d5000] via P2P/IPC
scc-f05:9009:9694 [7] NCCL INFO Channel 01/0 : 7[d6000] -> 6[d5000] via P2P/IPC
scc-f05:9002:9690 [0] NCCL INFO Connected all trees
scc-f05:9002:9690 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9002:9690 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9009:9694 [7] NCCL INFO Connected all trees
scc-f05:9009:9694 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9003:9693 [1] NCCL INFO Connected all trees
scc-f05:9009:9694 [7] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9003:9693 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9003:9693 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9008:9691 [6] NCCL INFO Connected all trees
scc-f05:9008:9691 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9008:9691 [6] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9005:9696 [3] NCCL INFO Channel 00/0 : 3[57000] -> 2[56000] via P2P/IPC
scc-f05:9005:9696 [3] NCCL INFO Channel 01/0 : 3[57000] -> 2[56000] via P2P/IPC
scc-f05:9004:9697 [2] NCCL INFO Connected all trees
scc-f05:9004:9697 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9004:9697 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9005:9696 [3] NCCL INFO Connected all trees
scc-f05:9005:9696 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9005:9696 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9006:9695 [4] NCCL INFO Connected all trees
scc-f05:9006:9695 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
scc-f05:9006:9695 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
scc-f05:9008:9691 [6] NCCL INFO comm 0x1ee50540 rank 6 nranks 8 cudaDev 6 busId d5000 - Init COMPLETE
scc-f05:9004:9697 [2] NCCL INFO comm 0x1aa83400 rank 2 nranks 8 cudaDev 2 busId 56000 - Init COMPLETE
scc-f05:9009:9694 [7] NCCL INFO comm 0x16d68140 rank 7 nranks 8 cudaDev 7 busId d6000 - Init COMPLETE
scc-f05:9002:9690 [0] NCCL INFO comm 0x19592ec0 rank 0 nranks 8 cudaDev 0 busId 4f000 - Init COMPLETE
scc-f05:9003:9693 [1] NCCL INFO comm 0x6da4d5c0 rank 1 nranks 8 cudaDev 1 busId 52000 - Init COMPLETE
scc-f05:9005:9696 [3] NCCL INFO comm 0x1971e100 rank 3 nranks 8 cudaDev 3 busId 57000 - Init COMPLETE
scc-f05:9006:9695 [4] NCCL INFO comm 0x193b8540 rank 4 nranks 8 cudaDev 4 busId ce000 - Init COMPLETE
scc-f05:9007:9692 [5] NCCL INFO comm 0x298d52c0 rank 5 nranks 8 cudaDev 5 busId d1000 - Init COMPLETE
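The log above shows all eight communicators reaching Init COMPLETE, so the hang is more likely in the first collective than in communicator setup. A minimal all_reduce outside of Accelerate can confirm whether plain NCCL collectives hang too. This is a sketch; the script name is illustrative, and torchrun is assumed to be available.

```python
# nccl_smoke_test.py (illustrative name)
# Launch with: torchrun --nproc_per_node=8 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # One tensor per rank; after all_reduce every rank should hold
    # the sum 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([dist.get_rank()], dtype=torch.float32, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this hangs as well, the problem is at the NCCL/driver level rather than in Accelerate; if it completes, the hang is specific to what prepare() does on top of it.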
We need much more info about your situation: the output of accelerate env, which GPUs you're using, how you're creating your dataloaders, and so on.
I have the same problem, and my env is as follows:

- Accelerate version: 0.20.3
- Platform: Linux-4.14.0_1-0-0-44-x86_64-with-debian-stretch-sid
- Python version: 3.7.0
- Numpy version: 1.21.5
- PyTorch version (GPU?): 1.12.1+cu102 (True)
- PyTorch XPU available: False
- System RAM: 503.19 GB
- GPU type: Tesla V100-SXM2-32GB
- Accelerate default config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: FSDP
  - mixed_precision: no
  - use_cpu: False
  - num_processes: 2
  - machine_rank: 0
  - num_machines: 1
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - fsdp_config: {'fsdp_auto_wrap_policy': 'SIZE_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_min_num_params': 100000000, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT'}
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
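For anyone wanting to experiment: hangs of this kind are sometimes worked around by disabling individual NCCL transports before the first collective runs. This is a hypothetical thing to try, not a confirmed fix; the variables below are standard NCCL knobs.

```python
import os

# These must be set BEFORE torch/NCCL creates its first communicator.
# NCCL_P2P_DISABLE=1 turns off the P2P/IPC paths seen in the log;
# NCCL_SHM_DISABLE=1 turns off the shared-memory transport, forcing
# NCCL to fall back to plain sockets. Slower, but useful for checking
# whether one specific transport is what hangs.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_SHM_DISABLE"] = "1"
```

The same variables can instead be exported in the shell before running accelerate launch.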
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.