Training error with multiple nodes using LaSOT
We are hitting the following errors when training with multiple nodes on LaSOT:
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800265 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
We have not run into this problem before. It looks like one node called torch.distributed.all_reduce while the other nodes never reached the collective and simply ran ahead.
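For anyone debugging the same symptom: the hang described above comes from a mismatch in collective calls, where some ranks enter all_reduce and others never do, so the participating ranks block until the NCCL watchdog timeout (30 minutes, matching the Timeout(ms)=1800000 in the log). Below is a minimal, self-contained sketch of that failure mode; it is not SwinTrack code, and the rank-0 condition is purely illustrative.

```python
import os
import torch
import torch.distributed as dist

def main():
    # Assumes the script is launched with torchrun / torch.distributed.run,
    # which sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    x = torch.ones(1, device=f"cuda:{local_rank}")

    # Failure pattern: only some ranks reach the collective.
    # Rank 0 blocks inside all_reduce waiting for the others, and after
    # 1800000 ms the NCCL watchdog tears the process group down, producing
    # the "Watchdog caught collective operation timeout" error above.
    if dist.get_rank() == 0:
        dist.all_reduce(x)
    # The other ranks skip the call entirely and "run away".

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The fix is to make every rank execute the same collectives in the same order, e.g. by keeping rank-dependent work outside any branch that guards a collective call.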
Thanks for the reply.
We use a machine with two GPUs for the experiments and run the command ./run.sh SwinTrack Tiny --output_dir /userhome/SwinTrack/output/
We found that one GPU's volatile utilization (as reported by nvidia-smi) stays at 100% while the other GPU's stays at 0%, even though memory usage looks normal. The log output is as follows:
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=13570 group_rank=0 group_world_size=1 local_ranks=[0, 1] role_ranks=[0, 1] global_ranks=[0, 1] role_world_sizes=[2, 2] global_world_sizes=[2, 2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_aenfcflm/Tiny-2022.01.12-20.27.32-832092_03nuwq7j/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_aenfcflm/Tiny-2022.01.12-20.27.32-832092_03nuwq7j/attempt_0/1/error.json
| distributed init (rank 0[0.0]/2) using nccl
| distributed init (rank 1[0.1]/2) using nccl
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
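As for the two ProcessGroupNCCL warnings at the end of that log, they usually go away if each process is pinned to its GPU before the first barrier and device_ids is passed explicitly. A hedged sketch of that pattern, not taken from the SwinTrack sources; LOCAL_RANK is the variable set by torchrun / torch.distributed.run:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # provided by torchrun / torch.distributed.run
torch.cuda.set_device(local_rank)           # bind this process to its own GPU up front

dist.init_process_group(backend="nccl")

# Passing device_ids tells NCCL which GPU to use for the barrier, instead of the
# "best-guess GPU" mentioned in the warning, and avoids a wrong rank-to-GPU mapping.
dist.barrier(device_ids=[local_rank])
```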
Is the data loading process handled only by the CPU?
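For context on this question: in a standard PyTorch pipeline, decoding and augmentation run in CPU worker processes, and only the finished batch is copied to the GPU, so a starved loader shows up as 0% GPU utilization on that rank. Below is a generic sketch with a placeholder dataset and placeholder settings, not the actual SwinTrack / LaSOT configuration.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    # Placeholder dataset standing in for the LaSOT loader.
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        # __getitem__ runs in a CPU worker process when num_workers > 0.
        return torch.randn(3, 224, 224), 0

loader = DataLoader(
    DummyDataset(),
    batch_size=32,    # placeholder value
    num_workers=4,    # CPU processes do the actual loading/augmentation
    pin_memory=True,  # page-locked host memory speeds up the host-to-device copy
)

for images, labels in loader:
    # Only this copy touches the GPU; everything before it is CPU work.
    images = images.cuda(non_blocking=True)
    break
```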