
RuntimeError: [2] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout


Hello, Jerry Sun. Thank you for sharing your implementation of DDP training for CrossPoint.

When I ran the training, I hit this error: work = default_pg.allgather([tensor_list], [tensor]) RuntimeError: [3] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout

It seems that the processes failed to communicate with each other when 'allgather' was executed.

Here are the Parser settings: Namespace(backend='nccl', batch_size=1024, class_choice=None, dropout=0.5, emb_dims=1024, epochs=250, eval=False, exp_name='exp', ft_dataset='ModelNet40', gpu_id=0, img_model_path='', k=20, lr=0.001, master_addr='localhost', master_port='12355', model='dgcnn', model_path='', momentum=0.9, no_cuda=False, num_classes=40, num_ft_points=1024, num_pt_points=2048, num_workers=32, print_freq=200, rank=-1, resume=False, save_freq=50, scheduler='cos', seed=1, test_batch_size=16, use_sgd=False, wb_key='local-e6f***', wb_url='http://localhost:28282', world_size=4)

I was training the model on a server with 4 Nvidia 2080 Ti GPUs. Running environment: Ubuntu 18.04, Nvidia driver 525.89.02, CUDA 10.2.

Here is what I have tried so far: to figure out why the processes failed to communicate, I monitored the system status with htop and nvidia-smi.

They showed that only GPU 0 was busy while the rest were idle, even though the program had allocated memory on all four GPUs. I suppose the model was copied to the 4 GPUs, but no data was sent to GPUs 1, 2 and 3, so the master process could not get a response from the other processes. (screenshots of htop and nvidia-smi attached)
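For reference, a minimal per-rank check along these lines (a sketch; `rank` is whatever index the launcher passes to each process) would confirm whether every process is actually bound to its own GPU:

    import torch

    def report_device(rank):
        # Each rank should report a different device index; if every rank
        # prints cuda:0, all processes are piling onto GPU 0.
        print(f"rank {rank}: current device = cuda:{torch.cuda.current_device()}, "
              f"visible GPUs = {torch.cuda.device_count()}")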

Could you provide any ideas about how to fix the problem?

Thank you for your time! ;)

dempsey-wen · Dec 15 '23 08:12

Do you have the same environment settings as mine? I listed my environment settings in the README.md, such as the CUDA and PyTorch versions.

I haven't encountered this problem, so I'm not sure about the cause of your bug. I suggest you reproduce the experiments with my settings. Also make sure ports 12355 and 28282 are not used by other processes, since the experiments use them.
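For instance, a quick check along these lines (a sketch, using the port numbers from your settings) tells you whether a port is already bound on localhost:

    import socket

    def port_in_use(port, host="localhost"):
        # Try to bind the port; failure means another process already holds it.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
                return False
            except OSError:
                return True

    for port in (12355, 28282):
        print(port, "in use" if port_in_use(port) else "free")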

auniquesun · Dec 19 '23 07:12

Yes, I created a new conda env CrossPoint and ran:

    pip install torch==1.11.0+cu102 torchvision==0.12.0+cu102 --extra-index-url https://download.pytorch.org/whl/cu102
    pip install -r requirements.txt

Here are the packages installed in the CrossPoint env:

    cudatoolkit 10.2.89
    python 3.7.13
    pytorch 1.11.0
    torch 1.11.0+cu102
    torchvision 0.12.0+cu102
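These can be double-checked from Python, e.g.:

    import torch

    # Quick sanity check of the installed build
    print(torch.__version__)          # expected 1.11.0+cu102
    print(torch.version.cuda)         # expected 10.2
    print(torch.cuda.device_count())  # expected 4 on this server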

I believe port 28282 is not occupied by another program, since I can access the wandb dashboard. I also tried changing the master port to 12366, but the same issue remained.

dempsey-wen · Dec 19 '23 08:12

Update: the problem is solved by adding the following settings before DDP initialization (dist.init_process_group()):

    # make all four GPUs visible to every process (if not already set)
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0, 1, 2, 3")
    # bind this process to the GPU matching its rank before the process group is created
    torch.cuda.set_device(rank)
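For context, here is a minimal sketch of where these lines sit in a per-process worker (the function name main_worker and the tcp:// init method are illustrative, not the exact CrossPoint-DDP code; master_addr and master_port are the values from the parser settings above):

    import os
    import torch
    import torch.distributed as dist

    def main_worker(rank, world_size, args):
        # Make all four GPUs visible to every spawned process.
        os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0, 1, 2, 3")
        # Bind this process to its own GPU before creating the process group;
        # otherwise several ranks may end up on cuda:0 and NCCL collectives fail.
        torch.cuda.set_device(rank)

        dist.init_process_group(
            backend="nccl",
            init_method=f"tcp://{args.master_addr}:{args.master_port}",
            rank=rank,
            world_size=world_size,
        )
        # ... build the model, wrap it in DistributedDataParallel, train ...
        dist.destroy_process_group()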

Here is how I tracked it down: the error appeared right after the training of epoch 0 finished, as the log shows (screenshot attached).

Since the error was reported to be raised while executing all_gather_object(), I tested that function at the beginning of my code and received a different error message: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755861072/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, invalid usage, NCCL version 21.0.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
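For reference, the test was essentially of this form (a sketch; the real training code gathers its own objects):

    import torch.distributed as dist

    def smoke_test_all_gather(rank, world_size):
        # Gather a trivial Python object from every rank right after init;
        # with the wrong device binding this call raised ncclInvalidUsage.
        gathered = [None for _ in range(world_size)]
        dist.all_gather_object(gathered, {"rank": rank})
        if rank == 0:
            print(gathered)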

The new error message helped me find the solution.

dempsey-wen · Jan 12 '24 07:01

@dempsey-wen Congratulations! Great Job!

auniquesun · Jan 12 '24 11:01