
Error when replacing torchpack.distributed with torch.distributed.launch

Open Deephome opened this issue 2 years ago • 6 comments

Hi, I tried to use torch.distributed.launch instead of torchpack.distributed for multi-GPU testing. I modified tools/test.py (following mmdetection3d) and created a dist_test.sh as follows:

CONFIG=$1
CHECKPOINT=$2
GPUS=$3
PORT=${PORT:-29500}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    $(dirname "$0")/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}

When I run the following command:

CUDA_VISIBLE_DEVICES=2 ./tools/dist_test.sh configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth 1 --show

I get this error:

File "./tools/test.py", line 248, in <module>
    main()
  File "./tools/test.py", line 221, in main
    outputs = multi_gpu_test(model, data_loader, args.tmpdir, args.gpu_collect)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmdet/apis/test.py", line 96, in multi_gpu_test
    for i, data in enumerate(data_loader):
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
    w.start()
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'dict_keys' object
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 76380) of binary: /home/ubuntu/anaconda3/envs/mmlab/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Please give me some advice, thank you!

Deephome avatar Oct 08 '22 08:10 Deephome

I haven't tried using torch.distributed.launch for a while. I would suggest you have a look at all the places we use torchpack.distributed in tools/test.py and replace them with their torch.distributed equivalents. I remember that the distributed initialization for torchpack and torch might be different (but I might be wrong, @zhijian-liu can also comment).
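
Off the top of my head, the replacements would look roughly like the sketch below (untested; it assumes the script is launched with torch.distributed.launch or torchrun, which on recent PyTorch versions export RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT for every worker):

import os
import torch.distributed as dist

# Hedged sketch: plain torch.distributed counterparts of the torchpack helpers
# used in the training/testing scripts. Assumes the env:// rendezvous variables
# above are set by the launcher.

def init():                # torchpack: dist.init()
    # init_method defaults to 'env://', which reads MASTER_ADDR/MASTER_PORT,
    # RANK and WORLD_SIZE from the environment.
    dist.init_process_group(backend='nccl')

def rank() -> int:         # torchpack: dist.rank()
    return dist.get_rank()

def size() -> int:         # torchpack: dist.size()
    return dist.get_world_size()

def local_rank() -> int:   # torchpack: dist.local_rank()
    return int(os.environ['LOCAL_RANK'])

# Typical usage right after init():
#   torch.cuda.set_device(local_rank())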

kentang-mit avatar Oct 08 '22 16:10 kentang-mit

@kentang-mit Thank you for your reply! I already replaced torchpack.distributed with torch.distributed in tools/test.py. Specifically, I commented out

    # dist.init()

    # torch.backends.cudnn.benchmark = True
    # torch.cuda.set_device(dist.local_rank())

and added

    # init distributed env first, since logger depends on the dist info.
    cfg.dist_params = dict(backend='nccl')
    print("args.launcher", args.launcher)    
    if args.launcher == 'none':
        distributed = False
    else:
        distributed = True
        init_dist(args.launcher, **cfg.dist_params)

but it seems there are no other places that need to be modified.
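
For context, the failing call in the traceback is multi_gpu_test from mmdet.apis; a rough sketch of that path is below (not the exact bevfusion code; MMDistributedDataParallel is from mmcv.parallel):

import torch
from mmcv.parallel import MMDistributedDataParallel
from mmdet.apis import multi_gpu_test

def run_distributed_test(model, data_loader, tmpdir=None, gpu_collect=False):
    # Wrap the model for distributed evaluation on the current GPU.
    model = MMDistributedDataParallel(
        model.cuda(),
        device_ids=[torch.cuda.current_device()],
        broadcast_buffers=False,
    )
    # multi_gpu_test iterates the DataLoader; the "cannot pickle 'dict_keys'"
    # error in the traceback is raised while the DataLoader creates its worker
    # processes (spawn start method) and pickles the dataset.
    return multi_gpu_test(model, data_loader, tmpdir, gpu_collect)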

Deephome avatar Oct 09 '22 06:10 Deephome

I just jumped to the definition of dist.init() (called in tools/train.py) and changed master_host:

# master_host = 'tcp://' + os.environ['MASTER_HOST']
master_host = None 

Then it is OK to run:

python -m torch.distributed.launch --nproc_per_node=$GPUS ...

yangxh11 avatar Oct 12 '22 01:10 yangxh11

> I just jumped to the definition of dist.init() (called in tools/train.py) and changed master_host:
>
> # master_host = 'tcp://' + os.environ['MASTER_HOST']
> master_host = None
>
> Then it is OK to run:
>
> python -m torch.distributed.launch --nproc_per_node=$GPUS ...

This is for training; hope that helps you.

yangxh11 avatar Oct 12 '22 01:10 yangxh11

Thanks @yangxh11 for the help! @Deephome, would you mind having a look at this solution to see whether it works for you?

kentang-mit avatar Oct 12 '22 01:10 kentang-mit

@yangxh11 @kentang-mit Thank you for your suggestion! I'll have a try!

Deephome avatar Oct 12 '22 06:10 Deephome

Closed due to inactivity.

kentang-mit avatar Nov 04 '22 11:11 kentang-mit

@yangxh11, I tried your suggestion to set master_host = None in dist.init() in torchpack. I can now run torch.distributed.launch, but I get this error:

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Could you please give me some advice? Thank you!

YoushaaMurhij avatar Jan 03 '23 17:01 YoushaaMurhij

Have you checked the MPI installation?

yangxh11 avatar Feb 03 '23 08:02 yangxh11

> I just jumped to the definition of dist.init() (called in tools/train.py) and changed master_host:
>
> # master_host = 'tcp://' + os.environ['MASTER_HOST']
> master_host = None
>
> Then it is OK to run:
>
> python -m torch.distributed.launch --nproc_per_node=$GPUS ...

Hello, I encountered the error "RuntimeError: connect() timed out. Original timeout was 1800000 ms" after using the method you provided. Did you make any other settings?

hasaikeyQAQ avatar Apr 15 '23 10:04 hasaikeyQAQ

> > I just jumped to the definition of dist.init() (called in tools/train.py) and changed master_host:
> >
> > # master_host = 'tcp://' + os.environ['MASTER_HOST']
> > master_host = None
> >
> > Then it is OK to run:
> >
> > python -m torch.distributed.launch --nproc_per_node=$GPUS ...
>
> Hello, I encountered the error "RuntimeError: connect() timed out. Original timeout was 1800000 ms" after using the method you provided. Did you make any other settings?

Nothing special. Have you set the master_port in your command?

yangxh11 avatar Apr 23 '23 09:04 yangxh11

> > I just jumped to the definition of dist.init() (called in tools/train.py) and changed master_host:
> >
> > # master_host = 'tcp://' + os.environ['MASTER_HOST']
> > master_host = None
> >
> > Then it is OK to run:
> >
> > python -m torch.distributed.launch --nproc_per_node=$GPUS ...
>
> Hello, I encountered the error "RuntimeError: connect() timed out. Original timeout was 1800000 ms" after using the method you provided. Did you make any other settings?

Hello, I have also encountered this problem. Have you solved it, and how?

study0101 avatar Apr 29 '23 09:04 study0101

I modified the definition of init() in torchpack/distributed/context.py, replacing the original use of the mpi4py package with _world_size = int(os.environ['WORLD_SIZE']), _world_rank = int(os.environ['RANK']), and _local_rank = int(os.environ['LOCAL_RANK']). Following the point raised by @yangxh11, I also set the master_port in the program.
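
Roughly, the modified init() looks like the sketch below (reconstructed from memory, not the exact file; variable names follow the original torchpack code, and the launcher is assumed to export WORLD_SIZE, RANK, LOCAL_RANK, MASTER_ADDR and MASTER_PORT):

import os
import torch.distributed as dist

_world_size = 1
_world_rank = 0
_local_rank = 0

def init() -> None:
    global _world_size, _world_rank, _local_rank
    # Read the process layout from the variables exported by
    # torch.distributed.launch / torchrun instead of querying mpi4py,
    # so MPI is never touched.
    _world_size = int(os.environ['WORLD_SIZE'])
    _world_rank = int(os.environ['RANK'])
    _local_rank = int(os.environ['LOCAL_RANK'])
    # master_host = 'tcp://' + os.environ['MASTER_HOST']  # original line
    # MASTER_ADDR/MASTER_PORT are already set by the launcher, so env:// works:
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=_world_size,
        rank=_world_rank,
    )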

> > > I just jumped to the definition of dist.init() (called in tools/train.py) and changed master_host:
> > >
> > > # master_host = 'tcp://' + os.environ['MASTER_HOST']
> > > master_host = None
> > >
> > > Then it is OK to run:
> > >
> > > python -m torch.distributed.launch --nproc_per_node=$GPUS ...
> >
> > Hello, I encountered the error "RuntimeError: connect() timed out. Original timeout was 1800000 ms" after using the method you provided. Did you make any other settings?
>
> Hello, I have also encountered this problem. Have you solved it, and how?

hasaikeyQAQ avatar Apr 29 '23 09:04 hasaikeyQAQ