bevfusion
Error when replacing torchpack.distributed with torch.distributed.launch
Hi, I tried to use torch.distributed.launch instead of torchpack.distributed for multi-GPU testing. I modified tools/test.py (following mmdetection3d) and created a dist_test.sh as follows:
CONFIG=$1
CHECKPOINT=$2
GPUS=$3
PORT=${PORT:-29500}
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
$(dirname "$0")/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
When I run the following command,
CUDA_VISIBLE_DEVICES=2 ./tools/dist_test.sh configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth 1 --show
I got this error:
File "./tools/test.py", line 248, in <module>
main()
File "./tools/test.py", line 221, in main
outputs = multi_gpu_test(model, data_loader, args.tmpdir, args.gpu_collect)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmdet/apis/test.py", line 96, in multi_gpu_test
for i, data in enumerate(data_loader):
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
return self._get_iterator()
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
w.start()
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'dict_keys' object
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 76380) of binary: /home/ubuntu/anaconda3/envs/mmlab/bin/python
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Please give me some advice, thank you!
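(Side note on the TypeError itself, a minimal sketch independent of this repo: Python 3 cannot pickle dict_keys view objects, and the spawn start method shown in the traceback has to pickle whatever the DataLoader workers receive, so anything reachable from the dataset or sampler that still holds a dict_keys view fails exactly like this. Converting the offending view to a list before the dataloader is built avoids it.)
import pickle

keys = {"a": 1, "b": 2}.keys()
try:
    pickle.dumps(keys)          # raises: cannot pickle 'dict_keys' object
except TypeError as err:
    print(err)
pickle.dumps(list(keys))        # converting the view to a plain list pickles fine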
I haven't tried using torch.distributed.launch for a while. I would suggest you have a look at all the places we use torchpack.distributed in tools/test.py and replace them with the torch.distributed equivalents. I remember that the distributed initialization for torchpack and torch might be different (but I might be wrong, @zhijian-liu can also comment).
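For what it's worth, a rough sketch of that mapping, assuming the usual torchpack.distributed helpers (init() / rank() / size() / local_rank()) and a recent PyTorch whose launcher exports the rendezvous environment variables:
import os
import torch
import torch.distributed

# torchpack dist.init() -> init_process_group; torch.distributed.launch (elastic)
# already exports MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE / LOCAL_RANK.
torch.distributed.init_process_group(backend="nccl")

# torchpack dist.local_rank() -> LOCAL_RANK environment variable
# (older launchers pass --local_rank as a CLI argument instead of the env var).
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# torchpack dist.rank() / dist.size() -> torch.distributed getters
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()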
@kentang-mit Thank you for your reply! I have already replaced torchpack.distributed in tools/test.py with torch.distributed. Specifically, I commented out
# dist.init()
# torch.backends.cudnn.benchmark = True
# torch.cuda.set_device(dist.local_rank())
and added
# init distributed env first, since logger depends on the dist info.
cfg.dist_params = dict(backend='nccl')
print("args.launcher", args.launcher)
if args.launcher == 'none':
    distributed = False
else:
    distributed = True
    init_dist(args.launcher, **cfg.dist_params)
but it seems there are no other places that need to be modified.
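One small thing worth double-checking: init_dist here is presumably the mmcv helper (from mmcv.runner import init_dist), as in mmdetection3d's tools/test.py. With the pytorch launcher it does roughly the following (a sketch; the exact code depends on your mmcv version):
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])                         # set by torch.distributed.launch
torch.cuda.set_device(rank % torch.cuda.device_count())
dist.init_process_group(backend="nccl")                # env:// rendezvous by default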
I just jumped to the function definition of dist.init() in tools/train.py and changed master_host:
# master_host = 'tcp://' + os.environ['MASTER_HOST']
master_host = None
Then it is OK to run:
python -m torch.distributed.launch --nproc_per_node=$GPUS ...
This is for training; I hope that helps you.
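(If I read torchpack's init correctly, and assuming master_host ends up as the init_method argument of torch.distributed.init_process_group, this works because init_method=None falls back to the env:// rendezvous, i.e. the MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE variables that torch.distributed.launch already exports for every process.)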
Thanks @yangxh11 for the help! @Deephome, would you mind having a look at this solution and seeing whether it works for you?
@yangxh11 @kentang-mit Thank you for your suggestion! I'll have a try!
Closed due to inactivity.
@yangxh11, I have tried your suggestion to set master_host = None in dist.init() in torchpack, so that I can run torch.distributed.launch, but I got this error:
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Could you please give me some advice, thank you!
Have you checked the MPI installation?
I just jumped to the function definition of dist.init() in tools/train.py and changed master_host:
# master_host = 'tcp://' + os.environ['MASTER_HOST']
master_host = None
Then it is OK to run:
python -m torch.distributed.launch --nproc_per_node=$GPUS ...
Hello, I encountered an error "RuntimeError: connect() timed out. Original timeout was 1800000 ms" after using the method you provided. May I ask if you have made any other settings?
Nothing special. Have you set the master_port in your command?
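For reference, with the dist_test.sh above the rendezvous port defaults to 29500 (PORT=${PORT:-29500}); exporting PORT before calling the script, or passing --master_port=<free port> to torch.distributed.launch, selects a different one. A connect() timeout like that is commonly caused by two jobs contending for the same master port or by a stale process still holding it, although a blocked port can produce the same symptom.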
Hello, I have also encountered this problem. Have you solved it, and how did you solve it?
I modified the definition of init in torchpack/distributed/context.py, replacing the original mpi4py-based logic with _world_size = int(os.environ['WORLD_SIZE']), _world_rank = int(os.environ['RANK']), and _local_rank = int(os.environ['LOCAL_RANK']). Following @yangxh11's hint, I also set the value of master_port in the program.
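Based on that description (hedged, since torchpack internals differ across versions and the accessor names below are assumed to match the originals), the edited torchpack/distributed/context.py init would look roughly like this, taking all rank information from the variables exported by torch.distributed.launch instead of from mpi4py:
import os
import torch.distributed

_world_size = 1
_world_rank = 0
_local_rank = 0

def init(backend: str = "nccl") -> None:
    # Read rank info from the launcher's environment instead of MPI.COMM_WORLD,
    # which aborts when the processes are not started via mpirun.
    global _world_size, _world_rank, _local_rank
    _world_size = int(os.environ["WORLD_SIZE"])
    _world_rank = int(os.environ["RANK"])
    _local_rank = int(os.environ["LOCAL_RANK"])
    # MASTER_ADDR / MASTER_PORT are also exported by the launcher, so the
    # default env:// rendezvous is sufficient here.
    torch.distributed.init_process_group(
        backend=backend, world_size=_world_size, rank=_world_rank
    )

def size() -> int:
    return _world_size

def rank() -> int:
    return _world_rank

def local_rank() -> int:
    return _local_rank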