
Failed in multi-GPU training

EvW1998 opened this issue 1 year ago · 1 comment

I can train with a single GPU, but when I try to train with multiple GPUs by running dist_train.sh, the program stops without reporting anything.

My dist_train.sh looks like this:

CUDA_VISIBLE_DEVICES=0,1 nohup python3 -m torch.distributed.launch --nproc_per_node=2 --master_port 29501 train.py --launcher pytorch > log.txt&

log.txt shows the following:

/usr/local/miniconda3/envs/pcdt/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

  warnings.warn(
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
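As the warning says, if the newer torch.distributed.run / torchrun launcher is used, the script has to read the local rank from the LOCAL_RANK environment variable instead of a --local_rank argument. A minimal sketch of that change (assuming train.py defines --local_rank via argparse, as OpenPCDet-style scripts typically do):

import argparse
import os

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank on the command line;
# torch.distributed.run / torchrun sets the LOCAL_RANK environment variable
# instead, so fall back to it when the flag is not supplied
parser.add_argument('--local_rank', type=int,
                    default=int(os.environ.get('LOCAL_RANK', 0)))
args, _ = parser.parse_known_args()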


It feels like something is wrong with the distributed setup. Any ideas? Thanks
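As a sanity check, a minimal standalone script launched with the same command can show whether NCCL process-group initialization itself works on this machine (a sketch; check_dist.py is a hypothetical file name):

# check_dist.py -- hypothetical minimal DDP sanity check; launch with e.g.
#   CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 check_dist.py
import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank; torchrun sets LOCAL_RANK instead
parser.add_argument('--local_rank', type=int,
                    default=int(os.environ.get('LOCAL_RANK', 0)))
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl')

# each rank contributes its rank; the result should be the sum over all ranks
t = torch.tensor([float(dist.get_rank())], device='cuda')
dist.all_reduce(t)
print(f'rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}', flush=True)

dist.destroy_process_group()

If this also exits silently, running it with NCCL_DEBUG=INFO set in the environment usually surfaces the underlying NCCL error.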

EvW1998 · Jan 04 '24 15:01