super-gradients
Multi-GPU training error: zmq.error.ZMQError: Address already in use
Describe the bug
When trying to run multi-GPU training, I get a ZMQ error.
Steps to reproduce
Run the following code in an IPython notebook:
from super_gradients.training import Trainer
import super_gradients
from super_gradients.training.utils.distributed_training_utils import setup_device
from super_gradients.common.data_types.enum import MultiGPUMode
setup_device(multi_gpu=MultiGPUMode.DISTRIBUTED_DATA_PARALLEL, num_gpus=4)
CHECKPOINT_DIR = 'checkpoints'
trainer = Trainer(experiment_name='my_first_yolonas_run', ckpt_root_dir=CHECKPOINT_DIR)
Error:
/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
[2023-05-05 00:20:09] INFO - crash_tips_setup.py - Crash tips is enabled. You can set your environment variable to CRASH_HANDLER=FALSE to disable it
The console stream is logged into /home/dinusha/sg_logs/console.log
/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
[2023-05-05 00:20:10] WARNING - __init__.py - Failed to import pytorch_quantization
[2023-05-05 00:20:10] WARNING - calibrator.py - Failed to import pytorch_quantization
[2023-05-05 00:20:10] WARNING - export.py - Failed to import pytorch_quantization
[2023-05-05 00:20:10] WARNING - selective_quantization_utils.py - Failed to import pytorch_quantization
[2023-05-05 00:20:10] INFO - distributed_training_utils.py - Launching DDP with:
- ddp_port = 57787
- num_gpus = 4/4 available
-------------------------------------
[2023-05-05 00:20:10] INFO - static_tcp_rendezvous.py - Creating TCPStore as the c10d::Store implementation
/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/traitlets/traitlets.py:2548: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.
warn(
/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/traitlets/traitlets.py:2499: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use 'f4fd0bdd-aca8-4881-a46e-1d203e82b684' instead of 'b"f4fd0bdd-aca8-4881-a46e-1d203e82b684"'.
warn(
Traceback (most recent call last):
File "/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/ipykernel_launcher.py", line 17, in <module>
app.launch_new_instance()
File "/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/traitlets/config/application.py", line 1042, in launch_instance
app.initialize(argv)
File "/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/traitlets/config/application.py", line 113, in inner
return method(app, *args, **kwargs)
File "/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 678, in initialize
self.init_sockets()
...
File "zmq/backend/cython/socket.pyx", line 564, in zmq.backend.cython.socket.Socket.bind
File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
[2023-05-05 00:20:15] ERROR - api.py - failed (exitcode: 1) local_rank: 0 (pid: 2508290) of binary: /home/dinusha/.virtualenvs/yolo-nas/bin/python
We currently don't support running DDP training from a notebook.
As a workaround, you can use the %%writefile magic to save the training code to a script file and run that script as a separate process,
as shown in this Kaggle notebook: https://www.kaggle.com/code/onodera/ddp-example
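A minimal sketch of that workaround in plain Python, for readers who want to see its shape. The traceback above suggests that the DDP launcher re-ran ipykernel_launcher.py, which then failed to bind the kernel's sockets a second time. Putting the training code in its own script avoids that. The file name train_ddp.py and the helpers write_script and run_script are hypothetical names chosen for illustration. The training code is copied from the report, and wrapping it in a __main__ guard is an assumption rather than a documented requirement.

```python
import subprocess
import sys
from pathlib import Path

# Training code from the report, kept as text so that it runs in its own
# process instead of inside the notebook kernel.
TRAIN_SCRIPT = '''\
from super_gradients.training import Trainer
from super_gradients.training.utils.distributed_training_utils import setup_device
from super_gradients.common.data_types.enum import MultiGPUMode

if __name__ == "__main__":
    setup_device(multi_gpu=MultiGPUMode.DISTRIBUTED_DATA_PARALLEL, num_gpus=4)
    trainer = Trainer(experiment_name="my_first_yolonas_run", ckpt_root_dir="checkpoints")
'''


def write_script(path, source):
    """Write `source` to `path`, as %%writefile does for a cell body."""
    script = Path(path)
    script.write_text(source)
    return script


def run_script(script):
    """Run the script with the current interpreter in a fresh process."""
    return subprocess.run(
        [sys.executable, str(script)],
        capture_output=True,
        text=True,
    )
```

In a notebook cell this could be used as script = write_script("train_ddp.py", TRAIN_SCRIPT) followed by result = run_script(script), then printing result.stdout and result.stderr to inspect the training logs. The equivalent with notebook magics is a cell that starts with %%writefile train_ddp.py, followed by a cell that runs !python train_ddp.py.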