
Multi-GPU training error: zmq.error.ZMQError: Address already in use

Open Nuwan1654 opened this issue 1 year ago • 2 comments

Describe the bug

When trying to run multi-GPU training, I get a ZMQ error.

Steps to reproduce

Run the following code in an IPython notebook:

from super_gradients.training import Trainer
import super_gradients
from super_gradients.training.utils.distributed_training_utils import setup_device
from super_gradients.common.data_types.enum import MultiGPUMode
setup_device(multi_gpu=MultiGPUMode.DISTRIBUTED_DATA_PARALLEL, num_gpus=4)

CHECKPOINT_DIR = 'checkpoints'
trainer = Trainer(experiment_name='my_first_yolonas_run', ckpt_root_dir=CHECKPOINT_DIR)

Error:

/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
[2023-05-05 00:20:09] INFO - crash_tips_setup.py - Crash tips is enabled. You can set your environment variable to CRASH_HANDLER=FALSE to disable it
The console stream is logged into /home/dinusha/sg_logs/console.log
/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[2023-05-05 00:20:10] WARNING - __init__.py - Failed to import pytorch_quantization
[2023-05-05 00:20:10] WARNING - calibrator.py - Failed to import pytorch_quantization
[2023-05-05 00:20:10] WARNING - export.py - Failed to import pytorch_quantization
[2023-05-05 00:20:10] WARNING - selective_quantization_utils.py - Failed to import pytorch_quantization
[2023-05-05 00:20:10] INFO - distributed_training_utils.py - Launching DDP with:
   - ddp_port = 57787
   - num_gpus = 4/4 available
-------------------------------------

[2023-05-05 00:20:10] INFO - static_tcp_rendezvous.py - Creating TCPStore as the c10d::Store implementation
/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/traitlets/traitlets.py:2548: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.
  warn(
/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/traitlets/traitlets.py:2499: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use 'f4fd0bdd-aca8-4881-a46e-1d203e82b684' instead of 'b"f4fd0bdd-aca8-4881-a46e-1d203e82b684"'.
  warn(
Traceback (most recent call last):
  File "/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/traitlets/config/application.py", line 1042, in launch_instance
    app.initialize(argv)
  File "/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/traitlets/config/application.py", line 113, in inner
    return method(app, *args, **kwargs)
  File "/home/dinusha/.virtualenvs/yolo-nas/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 678, in initialize
    self.init_sockets()
...
  File "zmq/backend/cython/socket.pyx", line 564, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
[2023-05-05 00:20:15] ERROR - api.py - failed (exitcode: 1) local_rank: 0 (pid: 2508290) of binary: /home/dinusha/.virtualenvs/yolo-nas/bin/python

Nuwan1654, May 05 '23 11:05

We currently don't support running DDP training from a notebook. As a workaround, you can try the %%writefile magic trick shown in this Kaggle notebook: https://www.kaggle.com/code/onodera/ddp-example

BloodAxe, May 05 '23 11:05
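The traceback in the report suggests that the DDP launcher re-ran the notebook kernel's own entry point (ipykernel_launcher.py), which then tried to bind ZMQ ports the running kernel already holds. The workaround avoids this by moving the training code into a standalone script and running it in a separate Python process. Below is a minimal sketch of that idea in plain Python rather than the %%writefile magic; the file name train_ddp.py and the helpers write_script and build_command are hypothetical names chosen for illustration, and the script body simply reuses the calls from the report. It has not been verified against any particular super-gradients version.

```python
import subprocess
import sys
import textwrap
from pathlib import Path

# Training code taken from the report above. It is written to a file instead of
# being executed inside the notebook kernel.
TRAIN_SCRIPT = textwrap.dedent(
    """\
    from super_gradients.training import Trainer
    from super_gradients.training.utils.distributed_training_utils import setup_device
    from super_gradients.common.data_types.enum import MultiGPUMode

    if __name__ == "__main__":
        setup_device(multi_gpu=MultiGPUMode.DISTRIBUTED_DATA_PARALLEL, num_gpus=4)
        trainer = Trainer(experiment_name="my_first_yolonas_run", ckpt_root_dir="checkpoints")
    """
)


def write_script(path="train_ddp.py"):
    """Write the training code to `path` (hypothetical file name) and return it as a Path."""
    script = Path(path)
    script.write_text(TRAIN_SCRIPT)
    return script


def build_command(script):
    """Build the command that runs the script with the current Python interpreter."""
    return [sys.executable, str(script)]


# In a notebook cell, the two helpers would be combined like this (left commented
# out because it needs super-gradients and GPUs to be available):
# script = write_script()
# subprocess.run(build_command(script), check=True)
```

With this layout, any relaunch performed by setup_device would target train_ddp.py rather than the kernel launcher, so the notebook is only used to write and start the script.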