
OSError: [Errno 98] Address already in use

Open chencjcj opened this issue 1 month ago • 3 comments

dlrover version: v0.3.5
megatron version: main

I encountered an error when using flash checkpoint in Megatron:

Exception in thread checkpoint-saver:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 422, in _saver
    saver: AsyncCheckpointSaver = class_def(**class_meta.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 386, in __init__
    self._event_queue = SharedQueue(name=qname, create=True)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 369, in __init__
    super().__init__(name, create)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 188, in __init__
    self._init_socket()
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 210, in _init_socket
    self._server = _create_socket_server(self._socket_file)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 71, in _create_socket_server
    server.bind(path)
OSError: [Errno 98] Address already in use
Exception ignored in: <function AsyncCheckpointSaver.__del__ at 0x7efed6fbb490>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 402, in __del__
[2024-05-10 07:57:02,115] [INFO] [ckpt_saver.py:429:_factory] Start the checkpoint saver factory.
    self.close()
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 494, in close
    if not self._event_queue.empty():
AttributeError: 'MegatronCheckpointSaver' object has no attribute '_event_queue'
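
For context, the first error in the traceback above can be reproduced outside dlrover. The sketch below assumes the queue is backed by an `AF_UNIX` socket bound to a file path, which is what `server.bind(path)` on `self._socket_file` suggests. The helper `bind_unix_socket` and its `remove_stale` flag are hypothetical names for illustration, not part of dlrover. It shows that closing a Unix socket leaves its file behind, that a second `bind` to the same path then fails with `EADDRINUSE` (errno 98 on Linux), and that removing the stale file first lets the bind succeed.

```python
import errno
import os
import socket
import tempfile


def bind_unix_socket(path, remove_stale=False):
    """Bind an AF_UNIX stream socket to `path`.

    Hypothetical helper for illustration only; it is not dlrover's API.
    """
    if remove_stale and os.path.exists(path):
        # Assumption: no live process is still serving on this path.
        os.unlink(path)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(path)
    except OSError:
        server.close()
        raise
    server.listen(1)
    return server


def demo():
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.sock")
    results = {}

    # First bind creates the socket file; close() does not delete it.
    first = bind_unix_socket(path)
    first.close()
    results["file_left_behind"] = os.path.exists(path)

    # Second bind to the same path fails because the file still exists.
    try:
        bind_unix_socket(path).close()
        results["second_bind_errno"] = None
    except OSError as exc:
        results["second_bind_errno"] = exc.errno

    # Removing the stale file first allows the bind to succeed.
    third = bind_unix_socket(path, remove_stale=True)
    results["bind_after_unlink_ok"] = True
    third.close()

    os.unlink(path)
    os.rmdir(tmpdir)
    return results


if __name__ == "__main__":
    print(demo())
```

If the same thing is happening here, a socket file left over from an earlier run (or a saver still running on the node) would explain the `bind` failure. Unlinking is only safe when no other process is still using that path. The second `AttributeError` looks like a follow-on effect: `__init__` raised on the line that assigns `_event_queue`, so `close()` called from `__del__` finds the attribute missing.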

chencjcj · May 10 '24 08:05