
LoRA weight merging gives a torch.distributed error on a single-node, single-GPU setup

ayushbits opened this issue 10 months ago · 0 comments

I am running this notebook. However, when I try to merge the LoRA adapter weights with the base model weights before exporting to TensorRT-LLM (python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py), I receive the following error:

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:53747 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Error executing job with overrides: ['trainer.accelerator=gpu', 'tensor_model_parallel_size=1', 'pipeline_model_parallel_size=1', 'gpt_model_file=gemma_2b_pt.nemo', 'lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo', 'merged_model_path=gemma_lora_pubmedqa_merged.nemo']
Traceback (most recent call last):
 File "/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py", line 171, in main
   model = MegatronGPTModel.restore_from(
 File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 478, in restore_from
   return super().restore_from(
 File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/modelPT.py", line 468, in restore_from
   instance = cls._save_restore_connector.restore_from(
 File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 1306, in restore_from
   trainer.strategy.setup_environment()
 File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 154, in setup_environment
   self.setup_distributed()
 File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 244, in setup_distributed
   super().setup_distributed()
 File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 203, in setup_distributed
   _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
 File "/usr/local/lib/python3.10/dist-packages/lightning_fabric/utilities/distributed.py", line 297, in _init_dist_connection
   torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
   func_return = func(*args, **kwargs)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1172, in init_process_group
   store, rank, world_size = next(rendezvous_iterator)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 244, in _env_rendezvous_handler
   store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
   return TCPStore(
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53747 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
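
For what it's worth, here is a quick diagnostic sketch to check whether another process (for example a stale notebook kernel or a leftover run) is still listening on the port the c10d TCPStore tried to bind. It assumes psutil is available inside the container; the port number is taken from the traceback above.

```python
# Diagnostic sketch: list any process currently listening on the port
# that the c10d TCPStore failed to bind (53747 in the traceback above).
import psutil

PORT = 53747  # port taken from the error message

for conn in psutil.net_connections(kind="inet"):
    if conn.laddr and conn.laddr.port == PORT and conn.status == psutil.CONN_LISTEN:
        # conn.pid can be None if the process belongs to another user
        # and we lack the permissions to inspect it.
        name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
        print(f"port {PORT} is held by pid={conn.pid} ({name})")
```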

Setup Information:

torch: 2.2.0a0+81ea7a4
nemo: 2.0
Container: nvcr.io/nvidia/nemo:24.01.gemma
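
As a possible workaround (not verified, just a sketch): if the bind failure comes from a previous run still holding the rendezvous port, picking a fresh free port and exporting it as MASTER_PORT before launching merge.py might avoid the collision. The overrides below are copied from the command in the traceback; whether the script actually honors MASTER_ADDR/MASTER_PORT depends on the Lightning cluster environment it sets up, so treat this as an assumption rather than an official fix.

```python
# Workaround sketch: pick an unused TCP port, export it as MASTER_PORT,
# and launch merge.py in a fresh subprocess with that environment.
import os
import socket
import subprocess

def find_free_port() -> int:
    # Bind to port 0 so the OS picks an unused port, then release it.
    # (Small race window between release and reuse, but usually fine.)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

env = dict(os.environ)
env["MASTER_ADDR"] = "localhost"
env["MASTER_PORT"] = str(find_free_port())

subprocess.run(
    [
        "python",
        "/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py",
        "trainer.accelerator=gpu",
        "tensor_model_parallel_size=1",
        "pipeline_model_parallel_size=1",
        "gpt_model_file=gemma_2b_pt.nemo",
        "lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo",
        "merged_model_path=gemma_lora_pubmedqa_merged.nemo",
    ],
    env=env,
    check=True,
)
```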

ayushbits · Dec 11 '24 06:12