kohya_ss

ConnectionError: Tried to launch distributed communication on port `29500`

Open zmy-08 opened this issue 11 months ago • 2 comments

ConnectionError: Tried to launch distributed communication on port 29500, but another process is utilizing it. Please specify a different port (such as using the --main_process_port flag or specifying a different main_process_port in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to 0.

zmy-08 avatar Mar 22 '24 02:03 zmy-08

same issue

XiaoXiaoJiangYun avatar Mar 28 '24 04:03 XiaoXiaoJiangYun

This is an accelerate launch parameter that can't be set in the GUI... I could add support for it if this is important...

bmaltais avatar Mar 29 '24 14:03 bmaltais

Any solution for this yet? :) I have the same issue, and when I set --main_process_port=0 it just allocates GPUs and does not run the actual training.

yasser-sulaiman avatar May 14 '24 08:05 yasser-sulaiman

> any solution for this yet :) ? I have the same issue, and when I am setting --main_process_port=0 it just allocates GPUs and does not run the actual training.

@yasser-sulaiman maybe just try another port number, such as --main_process_port=29501?

yuanzhi-zhu avatar May 22 '24 02:05 yuanzhi-zhu

Thanks for your answer @yuanzhi-zhu. Yes, it works that way, but I want it to use the next free port instead of one specific port.
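One possible workaround, as a minimal sketch rather than anything kohya_ss or accelerate provides: ask the OS for an unused port by binding to port 0, then pass that number to --main_process_port. The helper names `find_free_port` and `build_launch_command`, and the script name `train.py`, are hypothetical placeholders.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a TCP port on `host` that is unused right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the kernel pick any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


def build_launch_command(script, port):
    """Build an `accelerate launch` command line using the given port.

    `script` is a placeholder for whatever training script is launched.
    """
    return ["accelerate", "launch", f"--main_process_port={port}", script]


if __name__ == "__main__":
    port = find_free_port()
    # Print the command; it could instead be handed to subprocess.run(...).
    print(" ".join(build_launch_command("train.py", port)))
```

Note that the socket is closed before the launcher starts, so another process could in principle grab the port in between; for a single-user node this is usually good enough.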

yasser-sulaiman avatar May 22 '24 07:05 yasser-sulaiman

@yasser-sulaiman have you managed to fix this?

MostHumble avatar May 30 '24 06:05 MostHumble

How does one solve this?

Aquahugs avatar May 30 '24 21:05 Aquahugs

For those using my setup: Slurm, one node, number of GPUs = N, number of tasks = N.

I export WORLD_SIZE=N at the sbatch script level.

And set up the other vars at the Python script level using os.environ: LOCAL_RANK, MASTER_PORT, RANK, etc.

And simply start with:

srun python script.py --vars...
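A minimal sketch of what the Python side of such a setup could look like. This is not the exact script used; it assumes a single node and that Slurm exports SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS and SLURM_JOB_ID for each task. The helper name `torch_env_from_slurm` and the port formula (a base port plus the job ID modulo 10000) are illustrative assumptions, chosen so that every task of a job computes the same port while different jobs usually do not collide.

```python
import os


def torch_env_from_slurm(env, base_port=20000):
    """Map Slurm task variables to the names torch.distributed reads.

    `env` is any mapping that looks like os.environ. Values already set
    (WORLD_SIZE, MASTER_ADDR, MASTER_PORT) take precedence.
    """
    job_id = int(env.get("SLURM_JOB_ID", "0"))
    # Assumption: derive the port from the job ID so all tasks agree on it.
    default_port = str(base_port + job_id % 10000)
    return {
        "RANK": env["SLURM_PROCID"],          # global task index
        "LOCAL_RANK": env["SLURM_LOCALID"],   # task index on this node
        "WORLD_SIZE": env.get("WORLD_SIZE") or env["SLURM_NTASKS"],
        "MASTER_ADDR": env.get("MASTER_ADDR") or "127.0.0.1",  # single node
        "MASTER_PORT": env.get("MASTER_PORT") or default_port,
    }


if __name__ == "__main__" and "SLURM_PROCID" in os.environ:
    os.environ.update(torch_env_from_slurm(os.environ))
    # The distributed backend would be initialized after this point,
    # e.g. with torch.distributed.init_process_group(...).
```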

MostHumble avatar May 31 '24 04:05 MostHumble

@MostHumble unfortunately not, I am just assigning a random port manually, like --main_process_port=12547

yasser-sulaiman avatar May 31 '24 12:05 yasser-sulaiman