kohya_ss
ConnectionError: Tried to launch distributed communication on port `29500`
ConnectionError: Tried to launch distributed communication on port `29500`, but another process is utilizing it. Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.
same issue
This is an `accelerate launch` parameter that can't be set in the GUI... I could add support for it if this is important...
Any solution for this yet? :) I have the same issue, and when I set `--main_process_port=0` it just allocates the GPUs and does not run the actual training.
@yasser-sulaiman maybe just try another port number, such as `--main_process_port=29501`?
Thanks for your answer @yuanzhi-zhu. Yes, it works that way, but I want it to use the next free port instead of one specific port.
@yasser-sulaiman have you managed to fix this?
How does one solve this?
For those using my setup (Slurm, one node, number of GPUs = N, number of tasks = N):
I export WORLD_SIZE=N at the sbatch script level,
and set the other variables (LOCAL_RANK, MASTER_PORT, RANK, etc.) inside the Python script using os.environ.
Then I simply start with:
srun python script.py --vars...
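A rough sketch of what the Python side of that Slurm setup could look like. This is not the commenter's actual script: the helper name `torch_env_from_slurm`, the choice of `127.0.0.1` as `MASTER_ADDR` (single node), and deriving `MASTER_PORT` from the job ID so that all tasks agree on the same port are assumptions made for illustration. `SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NTASKS` and `SLURM_JOB_ID` are the variables Slurm sets for each task.

```python
import os
from typing import Dict, Mapping


def torch_env_from_slurm(slurm_env: Mapping[str, str], base_port: int = 29500) -> Dict[str, str]:
    """Map Slurm's per-task variables to the ones torch.distributed reads.

    Hypothetical helper: the port formula below is an assumption, chosen only
    so that every task in the same job computes the same MASTER_PORT.
    """
    rank = slurm_env.get("SLURM_PROCID", "0")
    local_rank = slurm_env.get("SLURM_LOCALID", "0")
    # Prefer a WORLD_SIZE exported in the sbatch script, else fall back to the task count.
    world_size = slurm_env.get("WORLD_SIZE", slurm_env.get("SLURM_NTASKS", "1"))
    job_id = int(slurm_env.get("SLURM_JOB_ID", "0"))
    # Different jobs on the same node get different ports (29500..30499).
    port = base_port + job_id % 1000
    return {
        "RANK": rank,
        "LOCAL_RANK": local_rank,
        "WORLD_SIZE": world_size,
        "MASTER_ADDR": "127.0.0.1",  # assumption: everything runs on one node
        "MASTER_PORT": str(port),
    }


if __name__ == "__main__":
    # Must run before the process group is initialized.
    os.environ.update(torch_env_from_slurm(os.environ))
    print({k: os.environ[k] for k in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_PORT")})
```

Two jobs whose IDs happen to differ by a multiple of 1000 would still collide, so treat the formula as a starting point rather than a guarantee.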
@MostHumble unfortunately not, I am just assigning a random port manually, like `--main_process_port=12547`.
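One possible way to get "the next free port" without hard-coding a number is to ask the OS for one and pass it to `--main_process_port` yourself. The sketch below is only that, a sketch: `find_free_port` and `build_launch_command` are made-up helper names, and `train.py` is a placeholder script. Note that there is a small window between closing the probe socket and the launcher binding the port in which another process could grab it.

```python
import socket
from typing import List


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (address, port); the OS picked the port for us.
        return sock.getsockname()[1]


def build_launch_command(script: str, port: int) -> List[str]:
    """Build an `accelerate launch` command line that uses the given port."""
    return ["accelerate", "launch", "--main_process_port", str(port), script]


if __name__ == "__main__":
    port = find_free_port()
    # Only prints the command; run it with subprocess.run(...) if desired.
    print(" ".join(build_launch_command("train.py", port)))
```

This keeps the port explicit, which sidesteps whatever goes wrong with `--main_process_port=0` in the report above.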