Can't run 2 finetunes at the same time
To reproduce, launch two distributed finetunes on different CUDA devices. The second run fails with:
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
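A minimal sketch of the collision using plain sockets (not torchrun itself): two servers trying to listen on the same port, the way two concurrent launches with the hardcoded default 29500 would. The demo binds port 0 first so it doesn't clash with anything else on the machine.

```python
import errno
import socket

# First "rendezvous server": grab a free port and listen on it.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("0.0.0.0", 0))
server.listen()
port = server.getsockname()[1]

# Second "rendezvous server": binding the same port raises
# OSError with errno EADDRINUSE (98 on Linux).
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("0.0.0.0", port))
    collided = False
except OSError as e:
    collided = e.errno == errno.EADDRINUSE
finally:
    second.close()
    server.close()

print(collided)
```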
This appears to be a known problem in torch.distributed: https://github.com/pytorch/pytorch/issues/73320. Port 29500 is hardcoded as the default. As a workaround, we could default --rdzv_endpoint to localhost:0 in the tune CLI so that torchrun automatically selects a free port.
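The localhost:0 trick relies on standard OS behavior, illustrated here with a plain socket rather than torchrun: binding to port 0 asks the kernel for any free ephemeral port.

```python
import socket

# Binding to port 0 lets the OS pick a free ephemeral port; this is
# the behavior torchrun inherits when the rendezvous endpoint is
# localhost:0, so concurrent launches never contend for 29500.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("localhost", 0))
port = s.getsockname()[1]
s.close()
print(port)
```

Each launch gets its own port this way, so two finetunes can start simultaneously without coordinating.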
We'll need to document this clearly, since it differs from torch.distributed's default behavior, but this sounds good to me!
Curious - why would distributed have hardcoded this as a default in the first place?
@joecummings assigning this to you. If you agree, please add this, document the change in the README, and close the issue.
I also think this issue and #393 are related. @joecummings I'll let you consolidate these if you agree.