torchtune

Can't run 2 finetunes at the same time

Open rohan-varma opened this issue 2 years ago • 3 comments

To repro, launch 2 distributed finetunes on different CUDA devices. Runs into:

torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
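The failure is an ordinary TCP bind conflict: the second torchrun tries to listen on the same fixed rendezvous port (29500) the first run already holds. A minimal sketch of the same errno 98 (EADDRINUSE) clash, using plain sockets rather than torchrun:

```python
import errno
import socket

# First "run" grabs a port and listens on it. We bind to port 0 only to
# obtain some free port for the demo; the point is the *second* bind.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))          # OS assigns a free port
first.listen()
port = first.getsockname()[1]

# Second "run" tries to bind the same port -> EADDRINUSE, like the
# second finetune hitting the hardcoded 29500.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
    raised = None
except OSError as e:
    raised = e.errno
finally:
    second.close()
    first.close()

print(raised == errno.EADDRINUSE)
```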

This appears to be a known problem in distributed: https://github.com/pytorch/pytorch/issues/73320; port 29500 is hardcoded as the default rendezvous port. As a workaround, we could default --rdzv_endpoint to localhost:0 in the tune CLI so that torchrun automatically selects a free port.
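The reason localhost:0 avoids the clash: binding to port 0 asks the OS to assign any currently unused ephemeral port, so two concurrent launches can never collide on the rendezvous socket. A small sketch of that mechanism (the helper name `free_port` is illustrative, not part of torchrun):

```python
import socket

def free_port() -> int:
    """Bind to port 0 and report which port the OS actually assigned,
    which is what torchrun does when given an endpoint ending in :0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = free_port()
# The OS never assigns port 0 itself; it picks a real free port.
print(0 < port <= 65535)
```

With torchrun directly, the equivalent of the proposed default would be passing `--rdzv_endpoint localhost:0` when launching each finetune.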

rohan-varma avatar Feb 20 '24 03:02 rohan-varma

Obviously, we'll need to document this clearly since it differs from distributed's default, but this sounds good to me!

Curious - why would distributed have hardcoded this as a default in the first place?

joecummings avatar Feb 20 '24 14:02 joecummings

@joecummings assigning this to you. If you agree, please add this, document the change in the README and close the issue.

kartikayk avatar Feb 25 '24 16:02 kartikayk

I also think this issue and #393 are related. @joecummings, I'll let you consolidate these if you agree.

kartikayk avatar Feb 25 '24 17:02 kartikayk