Tim Wolff-Piggott
I realise this issue might be better suited to [optuna-examples](https://github.com/optuna/optuna-examples); happy to move it over if this is the case.
Thanks @Crissman; apologies for the delay on this; will open in examples.
Yup, no problem. I'd actually be happy to contribute when you open the new one. I implemented the proposal and it works well with TorchElastic :)
This was a big source of confusion, for my team at least: we were referring to the Terraform provider docs and apparently configuring retries successfully at the job level,...
This issue appears to be resolved by setting the environment variable `NCCL_SOCKET_IFNAME=eth0` inside each pod. This is slightly confusing because, according to NVIDIA's [documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-ifname), the loopback interface should only be...
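
For anyone who would rather apply the workaround from the training script than from the pod spec, here is a minimal sketch. The helper name `configure_nccl_interface` is hypothetical, and `eth0` is only an assumption about what the pod's network interface is called; check the actual interface name inside your own pods.

```python
import os


def configure_nccl_interface(ifname: str = "eth0") -> str:
    """Hypothetical helper: pin NCCL's socket interface unless one is already set."""
    # setdefault keeps any value already injected through the pod spec,
    # so an explicit setting on the pod takes precedence over this default.
    return os.environ.setdefault("NCCL_SOCKET_IFNAME", ifname)


# Call this before the NCCL process group is created
# (e.g. before torch.distributed.init_process_group(backend="nccl")).
selected = configure_nccl_interface()
```

Using `setdefault` rather than a plain assignment means the script never overrides an interface that was chosen deliberately at the pod level.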