Tim Wolff-Piggott

Results: 5 comments by Tim Wolff-Piggott

I realise this issue might be better suited to [optuna-examples](https://github.com/optuna/optuna-examples); happy to move it over if this is the case.

Thanks @Crissman, and apologies for the delay on this. I will open it in examples.

Yup, no problem. I'd actually be happy to contribute when you open the new one. I implemented the proposal and it works well with TorchElastic :)

This was a big source of confusion, for my team at least: we were referring to the Terraform provider docs, and ostensibly successfully configuring retries at the job level,...

This issue appears to be resolved by setting the environment variable `NCCL_SOCKET_IFNAME=eth0` inside each pod. This is slightly confusing because, according to NVIDIA's [documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-ifname), the loopback interface should only be...
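
For anyone who would rather apply the workaround from the training script than from the pod spec, here is a minimal sketch. The helper name `configure_nccl_interface` is hypothetical, and `eth0` is only the interface that worked in the case described above; your pods may expose a different one. It assumes the variable is set before the NCCL process group is initialised.

```python
import os


def configure_nccl_interface(ifname: str = "eth0") -> str:
    """Pin the interface NCCL uses for socket traffic.

    Hypothetical helper: "eth0" is an assumption taken from the comment
    above, not a universal default. Returns the value that ends up set.
    """
    # setdefault leaves any value already injected by the pod spec untouched
    return os.environ.setdefault("NCCL_SOCKET_IFNAME", ifname)


if __name__ == "__main__":
    chosen = configure_nccl_interface()
    print(f"NCCL_SOCKET_IFNAME={chosen}")
    # Initialise the NCCL backend only after this point, e.g. via
    # torch.distributed.init_process_group(backend="nccl")
```

Using `setdefault` means a value set in the pod's `env` section still takes precedence, so the script-level default only applies when nothing else has configured the interface.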