Brian Ko
Hey @akihironitta, I'd love to help with this :) Should I just pick one and create a PR?
I guess I'll work on `pl_bolts.datamodules` then!
@omry, let me know if you think another issue for `torch.distributed` + multirun will be useful here too.
@omry @romesco sorry for the delay, I've transcribed the ImageNet DDP code and added Hydra configurations to it (a lot of it was inspired by the helpful MNIST example).
Thank you for the review @shagunsodhani. To reply to your comments: 1. I'm not exactly sure how multirun and Joblib interact with multi-node multiprocessing. If you have any helpful pointers...
> High level feedback:
> We will probably have multiple examples for distributed data parallel, with different limitations and advantages.
> It's good to group them together and have a...
Right now, this PR is blocked by [this issue](https://github.com/facebookresearch/hydra/issues/1180), so I will be focusing more on #42 and fixing the blocking issue.
@romesco would love your feedback on this!
> Sounds great! What do you think about using the MNIST example as a base? Or did you have something even simpler in mind? If you check [this PR](https://github.com/facebookresearch/hydra/pull/1141) out,...
I see. That's great! Could you help me understand how the script ensures that only one of the master nodes runs `kubeadm init` and the rest run `kubeadm join`? I've...