
Single-node ImageNet DDP implementation

briankosw opened this issue 4 years ago · 7 comments

Implements ImageNet DDP, as mentioned in #33.

Most of the code is the same as the PyTorch ImageNet training script; the major differences are the handling of distributed processes and the configuration.

One can use Hydra's multirun capability to launch the distributed processes, instead of using PyTorch's or Python's multiprocessing APIs:

```
python imagenet.py -m rank=0,1,2,3
```
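For concreteness, here is a minimal sketch (not the PR's actual code) of what each multirun job might do. The `rank` config key, the config location, and the hardcoded master address and world size are assumptions for illustration:

```python
import hydra
import torch
import torch.distributed as dist
from omegaconf import DictConfig


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Each job launched by `-m rank=0,1,2,3` runs this function once,
    # receiving its own value of cfg.rank.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",  # assumed single-node master address/port
        rank=cfg.rank,
        world_size=4,  # matches the four ranks in the example command
    )
    torch.cuda.set_device(cfg.rank)  # one process per GPU on a single node
    # ... build the model, wrap it in DistributedDataParallel, train ...


if __name__ == "__main__":
    main()
```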

briankosw avatar Dec 01 '20 04:12 briankosw

Thank you for the review @shagunsodhani. To reply to your comments:

  1. I'm not exactly sure how multirun and Joblib interact with multi-node multiprocessing. If you have any helpful pointers, I'd appreciate them, but I'll also look into this myself. I'm more familiar with single-node multiprocessing, which is why I've been inclined to write single-node code.
  2. I've only included those so that the example is consistent with the training script in PyTorch. I do agree with you that they're extraneous to this example, so I'll clean them up!

Another thing that comes to mind is your comment on the other PR about how Joblib doesn't guarantee that all the subprocesses are launched simultaneously. Would that have any implications for the multi-node setup?

briankosw avatar Dec 01 '20 10:12 briankosw

> Thank you for the review @shagunsodhani. To reply to your comments:

> 1. I'm not exactly sure how multirun and Joblib interact with multi-node multiprocessing. If you have any helpful pointers, I'd appreciate them, but I'll also look into this myself. I'm more familiar with single-node multiprocessing, which is why I've been inclined to write single-node code.

We launch one process per GPU. These GPUs can live on any node (the extreme case being one GPU per node). We need to handle two things:

  1. How do the nodes discover each other (i.e., how do we know the master address)? This relates to the comment on the previous PR.

  2. I think the second change is smaller: we need to set the device (cfg.gpu) correctly. This is easy to fix once we know how many nodes are participating and how many GPUs each node has.

The way ahead will be to add the example config (for the single-node case) and then see how the master address is set. That will give us some hint about how to get the master address in multi-node training.
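To make the rank/device arithmetic concrete, here is a hedged sketch of the setup described above; `node_rank`, `gpus_per_node`, and the master address/port are hypothetical inputs, not anything from the PR:

```python
import torch
import torch.distributed as dist


def setup_process(node_rank: int, local_rank: int, num_nodes: int,
                  gpus_per_node: int, master_addr: str, master_port: int) -> None:
    # One process per GPU: the world size is the total GPU count, and the
    # global rank is derived from the node rank plus the local GPU index.
    world_size = num_nodes * gpus_per_node
    global_rank = node_rank * gpus_per_node + local_rank
    torch.cuda.set_device(local_rank)  # the cfg.gpu fix mentioned in point 2
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{master_port}",  # discovered master address
        rank=global_rank,
        world_size=world_size,
    )
```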

> 2. I've only included those so that the example is consistent with the training script in PyTorch. I do agree with you that they're extraneous to this example, so I'll clean them up!

> Another thing that comes to mind is your comment on the other PR about how Joblib doesn't guarantee that all the subprocesses are launched simultaneously. Would that have any implications for the multi-node setup?

Yeah, regarding this: IMO a better way is to request n nodes and launch one process per node, each of which spawns 8 workers (one per GPU). We can probably come back to this point later, as it's orthogonal to the other changes we discussed and should be a straightforward change.
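As a sketch of that per-node launching scheme (names and the worker body are illustrative, assuming torch.multiprocessing):

```python
import torch.multiprocessing as mp


def worker(local_rank: int, node_rank: int, gpus_per_node: int) -> None:
    global_rank = node_rank * gpus_per_node + local_rank
    # ... init_process_group(rank=global_rank, ...), set the device, train ...


def launch_node(node_rank: int, gpus_per_node: int = 8) -> None:
    # One launcher process per node; spawn() passes each worker its local
    # rank (0..gpus_per_node-1) as the first argument and blocks until all
    # workers exit.
    mp.spawn(worker, args=(node_rank, gpus_per_node), nprocs=gpus_per_node)
```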

shagunsodhani avatar Dec 01 '20 11:12 shagunsodhani

High-level feedback: We will probably have multiple examples for distributed data parallel, with different limitations and advantages. It's good to group them together and have a top-level page explaining what each one is, to help users navigate.

omry avatar Dec 01 '20 21:12 omry

> High-level feedback: We will probably have multiple examples for distributed data parallel, with different limitations and advantages. It's good to group them together and have a top-level page explaining what each one is, to help users navigate.

I think that's a good idea. I can try to organize and structure the examples so that each one highlights something different, e.g. one example explaining fundamental distributed processing, as demonstrated in the other PR, and another showing ImageNet DDP. I think it'd be better if I open one or two additional issues to separate these implementations. What do you guys think?

In addition, I've given some thought to handling multi-node distributed processing, and I think it's easier if I have separate examples for single-node multi-GPU and multi-node multi-GPU. Thoughts on that as well?

briankosw avatar Dec 03 '20 07:12 briankosw

>> High-level feedback: We will probably have multiple examples for distributed data parallel, with different limitations and advantages. It's good to group them together and have a top-level page explaining what each one is, to help users navigate.

> I think that's a good idea. I can try to organize and structure the examples so that each one highlights something different, e.g. one example explaining fundamental distributed processing, as demonstrated in the other PR, and another showing ImageNet DDP. I think it'd be better if I open one or two additional issues to separate these implementations. What do you guys think?

> In addition, I've given some thought to handling multi-node distributed processing, and I think it's easier if I have separate examples for single-node multi-GPU and multi-node multi-GPU. Thoughts on that as well?

Sounds good. Right now, I'm thinking each of these can be separated into its own issue [listed by priority in my mind]:

Single-node, multi-GPU:

  1. Fundamentals of DDP via Hydra while limiting extraneous code (minimum viable example)
  2. DDP ImageNet example (this issue/PR)

Multi-node, multi-GPU:

  1. Turn example (1) into multi-node?

romesco avatar Dec 04 '20 20:12 romesco

I have something in mind for (3). Will be easier to show once (1) has been pushed.

shagunsodhani avatar Dec 04 '20 23:12 shagunsodhani

Right now, this PR is blocked by this issue, so I will be focusing more on #42 and fixing the blocking issue.

briankosw avatar Dec 05 '20 03:12 briankosw