hydra-torch

Basic distributed processing with Hydra

Open briankosw opened this issue 4 years ago • 8 comments

Implemented a basic script that demonstrates distributed processing with Hydra, as mentioned in #42. The command to run the script is:

python ddp_00.py -m rank=... init_method=...

where rank is the list of ranks, given either as a comma-separated list of integers or as range(start, stop), and init_method is a string that selects one of two initialization methods: TCP initialization or shared file-system initialization. (Environment variable initialization is not related to init_method.)
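To make the two arguments concrete, here is a minimal sketch of how the sweep could map onto per-rank settings. It is not the contents of ddp_00.py: `expand_rank_sweep` and `process_group_kwargs` are hypothetical helpers, the parsing is a simplified stand-in for Hydra's own override grammar, and the `gloo` backend and the example address are assumptions. The returned keys match the keyword arguments of `torch.distributed.init_process_group`, and the `tcp://` and `file://` prefixes are the URL forms PyTorch uses for the two methods.

```python
from typing import Dict, List, Union
from urllib.parse import urlparse


def expand_rank_sweep(value: str) -> List[int]:
    """Expand "0,1,2,3" or "range(0,4)" into a list of ranks.

    Illustration only: Hydra parses sweep overrides itself.
    """
    value = value.strip()
    if value.startswith("range(") and value.endswith(")"):
        bounds = [int(part) for part in value[len("range("):-1].split(",")]
        return list(range(*bounds))
    return [int(part) for part in value.split(",")]


def process_group_kwargs(
    rank: int, world_size: int, init_method: str, backend: str = "gloo"
) -> Dict[str, Union[int, str]]:
    """Build keyword arguments for torch.distributed.init_process_group."""
    scheme = urlparse(init_method).scheme
    # Only TCP and shared file-system initialization are selected this way.
    if scheme not in ("tcp", "file"):
        raise ValueError(f"unsupported init_method: {init_method!r}")
    return {
        "backend": backend,
        "init_method": init_method,
        "rank": rank,
        "world_size": world_size,
    }


# One multirun job per rank; the address below is an assumed example.
ranks = expand_rank_sweep("range(0,4)")
jobs = [process_group_kwargs(r, len(ranks), "tcp://127.0.0.1:29500") for r in ranks]
```

Inside each job, the dictionary would be passed as `torch.distributed.init_process_group(**kwargs)`; that call is left out here so the sketch runs without PyTorch installed.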

I'll add documentation (Markdown) that explains distributed processing in PyTorch and how Hydra can kick off distributed processes, as demonstrated in the script.

briankosw avatar Dec 06 '20 11:12 briankosw

rank can also be a range(start,stop).

omry avatar Dec 06 '20 17:12 omry

I'm not sure the user even needs to provide the rank argument explicitly. For standard use cases, rank has to vary from 0 to num_gpus - 1, so we might just infer it ourselves. I understand why it has to be specified manually right now, but this could be a useful example for callbacks.
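As a rough illustration of inferring the ranks rather than typing them, the sketch below builds the multirun command from a GPU count. It is a plain wrapper, not a Hydra callback: `build_sweep_command` is a hypothetical helper, the example address is an assumption, and in practice `num_gpus` could come from `torch.cuda.device_count()`.

```python
from typing import List


def build_sweep_command(
    num_gpus: int, init_method: str, script: str = "ddp_00.py"
) -> List[str]:
    """Return the argument list for a sweep over ranks 0 .. num_gpus - 1."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "python",
        script,
        "-m",
        f"rank=range(0,{num_gpus})",  # stop is exclusive, so this covers 0..num_gpus-1
        f"init_method={init_method}",
    ]


# The address is an assumed example, not a value from the script.
command = build_sweep_command(4, "tcp://127.0.0.1:29500")
print(" ".join(command))
```

Passing the list to `subprocess.run(command)` avoids having to quote the parentheses for the shell.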

shagunsodhani avatar Dec 07 '20 17:12 shagunsodhani

I agree that it's not the best user experience, but we don't have callbacks right now, and even if we did, we would need to make sure the design actually supports this. (It's not obvious that it does.)

omry avatar Dec 07 '20 17:12 omry

@shagunsodhani What would it take to implement a callback?

romesco avatar Jan 13 '21 21:01 romesco

Oops sorry missed this comment :) @omry will have a better insight about that. ccing @jieru-hu who is working on callbacks.

shagunsodhani avatar Jan 15 '21 19:01 shagunsodhani

Callbacks will likely be pushed back to Hydra 1.2.

omry avatar Jan 15 '21 22:01 omry

@briankosw I'd love to help you push this forward. How are you doing? Bogged down in other work - because I know that feeling haha! Let me know how I can help.

romesco avatar Apr 22 '21 18:04 romesco

Callbacks will likely be pushed back to Hydra 1.2.

We will have callbacks in 1.1, but I am no longer sure we should use them here. They are not documented yet, but will be before 1.1 is released (Example app).

In the meantime, it should now be possible to play with them on master.

omry avatar Apr 22 '21 19:04 omry