hydra-torch Basic distributed processing with Hydra

Implemented a basic script that demonstrates distributed processing with Hydra, as mentioned in #42. The command to run the script is:

python ddp_00.py -m rank=... init_method=...

where rank is a list of ranks (either a comma-separated list of integers or range(start, stop)) and init_method is a string that selects one of two initialization methods: TCP initialization or shared file-system initialization. Environment variable initialization is not controlled by init_method.
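To make the two arguments concrete, here is a minimal, standard-library-only sketch. It mimics how a rank sweep value expands into individual ranks and shows the URL forms that torch.distributed accepts for TCP and shared file-system initialization. The helper names (expand_rank_sweep, tcp_init_method, file_init_method) are hypothetical, this is not Hydra's own override parser, and whether ddp_00.py expects init_method in exactly this URL form is an assumption.

```python
import re
from typing import List


def expand_rank_sweep(spec: str) -> List[int]:
    """Expand a sweep value such as "0,1,2,3" or "range(0,4)" into a list of ranks.

    Illustration only: Hydra performs this expansion itself in multirun (-m) mode.
    """
    spec = spec.strip()
    match = re.fullmatch(r"range\(\s*(\d+)\s*,\s*(\d+)\s*\)", spec)
    if match:
        start, stop = int(match.group(1)), int(match.group(2))
        return list(range(start, stop))  # stop is exclusive, as in Python's range
    return [int(part) for part in spec.split(",")]


def tcp_init_method(host: str, port: int) -> str:
    """Build a TCP init_method URL of the form tcp://host:port."""
    return f"tcp://{host}:{port}"


def file_init_method(path: str) -> str:
    """Build a shared file-system init_method URL from an absolute path."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return f"file://{path}"


if __name__ == "__main__":
    # Each rank would correspond to one job launched by the sweep.
    for rank in expand_rank_sweep("range(0,4)"):
        print(rank, tcp_init_method("127.0.0.1", 23456))
```

With a sweep over rank, Hydra launches one job per value, so each job sees a single integer rank alongside the shared init_method.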

I'll add documentation (Markdown) that explains distributed processing in PyTorch as well as how Hydra can kick off distributed processes, as demonstrated in the script.

Dec 06 '20 11:12 briankosw

rank can also be a range(start,stop).

Dec 06 '20 17:12 omry

Not sure if the user even needs to provide the rank argument explicitly. rank has to vary from 0 to num_gpus - 1 (for standard use cases), so we might just infer it ourselves. I understand why it has to be specified manually right now, but this could be a useful example for callbacks.
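As a rough illustration of inferring the ranks rather than asking the user for them, the sketch below builds the multirun command from a GPU count. The function name build_multirun_command is hypothetical, the script name is taken from the command in the PR description, and init_method is passed through as an opaque string because its exact format is not specified in this thread.

```python
from typing import List


def build_multirun_command(
    num_gpus: int, init_method: str, script: str = "ddp_00.py"
) -> List[str]:
    """Build the multirun command with rank inferred as 0..num_gpus-1."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "python",
        script,
        "-m",
        f"rank=range(0,{num_gpus})",  # range stop is exclusive
        f"init_method={init_method}",
    ]


if __name__ == "__main__":
    # Prints the command line for 4 GPUs; "..." stands in for the real value.
    print(" ".join(build_multirun_command(4, "...")))
```

This only moves the inference into a wrapper outside Hydra; doing it inside Hydra is what the callback discussion is about.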

Dec 07 '20 17:12 shagunsodhani

I agree that it's not the best user experience, but we don't have callbacks right now, and even if we did, we would need to make sure the design actually supports this. (It's not obvious that this is the case.)

Dec 07 '20 17:12 omry

@shagunsodhani What would it take to implement a callback?

Jan 13 '21 21:01 romesco

Oops, sorry, I missed this comment :) @omry will have better insight into that. Ccing @jieru-hu, who is working on callbacks.

Jan 15 '21 19:01 shagunsodhani

Callbacks will likely be pushed back to Hydra 1.2.

Jan 15 '21 22:01 omry

@briankosw I'd love to help you push this forward. How are you doing? Bogged down in other work - because I know that feeling haha! Let me know how I can help.

Apr 22 '21 18:04 romesco

We will have callbacks in 1.1. They are not documented yet, but will be before 1.1 is released (Example app).

I am no longer sure we should leverage callbacks here, but it should now be possible to play with them on master.

Apr 22 '21 19:04 omry
