
Not enough information on distributed training

Open mebristo opened this issue 2 years ago • 1 comments

With SDK v1 there is useful documentation on how to distribute training, e.g. creating a PyTorchConfiguration so that AML sets the environment variables MASTER_ADDR, MASTER_PORT, WORLD_SIZE, NODE_RANK, RANK, and LOCAL_RANK. With a v2 Command Job these environment variables don't seem to get set by AML, and there's no practical information on how to do distributed training.
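
For anyone trying to confirm this behaviour, here is a minimal sketch of a check that could run at the top of a training script to report which of the variables listed above are actually present in the process environment. The helper names `read_dist_env` and `missing_dist_vars` are hypothetical, made up for this illustration; the script uses only the standard library and assumes nothing about how AML launches the job.

```python
import os

# The variables named in this issue (set by SDK v1 for PyTorch jobs).
DIST_VARS = (
    "MASTER_ADDR",
    "MASTER_PORT",
    "WORLD_SIZE",
    "NODE_RANK",
    "RANK",
    "LOCAL_RANK",
)


def read_dist_env(environ=None):
    """Map each variable name to its value, or None when it is unset."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in DIST_VARS}


def missing_dist_vars(environ=None):
    """List the variables that are not set, in the order of DIST_VARS."""
    return [name for name, value in read_dist_env(environ).items() if value is None]


if __name__ == "__main__":
    # Print one line per variable so the job log shows what the launcher provided.
    for name, value in read_dist_env().items():
        print(f"{name}={value if value is not None else '<unset>'}")
```

Running this once per process in the job and comparing the logs would show whether the launcher populated the variables on every node.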



mebristo avatar Dec 16 '22 09:12 mebristo

@mebristo

Thanks for your feedback! We will investigate and update as appropriate.

@mebristo I have assigned this to content author @rtanase to check and share his valuable insights on this.

Naveenommi-MSFT avatar Dec 17 '22 06:12 Naveenommi-MSFT

Several months ago we added a dedicated page for SDK v2: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu?view=azureml-api-2. There are also several tutorials in azureml-examples.

rtanase avatar May 15 '23 15:05 rtanase

As indicated above, more information has been added. #please-close

sdgilley avatar Sep 14 '23 16:09 sdgilley