azure-docs
Not enough information on distributed training
With SDK v1 there is useful documentation on how to distribute training, e.g. creating a PyTorchConfiguration so that AML sets the environment variables MASTER_ADDR, MASTER_PORT, WORLD_SIZE, NODE_RANK, RANK, and LOCAL_RANK. With a v2 command job these environment variables don't seem to get set by AML, and there is no practical information on how to do distributed training.
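For context on what a training script expects from those variables: when AML does inject them, a script typically reads them before calling `torch.distributed.init_process_group`. A minimal stdlib-only sketch of that parsing step (the single-process fallback defaults here are illustrative assumptions, not values AML sets):

```python
import os


def ddp_env(environ=None):
    """Read the rendezvous variables AML injects for a distributed PyTorch job.

    The fallback defaults model a single-process local run; they are
    illustrative only, not behavior guaranteed by AML.
    """
    env = os.environ if environ is None else environ
    return {
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "node_rank": int(env.get("NODE_RANK", "0")),
    }
```

The resulting rank and world size would then be passed to `torch.distributed.init_process_group` (or picked up automatically when using the `env://` init method).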
Document Details
⚠ Do not edit this section. It is required for learn.microsoft.com ➟ GitHub issue linking.
- ID: fd613a35-8267-03f0-a2ac-d7f7b6ecb39c
- Version Independent ID: 1ab82617-c8db-ce84-d0e1-5a508fd47489
- Content: What is distributed training? - Azure Machine Learning
- Content Source: articles/machine-learning/concept-distributed-training.md
- Service: machine-learning
- Sub-service: core
- GitHub Login: @rtanase
- Microsoft Alias: ratanase
@mebristo
Thanks for your feedback! We will investigate and update as appropriate.
@mebristo I have assigned this to content author @rtanase to check and share his insights on this.
Several months ago we added a dedicated page for SDK v2: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu?view=azureml-api-2 and there are also several tutorials in the azureml-examples repository.
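For anyone landing on this issue later: the practical takeaway from that page is that a v2 command job only gets the rendezvous environment variables when the job spec includes a `distribution` section. A sketch of such a YAML job spec, where the compute target and environment names are placeholders, not real resources:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
command: python train.py
# placeholder names; substitute your own registered environment and compute
environment: azureml:my-pytorch-env@latest
compute: azureml:gpu-cluster
resources:
  instance_count: 2              # number of nodes
distribution:
  type: pytorch                  # tells AML to set MASTER_ADDR, MASTER_PORT,
  process_count_per_instance: 4  # RANK, LOCAL_RANK, WORLD_SIZE, NODE_RANK
```

The equivalent `distribution={"type": "pytorch", "process_count_per_instance": 4}` argument exists on the Python SDK v2 `command()` builder as well.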
As indicated above, more information has been added. #please-close