azure-docs
Not enough information on distributed training
With SDK v1 there is useful documentation on how to distribute training, e.g. creating a PyTorchConfiguration so that AML sets the environment variables MASTER_ADDR, MASTER_PORT, WORLD_SIZE, NODE_RANK, RANK, and LOCAL_RANK. With a v2 command job these environment variables don't seem to get set by AML, and there is no practical information on how to do distributed training.
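For context on what a training script expects from those variables: when AML does inject them, a script typically reads them before calling `torch.distributed.init_process_group`. A minimal stdlib-only sketch of that parsing step (the single-process fallback defaults here are illustrative assumptions, not values AML sets):

```python
import os


def ddp_env(environ=None):
    """Read the rendezvous variables AML injects for a distributed PyTorch job.

    The fallback defaults model a single-process local run; they are
    illustrative only, not behavior guaranteed by AML.
    """
    env = os.environ if environ is None else environ
    return {
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "node_rank": int(env.get("NODE_RANK", "0")),
    }
```

The resulting rank and world size would then be passed to `torch.distributed.init_process_group` (or picked up automatically when using the `env://` init method).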
Document Details
⚠ Do not edit this section. It is required for learn.microsoft.com ➟ GitHub issue linking.
- ID: fd613a35-8267-03f0-a2ac-d7f7b6ecb39c
- Version Independent ID: 1ab82617-c8db-ce84-d0e1-5a508fd47489
- Content: What is distributed training? - Azure Machine Learning
- Content Source: articles/machine-learning/concept-distributed-training.md
- Service: machine-learning
- Sub-service: core
- GitHub Login: @rtanase
- Microsoft Alias: ratanase
@mebristo
Thanks for your feedback! We will investigate and update as appropriate.
@mebristo I have assigned this to content author @rtanase to check and share his insights on this.
Several months ago we added a dedicated page for SDK v2: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu?view=azureml-api-2 and there are also several tutorials in the azureml-examples repository.
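For anyone landing on this issue later: the practical takeaway from that page is that a v2 command job only gets the rendezvous environment variables when the job spec includes a `distribution` section. A sketch of such a YAML job spec, where the compute target and environment names are placeholders, not real resources:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
command: python train.py
# placeholder names; substitute your own registered environment and compute
environment: azureml:my-pytorch-env@latest
compute: azureml:gpu-cluster
resources:
  instance_count: 2              # number of nodes
distribution:
  type: pytorch                  # tells AML to set MASTER_ADDR, MASTER_PORT,
  process_count_per_instance: 4  # RANK, LOCAL_RANK, WORLD_SIZE, NODE_RANK
```

The equivalent `distribution={"type": "pytorch", "process_count_per_instance": 4}` argument exists on the Python SDK v2 `command()` builder as well.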
As indicated above, more information has been added. #please-close