Update multi-node.qmd
Title: Multi-Node Distributed Finetuning with Axolotl and DeepSpeed
Description: This PR introduces a comprehensive guide for setting up a distributed finetuning environment using Axolotl and Accelerate. The guide covers the following steps:
- Configuring SSH for passwordless access across multiple nodes
- Generating and exchanging public keys for secure communication (sketched below)
- Configuring Axolotl with shared settings and host files
- Configuring Accelerate for multi-node training with DeepSpeed (config sketch below)
- Running distributed finetuning using Accelerate (launch sketch below)
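
For the key generation/exchange step, here is a minimal sketch of what that setup could look like; `node-1`, `node-2`, and the `ubuntu` login are placeholders, not names from the guide:

```bash
# On the main node: generate a key pair without a passphrase.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""

# Push the public key to every worker node so SSH is passwordless.
for host in node-1 node-2; do
  ssh-copy-id -i ~/.ssh/id_ed25519.pub "ubuntu@$host"
done

# Sanity check: this should print the remote hostname without prompting.
ssh ubuntu@node-1 hostname
```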
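For the host file and the Accelerate/DeepSpeed configuration, a sketch under assumed values (2 nodes with 8 GPUs each, head node at 10.0.0.1; paths, port, and `zero_stage` are illustrative):

```bash
# DeepSpeed hostfile: one node per line with its GPU slot count.
cat > ~/hostfile <<'EOF'
node-1 slots=8
node-2 slots=8
EOF

# Accelerate config for multi-node DeepSpeed (values are illustrative).
mkdir -p ~/.cache/huggingface/accelerate
cat > ~/.cache/huggingface/accelerate/default_config.yaml <<'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_hostfile: /home/ubuntu/hostfile
  deepspeed_multinode_launcher: pdsh
  zero_stage: 2
machine_rank: 0
main_process_ip: 10.0.0.1
main_process_port: 29500
mixed_precision: bf16
num_machines: 2
num_processes: 16
EOF
```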
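And for the launch step; `config.yml` stands in for whichever Axolotl config the guide uses:

```bash
# With the pdsh launcher, run this once on the main node and DeepSpeed
# spawns workers on the other hosts over SSH. With the 'standard'
# launcher, run it on every node with machine_rank set per node.
accelerate launch -m axolotl.cli.train config.yml
```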
@muellerzr seem right?
This seems to assume that you have access to each node before your training starts. However, a lot of cloud systems like AzureML, SLURM, and SageMaker do not let you follow guides like this, because the guide assumes you can modify these variables.
@shahdivax @winglian I would suggest a more automated setup if you want this to work well for users.
This assumes that users are using EC2 instances from AWS.
(I forgot to add that 😓)
Edit: Added it to the heading