
Update multi-node.qmd

Open shahdivax opened this issue 1 year ago • 3 comments

Title: Distributed Finetuning For Multi-Node with Axolotl and Deepspeed

Description: This PR introduces a comprehensive guide for setting up a distributed finetuning environment using Axolotl and Accelerate. The guide covers the following steps:

  1. Configuring SSH for passwordless access across multiple nodes
  2. Generating and exchanging public keys for secure communication
  3. Configuring Axolotl with shared settings and host files
  4. Configuring Accelerate for multi-node training with Deepspeed
  5. Running distributed finetuning using Accelerate
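Step 4 above typically produces an `accelerate` config file on each node. A minimal sketch of what that might look like for two machines (the IP, port, and process counts are placeholders; the field names follow the `default_config.yaml` that `accelerate config` generates):

```yaml
# Illustrative multi-node accelerate config (values are placeholders).
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 1
machine_rank: 0            # 0 on the main node, 1..N-1 on the workers
main_process_ip: 10.0.0.1  # reachable IP of the main node
main_process_port: 29500
num_machines: 2
num_processes: 16          # total GPUs across all machines
mixed_precision: bf16
```

Every node uses the same file except for `machine_rank`, which identifies that node's position in the cluster.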

shahdivax avatar Jun 07 '24 06:06 shahdivax

@muellerzr seem right?

winglian avatar Jun 07 '24 22:06 winglian

This seems to assume that you have access to each node before your training starts. However, many cloud systems like AzureML, SLURM, and SageMaker do not let you follow guides like this, because the guide assumes you can modify these variables.

@shahdivax @winglian I would suggest a bit more of an automatic setup if you want this to work well for users.
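One way to make the setup more automatic on a scheduler such as SLURM is to derive the launch arguments from environment variables the scheduler already sets on each node, rather than from a hand-edited host file. A hypothetical sketch (the `accelerate launch` flags are real CLI options; the fallback values are invented so the snippet runs standalone):

```shell
# Sketch: build the accelerate launch command from SLURM-provided variables.
# SLURM_NNODES / SLURM_NODEID are set by the scheduler on each node;
# the := fallbacks below exist only so this runs outside a cluster.
: "${SLURM_NNODES:=2}"
: "${SLURM_NODEID:=0}"
: "${MASTER_ADDR:=10.0.0.1}"   # head-node address, resolved by the job script

CMD="accelerate launch --num_machines $SLURM_NNODES --machine_rank $SLURM_NODEID --main_process_ip $MASTER_ADDR --main_process_port 29500 train.py"
echo "$CMD"
```

Run on every node by the scheduler, each node computes its own `--machine_rank` without anyone SSHing between hosts.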

casper-hansen avatar Jun 11 '24 11:06 casper-hansen

> This seems to assume that you have access to each node before your training starts. However, many cloud systems like AzureML, SLURM, and SageMaker do not let you follow guides like this, because the guide assumes you can modify these variables.
>
> @shahdivax @winglian I would suggest a bit more of an automatic setup if you want this to work well for users.

This assumes that users are using EC2 instances from AWS.

(I forgot to add that 😓)

Edit: Added in the heading

shahdivax avatar Jun 11 '24 11:06 shahdivax