
Update multi-node.qmd

Open shahdivax opened this issue 1 year ago • 3 comments

Title: Distributed Finetuning For Multi-Node with Axolotl and Deepspeed

Description: This PR introduces a comprehensive guide for setting up a distributed finetuning environment using Axolotl and Accelerate. The guide covers the following steps:

  1. Configuring SSH for passwordless access across multiple nodes
  2. Generating and exchanging public keys for secure communication
  3. Configuring Axolotl with shared settings and host files
  4. Configuring Accelerate for multi-node training with Deepspeed
  5. Running distributed finetuning using Accelerate
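Step 4 above typically produces an `accelerate` config file on each node. A minimal sketch of what that might look like for two machines (the IP, port, and process counts are placeholders; the field names follow the `default_config.yaml` that `accelerate config` generates):

```yaml
# Illustrative multi-node accelerate config (values are placeholders).
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 1
machine_rank: 0            # 0 on the main node, 1..N-1 on the workers
main_process_ip: 10.0.0.1  # reachable IP of the main node
main_process_port: 29500
num_machines: 2
num_processes: 16          # total GPUs across all machines
mixed_precision: bf16
```

Every node uses the same file except for `machine_rank`, which identifies that node's position in the cluster.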

shahdivax avatar Jun 07 '24 06:06 shahdivax

@muellerzr seem right?

winglian avatar Jun 07 '24 22:06 winglian

This seems to assume that you have access to each node before your training starts. However, many cloud systems like AzureML, SLURM, and SageMaker do not let you follow guides like this, because the guide assumes you can modify these variables.

@shahdivax @winglian I would suggest a bit more of an automatic setup if you want this to work well for users.
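One way to make the setup more automatic on a scheduler such as SLURM is to derive the launch arguments from environment variables the scheduler already sets on each node, rather than from a hand-edited host file. A hypothetical sketch (the `accelerate launch` flags are real CLI options; the fallback values are invented so the snippet runs standalone):

```shell
# Sketch: build the accelerate launch command from SLURM-provided variables.
# SLURM_NNODES / SLURM_NODEID are set by the scheduler on each node;
# the := fallbacks below exist only so this runs outside a cluster.
: "${SLURM_NNODES:=2}"
: "${SLURM_NODEID:=0}"
: "${MASTER_ADDR:=10.0.0.1}"   # head-node address, resolved by the job script

CMD="accelerate launch --num_machines $SLURM_NNODES --machine_rank $SLURM_NODEID --main_process_ip $MASTER_ADDR --main_process_port 29500 train.py"
echo "$CMD"
```

Run on every node by the scheduler, each node computes its own `--machine_rank` without anyone SSHing between hosts.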

casper-hansen avatar Jun 11 '24 11:06 casper-hansen

> This seems to assume that you have access to each node before your training starts. However, many cloud systems like AzureML, SLURM, and SageMaker do not let you follow guides like this, because the guide assumes you can modify these variables.
>
> @shahdivax @winglian I would suggest a bit more of an automatic setup if you want this to work well for users.

This assumes that users are using EC2 instances from AWS.

(I forgot to add that 😓)

Edit: Added in the heading

shahdivax avatar Jun 11 '24 11:06 shahdivax