
[FEATURE REQUEST] Multi node training support

bronzafa opened this issue 1 year ago · 1 comment

Feature Request

Ability to train LLMs such as Llama 2 70B and Falcon 180B on multi-node configurations using Slurm or Kubernetes.
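
For context, a Slurm multi-node run usually boils down to launching one process per GPU and initializing a `torch.distributed` process group from environment variables. Below is a minimal sketch of what the Python side of such a launch typically looks like, assuming the processes are started with `torchrun` (or an `srun` wrapper) that exports the standard `RANK`/`LOCAL_RANK`/`WORLD_SIZE`/`MASTER_ADDR` variables; none of this is autotrain's actual API, just an illustration of the mechanism being requested:

```python
import os

import torch
import torch.distributed as dist


def init_distributed():
    # torchrun (or an srun wrapper) is expected to export these variables;
    # on raw Slurm they can be derived from SLURM_PROCID / SLURM_NTASKS etc.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # NCCL is the usual backend for multi-node GPU training; the env://
    # rendezvous also needs MASTER_ADDR / MASTER_PORT in the environment.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank


if __name__ == "__main__":
    rank, world_size, local_rank = init_distributed()
    if rank == 0:
        print(f"initialized {world_size} processes across all nodes")
    dist.destroy_process_group()
```

On each node this would be launched with something like `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 train.py` (node count, port, and script name are illustrative).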

Motivation

With a maximum of 8x H100 GPUs per node, larger models need INT4 quantization to fit, and the resulting loss of precision can be a concern. The ability to scale out would also help train on bigger datasets faster.
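
For reference, the single-node workaround alluded to above typically means loading the model in 4-bit via bitsandbytes. A minimal sketch using the transformers API (the model name is just an example, and this is the workaround whose precision trade-off motivates the request, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (NF4) quantization config: lets a 70B model fit on a single
# 8-GPU node at the cost of reduced precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",  # shards layers across the node's GPUs
)
```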

Additional Context

No response

bronzafa · Feb 08 '24 20:02

This issue is stale because it has been open for 15 days with no activity.

github-actions[bot] · Feb 29 '24 15:02

This issue was closed because it has been inactive for 2 days since being marked as stale.

github-actions[bot] · Mar 11 '24 15:03