
[FEATURE REQUEST] Multi node training support

bronzafa opened this issue 1 year ago · 1 comment

Feature Request

Ability to train LLMs such as Llama 2 70B and Falcon 180B on multi-node configurations using Slurm or Kubernetes.
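
For context, a Slurm multi-node run usually boils down to launching one process per GPU and initializing a `torch.distributed` process group from environment variables. Below is a minimal sketch of what the Python side of such a launch typically looks like, assuming the processes are started with `torchrun` (or an `srun` wrapper) that exports the standard `RANK`/`LOCAL_RANK`/`WORLD_SIZE`/`MASTER_ADDR` variables; none of this is autotrain's actual API, just an illustration of the mechanism being requested:

```python
import os

import torch
import torch.distributed as dist


def init_distributed():
    # torchrun (or an srun wrapper) is expected to export these variables;
    # on raw Slurm they can be derived from SLURM_PROCID / SLURM_NTASKS etc.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # NCCL is the usual backend for multi-node GPU training; the env://
    # rendezvous also needs MASTER_ADDR / MASTER_PORT in the environment.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank


if __name__ == "__main__":
    rank, world_size, local_rank = init_distributed()
    if rank == 0:
        print(f"initialized {world_size} processes across all nodes")
    dist.destroy_process_group()
```

On each node this would be launched with something like `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 train.py` (node count, port, and script name are illustrative).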

Motivation

With a maximum of 8x H100 GPUs per node, larger models need INT4 quantization to fit, and the resulting loss of precision can be a concern. The ability to scale out would also help train on bigger datasets faster.
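
For reference, the single-node workaround alluded to above typically means loading the model in 4-bit via bitsandbytes. A minimal sketch using the transformers API (the model name is just an example, and this is the workaround whose precision trade-off motivates the request, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (NF4) quantization config: lets a 70B model fit on a single
# 8-GPU node at the cost of reduced precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",  # shards layers across the node's GPUs
)
```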

Additional Context

No response

bronzafa · Feb 08 '24 20:02

This issue is stale because it has been open for 15 days with no activity.

github-actions[bot] · Feb 29 '24 15:02

This issue was closed because it has been inactive for 2 days since being marked as stale.

github-actions[bot] · Mar 11 '24 15:03