autotrain-advanced
[FEATURE REQUEST] Multi node training support
Feature Request
Ability to train LLMs like Llama 2 70B and Falcon 180B on a multi-node configuration using Slurm or Kubernetes.
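For illustration, here is a minimal sketch of how a multi-node job could bootstrap torch.distributed from the environment variables Slurm sets for each task. This is a hypothetical helper, not an existing autotrain-advanced API; it assumes one process per GPU and that MASTER_ADDR/MASTER_PORT are exported by the sbatch script:

```python
import os

import torch
import torch.distributed as dist


def init_distributed_from_slurm():
    # Hypothetical helper: map Slurm's per-task environment variables
    # onto the global rank / world size that torch.distributed expects
    # (assumes the job launches one task per GPU).
    rank = int(os.environ["SLURM_PROCID"])        # global rank across all nodes
    world_size = int(os.environ["SLURM_NTASKS"])  # total processes in the job
    local_rank = int(os.environ["SLURM_LOCALID"]) # rank within this node

    # MASTER_ADDR / MASTER_PORT must be exported by the sbatch script,
    # e.g. the first hostname from `scontrol show hostnames`.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank
```

A Kubernetes launch would need the equivalent rendezvous information supplied some other way, for example via torchrun's rendezvous flags.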
Motivation
With a maximum of 8x H100 GPUs per node, larger models must be quantized to INT4 to fit, and the resulting loss of precision can be a concern. The ability to scale out across nodes would also help train on bigger datasets faster.
Additional Context
No response
This issue is stale because it has been open for 15 days with no activity.
This issue was closed because it has been inactive for 2 days since being marked as stale.