
Incorrect Argument Default for DeepSpeed Multi-node Training

Open jomayeri opened this issue 8 months ago • 1 comments

System Info

pip install accelerate.

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [ ] My own task or dataset (give details below)

Reproduction

Run accelerate for multi-node training.

Expected behavior

Accelerate sets the default DeepSpeed hostfile to None, which overrides DeepSpeed's own default of /job/hostfile. Overriding this default causes problems for users attempting multi-node training. Please change the default to match DeepSpeed's default.
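As a rough illustration of the requested behavior, the fallback could look something like the sketch below. This is not Accelerate's actual code: the helper name `resolve_hostfile` and the constant `DEEPSPEED_DEFAULT_HOSTFILE` are hypothetical, and the only value taken from this report is the /job/hostfile path.

```python
from typing import Optional

# DeepSpeed's default hostfile path, as stated in this issue.
DEEPSPEED_DEFAULT_HOSTFILE = "/job/hostfile"


def resolve_hostfile(hostfile: Optional[str] = None) -> str:
    """Hypothetical helper: use the caller's hostfile if one was given,
    otherwise fall back to DeepSpeed's default instead of passing None on."""
    return hostfile if hostfile is not None else DEEPSPEED_DEFAULT_HOSTFILE


# No hostfile supplied: the DeepSpeed default is kept.
print(resolve_hostfile())
# Explicit hostfile supplied: the caller's value wins.
print(resolve_hostfile("/path/to/my_hostfile"))
```

With a fallback like this, an explicit hostfile still takes precedence, and an unset value no longer replaces DeepSpeed's default.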

jomayeri, Jun 18 '24 19:06