Unable to launch DeepSpeed multinode training with a heterogeneous mix of # devices per node.
System Info
- `Accelerate` version: 0.30.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/crow/venvs/bismuth/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 124.95 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` config passed:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 5
- machine_rank: 0
- num_machines: 2
- main_process_ip: localhost
- main_process_port: 3141
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- deepspeed_config: {'deepspeed_hostfile': '/home/crow/SoftwareProjects/rwkv-raven-lora-instruct/hostfile', 'deepspeed_multinode_launcher': 'pdsh', 'gradient_accumulation_steps': 8, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
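For reference, the hostfile pointed to by `deepspeed_hostfile` above uses DeepSpeed's standard `hostname slots=N` format; the hostnames below are placeholders for my two nodes (4 GPUs and 1 GPU):

```
node-4gpu slots=4
node-1gpu slots=1
```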
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
My snippet of code for loading model resources:
```python
# Relevant imports (the snippet runs inside my trainer class, hence the `self.` attributes):
import os

import torch
from accelerate import Accelerator, init_empty_weights
from accelerate.utils import ProjectConfiguration
from transformers import AutoModelForCausalLM

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
print(f"I am rank: {rank}!")

acc_kwargs = {
    "gradient_accumulation_steps": self.grad_accum,
    "project_dir": self.output_dir,
    "project_config": ProjectConfiguration(
        project_dir=self.output_dir,
        automatic_checkpoint_naming=True,
    ),
    "mixed_precision": "bf16" if self.use_bfloat16 else "fp16",
}
acc_kwargs = {**acc_kwargs, **self.accelerator_kwargs}
self.accelerator = Accelerator(**acc_kwargs)

modified_load_kwargs = self.model_load_kwargs
if self.use_fsdp or self.use_deep_speed:
    if self.use_fsdp:
        modified_load_kwargs["low_cpu_mem_usage"] = True
    del modified_load_kwargs["device_map"]
    modified_load_kwargs["torch_dtype"] = torch.bfloat16 if self.use_bfloat16 else torch.float16

# Under FSDP/DeepSpeed, only rank 0 materializes the weights; other ranks load on the meta device.
if (self.use_fsdp or self.use_deep_speed) and rank != 0:
    with init_empty_weights():
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name_or_path,
            quantization_config=self.bnb_config,
            **modified_load_kwargs,
        )
else:
    self.model = AutoModelForCausalLM.from_pretrained(
        self.model_name_or_path,
        quantization_config=self.bnb_config,
        **modified_load_kwargs,
    )
```
Launched with `accelerate launch --config_file deep.cfg <SCRIPT>`
Reproduction
Have two connected nodes with differing numbers of GPUs, preferably even and odd (in my case one node has 4 and one has 1). Launch DeepSpeed with either the standard or pdsh launcher, and watch the device ordinals come out wrong on one of the machines when it attempts to load model resources.
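To make the failure mode concrete, here is a minimal standalone sketch (not the launcher's actual code, just my reading of the symptom) of what happens when the 5 requested processes are split evenly across 2 machines instead of following the hostfile slots:

```python
import math

world_size = 5                  # 4 GPUs on node 0 + 1 GPU on node 1
num_machines = 2
gpus_per_node = {0: 4, 1: 1}    # what the hostfile actually specifies

# An even split across machines (roughly what the launcher appears to assume):
procs_per_node = math.ceil(world_size / num_machines)   # -> 3

for machine_rank, num_gpus in gpus_per_node.items():
    for local_rank in range(procs_per_node):
        if local_rank >= num_gpus:
            print(f"machine {machine_rank}: local_rank {local_rank} "
                  f"maps to no CUDA device (only {num_gpus} present)")
```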
Expected behavior
In both cases, standard and pdsh (but pdsh in particular, since it takes a hostfile with the number of slots per host), I would expect Accelerate to launch as many processes on each host as that host has available devices, rather than just attempting to split the processes evenly across hosts.
For further context, my full cluster will have 9 devices: two nodes with 4 and one node with 1. To answer why the weird setup: it just seems like such a waste to let my RTX A6000 sit out of all the fun my 3090s are having, plus the additional system RAM and VRAM it brings. 😄
I'm not entirely sure where it's getting these from, but it seems like it's ignoring the config file entirely.
It looks like the issue stems from this part of the accelerate launcher: https://github.com/huggingface/accelerate/blob/4ad4d28c49a9818e985ea12d66a89fe73fe73c87/src/accelerate/utils/launch.py#L309
It's not considering each machine's devices or the slots specified in the DeepSpeed hostfile or anything like that.
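For illustration, here is a minimal sketch of what reading the hostfile slots could look like; `parse_hostfile_slots` is a hypothetical helper, not an existing Accelerate function:

```python
import re

def parse_hostfile_slots(path):
    """Hypothetical helper: map each host in a DeepSpeed hostfile to its slot count."""
    slots = {}
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and blank lines
            if not line:
                continue
            match = re.match(r"(\S+)\s+slots=(\d+)", line)
            if match:
                slots[match.group(1)] = int(match.group(2))
    return slots

# e.g. {"node-4gpu": 4, "node-1gpu": 1} -> launch 4 processes on one host and 1 on the other,
# instead of assuming the same number of processes on every machine.
```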
Correct. We don't support that currently. At least on the torch side, I believe they require the same number of devices per node.
Does DeepSpeed not?
I had looked into it, and I don't believe an equal number of devices per node is strictly required by either torch or DeepSpeed. Nothing in NCCL or MPI / DeepSpeed's launcher would make me immediately think that.
With that said, from my end I'm more than happy to implement changes here to support this and contribute them back; I just want to make sure that's desirable from your end.
@muellerzr It works fine if I just comment out that line above in a fork; I'd like your opinion on how to proceed here.
Attached is a screenshot showing the training session (a custom loop using Accelerate) in the top right.
The one-GPU machine is printing nvidia-smi output on the left, and the 4-GPU box is on the bottom right.
For more context, DeepSpeed can handle arbitrary arrangements of GPUs during training; it will respect whatever number of slots is set in the provided hostfile.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.