Unable to launch DeepSpeed multinode training with a heterogeneous mix of # devices per node.
System Info
- `Accelerate` version: 0.30.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/crow/venvs/bismuth/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 124.95 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` config passed:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 5
- machine_rank: 0
- num_machines: 2
- main_process_ip: localhost
- main_process_port: 3141
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- deepspeed_config: {'deepspeed_hostfile': '/home/crow/SoftwareProjects/rwkv-raven-lora-instruct/hostfile', 'deepspeed_multinode_launcher': 'pdsh', 'gradient_accumulation_steps': 8, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
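For reference, the hostfile pointed to by `deepspeed_hostfile` above uses DeepSpeed's standard `hostname slots=N` format; the hostnames below are placeholders for my two nodes (4 GPUs and 1 GPU):

```
node-4gpu slots=4
node-1gpu slots=1
```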
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
My snippet of code for loading model resources:
```python
# Relevant imports (the snippet runs inside my trainer class, hence the `self.` attributes):
import os

import torch
from accelerate import Accelerator, init_empty_weights
from accelerate.utils import ProjectConfiguration
from transformers import AutoModelForCausalLM

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
print(f"I am rank: {rank}!")

acc_kwargs = {
    "gradient_accumulation_steps": self.grad_accum,
    "project_dir": self.output_dir,
    "project_config": ProjectConfiguration(
        project_dir=self.output_dir,
        automatic_checkpoint_naming=True,
    ),
    "mixed_precision": "bf16" if self.use_bfloat16 else "fp16",
}
acc_kwargs = {**acc_kwargs, **self.accelerator_kwargs}
self.accelerator = Accelerator(**acc_kwargs)

modified_load_kwargs = self.model_load_kwargs
if self.use_fsdp or self.use_deep_speed:
    if self.use_fsdp:
        modified_load_kwargs["low_cpu_mem_usage"] = True
    del modified_load_kwargs["device_map"]
    modified_load_kwargs["torch_dtype"] = torch.bfloat16 if self.use_bfloat16 else torch.float16

# Under FSDP/DeepSpeed, only rank 0 materializes the weights; other ranks load on the meta device.
if (self.use_fsdp or self.use_deep_speed) and rank != 0:
    with init_empty_weights():
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name_or_path,
            quantization_config=self.bnb_config,
            **modified_load_kwargs,
        )
else:
    self.model = AutoModelForCausalLM.from_pretrained(
        self.model_name_or_path,
        quantization_config=self.bnb_config,
        **modified_load_kwargs,
    )
```
Launched with `accelerate launch --config_file deep.cfg <SCRIPT>`
Reproduction
Have two connected nodes with differing numbers of GPUs, preferably even and odd (in my case one node has 4 and one has 1). Launch DeepSpeed with either the standard or pdsh launcher, and watch the device ordinals come out wrong on one of the machines when it attempts to load model resources.
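To make the failure mode concrete, here is a minimal standalone sketch (not the launcher's actual code, just my reading of the symptom) of what happens when the 5 requested processes are split evenly across 2 machines instead of following the hostfile slots:

```python
import math

world_size = 5                  # 4 GPUs on node 0 + 1 GPU on node 1
num_machines = 2
gpus_per_node = {0: 4, 1: 1}    # what the hostfile actually specifies

# An even split across machines (roughly what the launcher appears to assume):
procs_per_node = math.ceil(world_size / num_machines)   # -> 3

for machine_rank, num_gpus in gpus_per_node.items():
    for local_rank in range(procs_per_node):
        if local_rank >= num_gpus:
            print(f"machine {machine_rank}: local_rank {local_rank} "
                  f"maps to no CUDA device (only {num_gpus} present)")
```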
Expected behavior
In both cases, standard and pdsh (but pdsh in particular, since it takes a hostfile with the number of slots per host), I would expect Accelerate to launch as many processes on each host as that host has available devices, rather than just attempting to split the processes evenly across hosts.
For further context, my full cluster will have 9 devices: two nodes with 4 and one node with 1. To answer why the weird setup: it just seems like such a waste to let my RTX A6000 sit out of all the fun my 3090s are having, plus the additional system RAM and VRAM it brings. 😄
I'm not entirely sure where it's getting these from, but it seems like it's ignoring the config file entirely.
It looks like the issue stems from this part of the accelerate launcher: https://github.com/huggingface/accelerate/blob/4ad4d28c49a9818e985ea12d66a89fe73fe73c87/src/accelerate/utils/launch.py#L309
It's not considering each machine's devices or the slots specified in the DeepSpeed hostfile or anything like that.
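For illustration, here is a minimal sketch of what reading the hostfile slots could look like; `parse_hostfile_slots` is a hypothetical helper, not an existing Accelerate function:

```python
import re

def parse_hostfile_slots(path):
    """Hypothetical helper: map each host in a DeepSpeed hostfile to its slot count."""
    slots = {}
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and blank lines
            if not line:
                continue
            match = re.match(r"(\S+)\s+slots=(\d+)", line)
            if match:
                slots[match.group(1)] = int(match.group(2))
    return slots

# e.g. {"node-4gpu": 4, "node-1gpu": 1} -> launch 4 processes on one host and 1 on the other,
# instead of assuming the same number of processes on every machine.
```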
Correct. We don't support that currently. At least on the torch side, I believe they require the same number of devices per node.
Does DeepSpeed not?
I had looked into it, and I don't believe an equal number of devices per node is strictly required by either torch or DeepSpeed. Nothing in NCCL or MPI / DeepSpeed's launcher would make me immediately think that.
With that said, from my end I'm more than happy to implement changes here to support this and contribute them back; I just want to make sure that's desirable from your end.
@muellerzr It works fine if I just comment out that line above in a fork; I'd like your opinion on how to proceed here.
Attached is a screenshot showing the training session (a custom loop using Accelerate) in the top right.
The one-GPU machine is printing nvidia-smi output on the left, and the 4-GPU box is on the bottom right.
For more context, DeepSpeed can handle arbitrary arrangements of GPUs during training; it will respect whatever number of slots is set in the provided hostfile.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.