
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.

Ofir408 opened this issue 1 year ago · 7 comments

Hi, I tried to load a large model with DeepSpeed ZeRO-3 across multiple GPUs (the model is large, so I need to split its weights). However, when launching ZeRO-3 with accelerate I got the following error:

ssh://benshoho@<ip>/home/benshoho/.conda/envs/accelerate_venv/bin/python -u -m accelerate.commands.launch /home/benshoho/projects/others/temp/demo.py
[2024-03-11 00:00:42,019] torch.distributed.run: [WARNING] 
[2024-03-11 00:00:42,019] torch.distributed.run: [WARNING] *****************************************
[2024-03-11 00:00:42,019] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-03-11 00:00:42,019] torch.distributed.run: [WARNING] *****************************************
/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
[2024-03-11 00:00:45,224] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-11 00:00:45,224] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-11 00:00:45,224] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-11 00:00:47,132] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-11 00:00:47,132] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-11 00:00:47,132] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-11 00:00:47,133] [INFO] [comm.py:637:init_distributed] cdb=None
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Traceback (most recent call last):
  File "/home/benshoho/projects/others/temp/demo.py", line 19, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2941, in from_pretrained
    raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
Traceback (most recent call last):
  File "/home/benshoho/projects/others/temp/demo.py", line 19, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2941, in from_pretrained
    raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
Traceback (most recent call last):
  File "/home/benshoho/projects/others/temp/demo.py", line 19, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2941, in from_pretrained
    raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
[2024-03-11 00:00:52,077] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3702 closing signal SIGTERM
[2024-03-11 00:00:52,241] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3694) of binary: /home/benshoho/.conda/envs/accelerate_venv/bin/python
Traceback (most recent call last):
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1033, in <module>
    main()
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1029, in main
    launch_command(args)
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    deepspeed_launcher(args)
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/benshoho/projects/others/temp/demo.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-11_00:00:52
  host      : ise-6000-02
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3701)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-11_00:00:52
  host      : ise-6000-02
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3694)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Process finished with exit code 1

accelerate env:

- `Accelerate` version: 0.26.1
- Platform: Linux-3.10.0-1160.90.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.10.9
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 503.65 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 3
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Here is my code:

from accelerate import Accelerator
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

accelerator = Accelerator()

max_length = 2048
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=max_length)
tokenizer.pad_token = tokenizer.eos_token


model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             device_map={"": accelerator.process_index},
                                             ).eval()
model = accelerator.prepare_model(model)
model.config.pad_token_id = tokenizer.pad_token_id

responses = []
for i in range(100):
    text = "how much is 1+1?"

    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(accelerator.local_process_index)
    outputs = model(**inputs)
    print("got response")

How can I use ZeRO-3 and accelerate at the same time? Thank you very much!

Ofir408 avatar Mar 10 '24 22:03 Ofir408

As the error message says, DeepSpeed ZeRO-3 can't accept a `device_map` argument. You can remove `device_map={"": accelerator.process_index}`:

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             # device_map={"": accelerator.process_index},
                                             ).eval()
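
If you want the same script to also run without DeepSpeed, here is a minimal sketch (assuming a transformers version where the helper lives under `transformers.integrations.deepspeed`; `accelerator` and `model_id` are the ones defined in the script above) that only passes `device_map` when ZeRO-3 is not active:

import torch
from transformers import AutoModelForCausalLM
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

extra_kwargs = {}
if not is_deepspeed_zero3_enabled():
    # Only safe when ZeRO-3 is not active; under ZeRO-3 this kwarg triggers the ValueError above.
    extra_kwargs["device_map"] = {"": accelerator.process_index}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    **extra_kwargs,
).eval()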

fancyerii avatar Mar 11 '24 06:03 fancyerii

@fancyerii Right. But how can I use ZeRO-3 and accelerate at the same time? My model is large (70B), so I need to split its weights across multiple GPUs.

Ofir408 avatar Mar 11 '24 06:03 Ofir408

Use `accelerate config`. You can take my configs below as a reference. I have two nodes with 16 GPUs in total.

master node:

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: ip address of master node
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

2nd node:

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 1
main_process_ip: ip address of master node
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Then run accelerate launch .... on each node.
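
For example, a minimal sketch of that launch step (the YAML file names below are placeholders for wherever you saved the two configs above, and demo.py stands in for your own script):

# on the master node (machine_rank: 0)
accelerate launch --config_file master_node.yaml demo.py

# on the 2nd node (machine_rank: 1)
accelerate launch --config_file second_node.yaml demo.py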

fancyerii avatar Mar 11 '24 07:03 fancyerii

Thanks. What do you pass to `from_pretrained`?

Ofir408 avatar Mar 11 '24 07:03 Ofir408

meta-llama/Llama-2-70b-chat

fancyerii avatar Mar 11 '24 09:03 fancyerii

@fancyerii

Without `device_map` and without moving the model to a device? Did you run it with `accelerate launch`?

Ofir408 avatar Mar 11 '24 10:03 Ofir408

Yes, DeepSpeed will manage sharding the model across the processes. Accelerate already integrates PyTorch DDP/FSDP, DeepSpeed ZeRO, and Megatron-LM, so you can run the same code with any of these distributed training frameworks; you just use `accelerate config` to tell Accelerate which one to use. That's why Accelerate is useful.

I think you should read the basic concepts in the Accelerate docs before training your model.
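
For reference, a minimal ZeRO-3-friendly version of the demo.py from this issue (a sketch, not a tested script: it keeps the structure of the original, drops device_map and low_cpu_mem_usage, and moves the inputs to accelerator.device so the same code also works with the other backends):

from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

accelerator = Accelerator()  # picks up the ZeRO-3 settings from `accelerate config`

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=2048)
tokenizer.pad_token = tokenizer.eos_token

# No device_map / low_cpu_mem_usage here: with zero3_init_flag enabled, the
# weights are partitioned across the ranks while they are being loaded.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval()
model = accelerator.prepare_model(model)  # as in the original script
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("how much is 1+1?", return_tensors="pt").to(accelerator.device)
with torch.no_grad():
    outputs = model(**inputs)
print(f"rank {accelerator.process_index}: logits shape {tuple(outputs.logits.shape)}")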

fancyerii avatar Mar 12 '24 01:03 fancyerii

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 10 '24 15:04 github-actions[bot]