ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
Hi, I tried to load a large model with DeepSpeed ZeRO-3 across multiple GPUs (the model is too big for one GPU, so I need to split its weights). However, when launching ZeRO-3 with Accelerate I got the following error:
ssh://benshoho@<ip>/home/benshoho/.conda/envs/accelerate_venv/bin/python -u -m accelerate.commands.launch /home/benshoho/projects/others/temp/demo.py
[2024-03-11 00:00:42,019] torch.distributed.run: [WARNING]
[2024-03-11 00:00:42,019] torch.distributed.run: [WARNING] *****************************************
[2024-03-11 00:00:42,019] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-11 00:00:42,019] torch.distributed.run: [WARNING] *****************************************
/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
[2024-03-11 00:00:45,224] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-11 00:00:45,224] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-11 00:00:45,224] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-11 00:00:47,132] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-11 00:00:47,132] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-11 00:00:47,132] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-11 00:00:47,133] [INFO] [comm.py:637:init_distributed] cdb=None
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Traceback (most recent call last):
File "/home/benshoho/projects/others/temp/demo.py", line 19, in <module>
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2941, in from_pretrained
raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
Traceback (most recent call last):
File "/home/benshoho/projects/others/temp/demo.py", line 19, in <module>
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2941, in from_pretrained
raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
Traceback (most recent call last):
File "/home/benshoho/projects/others/temp/demo.py", line 19, in <module>
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2941, in from_pretrained
raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
[2024-03-11 00:00:52,077] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3702 closing signal SIGTERM
[2024-03-11 00:00:52,241] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3694) of binary: /home/benshoho/.conda/envs/accelerate_venv/bin/python
Traceback (most recent call last):
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1033, in <module>
main()
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1029, in main
launch_command(args)
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
deepspeed_launcher(args)
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
distrib_run.run(args)
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/benshoho/.conda/envs/accelerate_venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/benshoho/projects/others/temp/demo.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-03-11_00:00:52
host : ise-6000-02
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3701)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-11_00:00:52
host : ise-6000-02
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3694)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Process finished with exit code 1
accelerate env:
- `Accelerate` version: 0.26.1
- Platform: Linux-3.10.0-1160.90.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.10.9
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 503.65 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 3
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Here is my code:
from accelerate import Accelerator
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
accelerator = Accelerator()
max_length = 2048
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=max_length)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map={"": accelerator.process_index},
).eval()
model = accelerator.prepare_model(model)
model.config.pad_token_id = tokenizer.pad_token_id
responses = []
for i in range(100):
    text = "how much is 1+1?"
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(accelerator.local_process_index)
    outputs = model(**inputs)
    print("got response")
How can I use ZeRO-3 and Accelerate at the same time? Thank you very much!
As the error message says, DeepSpeed ZeRO-3 can't accept a `device_map` argument. You can remove `device_map={"": accelerator.process_index}`:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    # device_map={"": accelerator.process_index},
).eval()
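For reference, here is a minimal sketch of what the corrected script could look like under ZeRO-3, when launched with `accelerate launch` and the ZeRO-3 config shown in `accelerate env` above. This is an illustration based on the `demo.py` above with the fix applied, not a tested end-to-end script; the checkpoint, the `prepare_model` call, and the prompt are taken from the original code.

from accelerate import Accelerator
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

accelerator = Accelerator()
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# No device_map and no low_cpu_mem_usage: with zero3_init_flag enabled in the
# accelerate config, DeepSpeed takes care of partitioning the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval()

# Wrapping the model hands it to the DeepSpeed engine, which shards it
# across the processes started by `accelerate launch`.
model = accelerator.prepare_model(model)
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("how much is 1+1?", return_tensors="pt").to(accelerator.device)
with torch.no_grad():
    outputs = model(**inputs)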
@fancyerii Right. But how can I use ZeRO-3 and Accelerate at the same time? My model is large (70B), and I need to split its weights across multiple GPUs.
Use `accelerate config`. You can take my configs below as a reference. I have two nodes with 16 GPUs in total.
master node:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: ip address of master node
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
2nd node:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 1
main_process_ip: ip address of master node
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Then run accelerate launch .... on each node.
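For example, with the configs above saved on each node, the launch might look something like this on both machines (the config file path here is a placeholder, not from the thread):

accelerate launch --config_file /path/to/node_config.yaml demo.py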
Thanks. What do you have in the from_pretrained?
meta-llama/Llama-2-70b-chat
@fancyerii Without `device_map` and without moving the model to a device? Did you run it with `accelerate launch`?
Yes, DeepSpeed will shard the model across the processes. Accelerate already integrates PyTorch DDP/FSDP, DeepSpeed ZeRO, and Megatron-LM, so you can run the same code with any of these distributed training frameworks; you just use `accelerate config` to tell Accelerate which one to use. That's why Accelerate is useful.
I think you should read the basic concepts in the Accelerate docs before training your model.
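As a small illustration of that point (a sketch, not code from this thread): the script stays the same, and the backend selected via `accelerate config` can be inspected at runtime.

from accelerate import Accelerator

accelerator = Accelerator()

# Which backend is active (DDP, FSDP, DeepSpeed, ...) is decided by the
# `accelerate config` file used at launch time, not by the script itself.
print(accelerator.distributed_type)        # e.g. DistributedType.DEEPSPEED
print(accelerator.state.deepspeed_plugin)  # holds the ZeRO settings when DeepSpeed is active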
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.