Incorrect device placement when used with quantization_config
System Info
- `Accelerate` version: 0.31.0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /remote-home/xhwang/anaconda3/envs/gloq/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1007.54 GB
- GPU type: NVIDIA A800-SXM4-80GB
- `Accelerate` default config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: MULTI_GPU
  - mixed_precision: no
  - use_cpu: False
  - debug: False
  - num_processes: 2
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: all
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - enable_cpu_affinity: False
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate, or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
Hi! I want to load my model on the CPU initially and later apply DDP to some sub-modules with accelerator.prepare(). Here is a simple reproduction:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import Accelerator
import torch

model = AutoModelForCausalLM.from_pretrained(
    'models/Llama-2-7b-hf-2bit-64rank-5iter',  # base model obtained by LoftQ, should be equivalent to 'LoftQ/Llama-2-7b-hf-2bit-64rank'
    torch_dtype=torch.bfloat16,
    device_map='cpu',
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4'
    ),
)

sub_module = model.model.layers[10]
accelerator = Accelerator()
sub_module = accelerator.prepare(sub_module)
print(sub_module.device)  # cuda:0

dummy_inputs = torch.randn(1, 2048, 4096).to(accelerator.device)
position_ids = torch.arange(2048).unsqueeze(0).to(accelerator.device)
output = sub_module(dummy_inputs, position_ids=position_ids)
```
This fails with `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!`
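For reference, here is the quick check I used to see what accelerate attached to the layer (just a sketch: `_hf_hook` is the attribute accelerate stores its hooks on, and I am assuming the interesting field is its `execution_device`):

```python
# Diagnostic sketch: inspect the hook accelerate attached to the original layer.
layer = model.model.layers[10]
hook = getattr(layer, "_hf_hook", None)
print(type(hook), getattr(hook, "execution_device", None))
```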
It looks like the tensors are moved back to CPU by the device-alignment hooks that accelerate attached while loading the quantized model. Is this expected? Is there a workaround? Thanks for any help!
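For what it's worth, the closest thing to a workaround I could come up with (an untested sketch: I am assuming the CPU placement really does come from the attached hooks, and I am not sure whether stripping them is safe for the bitsandbytes-quantized weights):

```python
from accelerate.hooks import remove_hook_from_module

sub_module = model.model.layers[10]
# Drop the hooks accelerate attached during loading before preparing the layer.
remove_hook_from_module(sub_module, recurse=True)
# Move the layer to the accelerator device explicitly, then let prepare() wrap it.
sub_module = sub_module.to(accelerator.device)
sub_module = accelerator.prepare(sub_module)
```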
Expected behavior
Both the prepared sub-module's weights and the input tensors should end up on the CUDA device, so the forward pass runs without a device-mismatch error.
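Concretely, I would expect a check like the following to pass after `prepare()` (a sketch of the behavior I expect, not of what currently happens):

```python
prepared = accelerator.prepare(model.model.layers[10])
# All weights of the prepared sub-module should live on the GPU,
# so a forward pass with inputs on the accelerator device should not raise.
assert all(p.device.type == "cuda" for p in prepared.parameters())
```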