Incorrect device placement when used with quantization_config
System Info
- `Accelerate` version: 0.31.0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /remote-home/xhwang/anaconda3/envs/gloq/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1007.54 GB
- GPU type: NVIDIA A800-SXM4-80GB
- `Accelerate` default config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: MULTI_GPU
  - mixed_precision: no
  - use_cpu: False
  - debug: False
  - num_processes: 2
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: all
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - enable_cpu_affinity: False
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate, or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
Hi! I want to load my model on the CPU initially and later apply DDP to some sub-modules with accelerator.prepare(). Here is a simple reproduction:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import Accelerator
import torch

model = AutoModelForCausalLM.from_pretrained(
    'models/Llama-2-7b-hf-2bit-64rank-5iter',  # base model obtained by LoftQ, should be equivalent to 'LoftQ/Llama-2-7b-hf-2bit-64rank'
    torch_dtype=torch.bfloat16,
    device_map='cpu',
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4'
    ),
)

sub_module = model.model.layers[10]
accelerator = Accelerator()
sub_module = accelerator.prepare(sub_module)
print(sub_module.device)  # cuda:0

dummy_inputs = torch.randn(1, 2048, 4096).to(accelerator.device)
position_ids = torch.arange(2048).unsqueeze(0).to(accelerator.device)
output = sub_module(dummy_inputs, position_ids=position_ids)
```
This fails with `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!`
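For reference, here is the quick check I used to see what accelerate attached to the layer (just a sketch: `_hf_hook` is the attribute accelerate stores its hooks on, and I am assuming the interesting field is its `execution_device`):

```python
# Diagnostic sketch: inspect the hook accelerate attached to the original layer.
layer = model.model.layers[10]
hook = getattr(layer, "_hf_hook", None)
print(type(hook), getattr(hook, "execution_device", None))
```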
It looks like the tensors are moved back to CPU by the device-alignment hooks that accelerate attached while loading the quantized model. Is this expected? Is there a workaround? Thanks for any help!
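For what it's worth, the closest thing to a workaround I could come up with (an untested sketch: I am assuming the CPU placement really does come from the attached hooks, and I am not sure whether stripping them is safe for the bitsandbytes-quantized weights):

```python
from accelerate.hooks import remove_hook_from_module

sub_module = model.model.layers[10]
# Drop the hooks accelerate attached during loading before preparing the layer.
remove_hook_from_module(sub_module, recurse=True)
# Move the layer to the accelerator device explicitly, then let prepare() wrap it.
sub_module = sub_module.to(accelerator.device)
sub_module = accelerator.prepare(sub_module)
```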
Expected behavior
Both the prepared sub-module's weights and the input tensors should end up on the CUDA device, so the forward pass runs without a device-mismatch error.
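Concretely, I would expect a check like the following to pass after `prepare()` (a sketch of the behavior I expect, not of what currently happens):

```python
prepared = accelerator.prepare(model.model.layers[10])
# All weights of the prepared sub-module should live on the GPU,
# so a forward pass with inputs on the accelerator device should not raise.
assert all(p.device.type == "cuda" for p in prepared.parameters())
```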