
Multi GPU with custom device map and 4bit bnb quant

Open · amrothemich opened this issue 11 months ago

System Info

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.28.0
- Platform: Linux-5.15.0-1056-azure-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 212.59 GB
- GPU type: Tesla V100-PCIE-16GB
- `Accelerate` default config:
	Not found

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I'm fine-tuning a 4-bit quantized Mistral model with QLoRA (via peft and bitsandbytes) on 2 GPUs. I want to force the model to split across the two devices: it fits on a single GPU at load time (so balanced/sequential won't give me the split I need), but it immediately OOMs when tuning. When I specify the device_map below, which is based on a map I get from infer_auto_device_map() with max_memory set artificially low (a sketch of that step follows the code), I get the following error:

ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

Here's the code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from accelerate import dispatch_model

modelpath="BioMistral/BioMistral-7B"

# Load the (slow) tokenizer; the fast tokenizer sometimes ignores added tokens
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False)

tokenizer.pad_token = tokenizer.eos_token

from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model 

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    modelpath,    
    device_map='sequential',
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Add LoRA adapters to the model
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
config = LoraConfig(
    r=128, 
    lora_alpha=64,
    target_modules = [
        "q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"
    ],
    lora_dropout=0.05, 
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
model.config.use_cache = False

device_map = {
    "base_model.model.model.embed_tokens": 0,
    "base_model.model.model.layers.0": 0,
    "base_model.model.model.layers.1": 0,
    "base_model.model.model.layers.2": 0,
    "base_model.model.model.layers.3": 0,
    "base_model.model.model.layers.4": 0,
    "base_model.model.model.layers.5": 0,
    "base_model.model.model.layers.6": 0,
    "base_model.model.model.layers.7": 0,
    "base_model.model.model.layers.8": 0,
    "base_model.model.model.layers.9": 0,
    "base_model.model.model.layers.10": 0,
    "base_model.model.model.layers.11": 0,
    "base_model.model.model.layers.12": 0,
    "base_model.model.model.layers.13": 0,
    "base_model.model.model.layers.14": 0,
    "base_model.model.model.layers.15": 0,
    "base_model.model.model.layers.16": 0,
    "base_model.model.model.layers.17": 0,
    "base_model.model.model.layers.18": 0,
    "base_model.model.model.layers.19": 1,
    "base_model.model.model.layers.20": 1,
    "base_model.model.model.layers.21": 1,
    "base_model.model.model.layers.22": 1,
    "base_model.model.model.layers.23": 1,
    "base_model.model.model.layers.24": 1,
    "base_model.model.model.layers.25": 1,
    "base_model.model.model.layers.26": 1,
    "base_model.model.model.layers.27": 1,
    "base_model.model.model.layers.28": 1,
    "base_model.model.model.layers.29": 1,
    "base_model.model.model.layers.30": 1,
    "base_model.model.model.layers.31": 1,
    "base_model.model.model.norm": 1,
    "base_model.model.lm_head": 1,
    "base_model.model.model.layers.19.mlp": 1
}

model = dispatch_model(model, device_map=device_map)
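
For reference, this is roughly how I generated the map above (a sketch, not my exact code; the max_memory values are illustrative), run against the PEFT-wrapped model so the keys carry the base_model prefix:

from accelerate import infer_auto_device_map

# Deliberately low per-GPU limits force the map to spread layers across both GPUs.
auto_map = infer_auto_device_map(
    model,
    max_memory={0: "8GiB", 1: "16GiB"},
    no_split_module_classes=["MistralDecoderLayer"],
)
print(auto_map)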

Expected behavior

I want the model to be split across the GPUs as described in the device_map.
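
A minimal way to inspect the resulting placement (assuming hf_device_map is populated, which Transformers and Accelerate do whenever a device_map is applied):

# Final module-to-device assignment recorded by Transformers/Accelerate.
print(model.hf_device_map)

# Devices that actually hold parameters after dispatch.
print({p.device for p in model.parameters()})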

amrothemich avatar Mar 12 '24 20:03 amrothemich

Hi @amrothemich, you need to pass your custom device_map when you load your model:

# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    modelpath,    
    device_map=custom_device_map,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
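
One caveat worth noting: when the map is passed to from_pretrained, it is matched against the bare model before the PEFT wrapper exists, so the keys should use the plain module names rather than the base_model.model.* prefix. A sketch of what that might look like, assuming the standard MistralForCausalLM layout and an illustrative split point:

custom_device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(19)},      # layers 0-18 on GPU 0
    **{f"model.layers.{i}": 1 for i in range(19, 32)},  # layers 19-31 on GPU 1
    "model.norm": 1,
    "lm_head": 1,
}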

LMK if this works!

SunMarc avatar Mar 13 '24 16:03 SunMarc

Thank you, this did work!

amrothemich avatar Mar 19 '24 19:03 amrothemich