Multi GPU with custom device map and 4bit bnb quant
System Info
- `Accelerate` version: 0.28.0
- Platform: Linux-5.15.0-1056-azure-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 212.59 GB
- GPU type: Tesla V100-PCIE-16GB
- `Accelerate` default config:
Not found
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
I'm tuning a Mistral QLoRA quant (via peft and bitsandbytes) on 2 GPUs. I want to force the model to split across both devices: it fits on a single GPU (so balanced/sequential won't split it), but it then immediately OOMs when tuning. When I try to specify a custom map as below (based on the map I get when setting max_memory arbitrarily low and using infer_auto_device_map(); sketched after the code), I get the following error:
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
Here's the code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from accelerate import dispatch_model
modelpath = "BioMistral/BioMistral-7B"
# Load the (slow) tokenizer; the fast tokenizer sometimes ignores added tokens
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=torch.bfloat16
)
# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
modelpath,
device_map='sequential',
quantization_config=bnb_config,
torch_dtype=torch.bfloat16,
)
# Add LoRA adapters to the model
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
config = LoraConfig(
r=128,
lora_alpha=64,
target_modules = [
"q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
model.config.use_cache = False
device_map = {
"base_model.model.model.embed_tokens": 0,
"base_model.model.model.layers.0": 0,
"base_model.model.model.layers.1": 0,
"base_model.model.model.layers.2": 0,
"base_model.model.model.layers.3": 0,
"base_model.model.model.layers.4": 0,
"base_model.model.model.layers.5": 0,
"base_model.model.model.layers.6": 0,
"base_model.model.model.layers.7": 0,
"base_model.model.model.layers.8": 0,
"base_model.model.model.layers.9": 0,
"base_model.model.model.layers.10": 0,
"base_model.model.model.layers.11": 0,
"base_model.model.model.layers.12": 0,
"base_model.model.model.layers.13": 0,
"base_model.model.model.layers.14": 0,
"base_model.model.model.layers.15": 0,
"base_model.model.model.layers.16": 0,
"base_model.model.model.layers.17": 0,
"base_model.model.model.layers.18": 0,
"base_model.model.model.layers.19": 1,
"base_model.model.model.layers.20": 1,
"base_model.model.model.layers.21": 1,
"base_model.model.model.layers.22": 1,
"base_model.model.model.layers.23": 1,
"base_model.model.model.layers.24": 1,
"base_model.model.model.layers.25": 1,
"base_model.model.model.layers.26": 1,
"base_model.model.model.layers.27": 1,
"base_model.model.model.layers.28": 1,
"base_model.model.model.layers.29": 1,
"base_model.model.model.layers.30": 1,
"base_model.model.model.layers.31": 1,
"base_model.model.model.norm": 1,
"base_model.model.lm_head": 1,
"base_model.model.model.layers.19.mlp": 1
}
model = dispatch_model(model, device_map=device_map)
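For reference, the map above was derived roughly along these lines. This is only a sketch: the "10GiB" max_memory caps and the MistralDecoderLayer class name are illustrative, not the exact values I used.

```python
from accelerate import infer_auto_device_map

# Sketch: cap per-GPU memory artificially low so the planner is forced to
# split the model across both devices, then use the result as a template.
# The memory caps and no_split_module_classes value are assumptions.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "10GiB", 1: "10GiB"},
    no_split_module_classes=["MistralDecoderLayer"],
)
print(device_map)
```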
Expected behavior
I want the model to be split across the GPUs as described in the device_map.
Hi @amrothemich, you need to pass your custom device_map
when you load your model:
# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
modelpath,
device_map=custom_device_map,
quantization_config=bnb_config,
torch_dtype=torch.bfloat16,
)
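For example, a minimal sketch of the full call, assuming you rewrite the map keys to the base model's module names (the `base_model.model.` prefix only exists after `get_peft_model`, so at load time the keys look like `model.layers.0` rather than `base_model.model.model.layers.0`):

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: same 0-18 / 19-31 split as the map in the report, expressed in the
# base model's module names; adjust the split points to your hardware.
custom_device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(0, 19)},
    **{f"model.layers.{i}": 1 for i in range(19, 32)},
    "model.norm": 1,
    "lm_head": 1,
}

model = AutoModelForCausalLM.from_pretrained(
    modelpath,                       # defined in the script above
    device_map=custom_device_map,
    quantization_config=bnb_config,  # defined in the script above
    torch_dtype=torch.bfloat16,
)
```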
LMK if this works!
Thank you, this did work!