OOM with a 30B model on 2x 24GB GPUs
On an Ubuntu system with 2x RTX 3090, I am trying to run a QLoRA finetune of a 30B model (WizardLM-30B-fp16), splitting the single model across both cards because I don't think it fits on one card. Both cards have 0.315 GB of VRAM in use before the finetune starts.
I have read that 30B models can be QLoRA-finetuned on a single 24GB GPU like the 3090. Why can I only train with a per-device batch size of 1, and only when using at least two 24GB GPUs? At this rate the run will take over a month.
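For reference, the per-card usage can be checked from Python before loading anything; this is just a quick sketch using torch.cuda.mem_get_info, and the numbers should roughly match nvidia-smi.

import torch

# Report used/total VRAM per card (device-wide, so it includes other processes)
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 1024**3:.3f} GB in use of {total / 1024**3:.1f} GB")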
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "TheBloke/WizardLM-30B-fp16"

# 4-bit NF4 quantization with double quantization and bf16 compute
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=nf4_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()
model.config.use_cache = False
model.resize_token_embeddings(len(tokenizer))

config = LoraConfig(
    r=256,
    lora_alpha=512,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, config)
model.print_trainable_parameters()
trainable params: 408944640 || all params: 16886824448 || trainable%: 2.4216787546958454
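As a rough back-of-envelope (assuming the adapter weights are trained in fp32 and the optimizer is a plain Adam/AdamW keeping two fp32 moment buffers; a paged or 8-bit optimizer would change these numbers), an r=256 adapter of this size already accounts for several GB on top of the 4-bit base weights:

trainable = 408_944_640            # from the printout above
bytes_per_param = 4 + 4 + 4 + 4    # fp32 weight + fp32 grad + 2 Adam moments (assumption)
adapter_gb = trainable * bytes_per_param / 1024**3
print(f"~{adapter_gb:.1f} GB for adapter weights, grads and optimizer state")  # ~6.1 GB

base_params = 16_886_824_448 - trainable   # non-trainable (4-bit) parameters
base_gb = base_params * 0.5 / 1024**3      # ~0.5 byte/param for NF4, ignoring quant constants
print(f"~{base_gb:.1f} GB for the 4-bit base model")  # ~7.7 GB, before activations/cache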
Using a single card (2nd GPU):
per_device_train_batch_size=1
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    **more_training_args,
)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 794.00 MiB (GPU 0; 23.69 GiB total capacity; 21.16 GiB already allocated; 23.06 MiB free; 22.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
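The allocator hint in the traceback can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the process touches CUDA; note this only mitigates fragmentation and will not help if the run genuinely needs more VRAM. A minimal sketch:

# Must be set before the first CUDA allocation (or exported in the shell:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# before launching the script). The 128 MiB value is only an example.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"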
per_device_train_batch_size=2
TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    **more_training_args,
)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.55 GiB (GPU 0; 23.69 GiB total capacity; 20.11 GiB already allocated; 297.06 MiB free; 22.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Using both cards:
per_device_train_batch_size=1
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    **more_training_args,
)
Works!
GPU0 MEM 22.0/24.0 GB
GPU1 MEM 22.5/24.0 GB
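A rough sanity check on those readings (keeping in mind they include PyTorch's cached/reserved memory): even at batch size 1 the split run occupies more total VRAM than a single 24 GB card offers, which is consistent with the single-card OOM above.

gpu0_used, gpu1_used = 22.0, 22.5    # GB, readings above (includes PyTorch's cache)
total_used = gpu0_used + gpu1_used   # ~44.5 GB across both cards
single_card = 24.0                   # GB available on one 3090
print(total_used > single_card)      # True: this config does not fit on one card as-is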
per_device_train_batch_size=2
TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    **more_training_args,
)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.55 GiB (GPU 1; 23.69 GiB total capacity; 19.56 GiB already allocated; 749.06 MiB free; 21.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Your LoRA rank might be too high (r = 256 in your config). I also wouldn't recommend going above an effective batch size of 1; it seems to negatively affect the train loss with QLoRA.
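For illustration only, a lower-rank adapter config could look like the sketch below; the r/alpha values are arbitrary examples, not numbers from this thread. Since LoRA parameter count scales linearly with r, dropping from 256 to 16 would cut the adapter's trainable parameters (and its gradient/optimizer memory) roughly 16x.

from peft import LoraConfig

# Illustrative low-rank config (values are examples, not a recommendation from the thread)
config = LoraConfig(
    r=16,              # much smaller rank than the r=256 used above
    lora_alpha=32,     # often kept proportional to r
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)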
@AlpinDale Is the effective batch size equal to the value of per_device_train_batch_size?
Effective batch size is equal to per_device_train_batch_size * gradient_accumulation_steps.
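Worked out for the failing two-card run above:

per_device_train_batch_size = 2
gradient_accumulation_steps = 2
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 4, versus the effective batch size of 1 recommended above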
So @AlpinDale you are saying both of those settings should be set to 1?
@Tostino Yes. Keep in mind though that an effective batch size of 1 results in a very slow training time.
What happens if you only lower the LoRA rank? Can you use larger batch sizes then?