OOM with a 30B model on 2x 24GB GPUs
On an Ubuntu system with 2x RTX 3090, I am trying to run a QLoRA finetune of a 30B model (WizardLM-30B-fp16), splitting the single model across both cards because I don't think it fits on one card. Both cards have 0.315 GB of VRAM in use before the finetune starts.
I have read that 30B models can be QLoRA-finetuned on a single 24GB GPU like the 3090. Why can I only train with a per-device batch size of 1, and only when using at least two 24GB GPUs? At this rate the run will take over a month.
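For reference, the per-card usage can be checked from Python before loading anything; this is just a quick sketch using torch.cuda.mem_get_info, and the numbers should roughly match nvidia-smi.

import torch

# Report used/total VRAM per card (device-wide, so it includes other processes)
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 1024**3:.3f} GB in use of {total / 1024**3:.1f} GB")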
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "TheBloke/WizardLM-30B-fp16"

# 4-bit NF4 quantization with double quantization and bf16 compute
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=nf4_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()
model.config.use_cache = False
model.resize_token_embeddings(len(tokenizer))

config = LoraConfig(
    r=256,
    lora_alpha=512,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, config)
model.print_trainable_parameters()
trainable params: 408944640 || all params: 16886824448 || trainable%: 2.4216787546958454
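As a rough back-of-envelope (assuming the adapter weights are trained in fp32 and the optimizer is a plain Adam/AdamW keeping two fp32 moment buffers; a paged or 8-bit optimizer would change these numbers), an r=256 adapter of this size already accounts for several GB on top of the 4-bit base weights:

trainable = 408_944_640            # from the printout above
bytes_per_param = 4 + 4 + 4 + 4    # fp32 weight + fp32 grad + 2 Adam moments (assumption)
adapter_gb = trainable * bytes_per_param / 1024**3
print(f"~{adapter_gb:.1f} GB for adapter weights, grads and optimizer state")  # ~6.1 GB

base_params = 16_886_824_448 - trainable   # non-trainable (4-bit) parameters
base_gb = base_params * 0.5 / 1024**3      # ~0.5 byte/param for NF4, ignoring quant constants
print(f"~{base_gb:.1f} GB for the 4-bit base model")  # ~7.7 GB, before activations/cache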
Using a single card (2nd GPU):
per_device_train_batch_size=1
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    **more_training_args,
)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 794.00 MiB (GPU 0; 23.69 GiB total capacity; 21.16 GiB already allocated; 23.06 MiB free; 22.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
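The allocator hint in the traceback can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the process touches CUDA; note this only mitigates fragmentation and will not help if the run genuinely needs more VRAM. A minimal sketch:

# Must be set before the first CUDA allocation (or exported in the shell:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# before launching the script). The 128 MiB value is only an example.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"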
per_device_train_batch_size=2
TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    **more_training_args,
)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.55 GiB (GPU 0; 23.69 GiB total capacity; 20.11 GiB already allocated; 297.06 MiB free; 22.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Using both cards:
per_device_train_batch_size=1
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    **more_training_args,
)
Works!
GPU0 MEM 22.0/24.0 GB
GPU1 MEM 22.5/24.0 GB
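A rough sanity check on those readings (keeping in mind they include PyTorch's cached/reserved memory): even at batch size 1 the split run occupies more total VRAM than a single 24 GB card offers, which is consistent with the single-card OOM above.

gpu0_used, gpu1_used = 22.0, 22.5    # GB, readings above (includes PyTorch's cache)
total_used = gpu0_used + gpu1_used   # ~44.5 GB across both cards
single_card = 24.0                   # GB available on one 3090
print(total_used > single_card)      # True: this config does not fit on one card as-is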
per_device_train_batch_size=2
TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    **more_training_args,
)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.55 GiB (GPU 1; 23.69 GiB total capacity; 19.56 GiB already allocated; 749.06 MiB free; 21.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Your LoRA rank might be too high (r = 256 in your config). I also wouldn't recommend going above an effective batch size of 1; it seems to negatively affect the train loss with QLoRA.
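For illustration only, a lower-rank adapter config could look like the sketch below; the r/alpha values are arbitrary examples, not numbers from this thread. Since LoRA parameter count scales linearly with r, dropping from 256 to 16 would cut the adapter's trainable parameters (and its gradient/optimizer memory) roughly 16x.

from peft import LoraConfig

# Illustrative low-rank config (values are examples, not a recommendation from the thread)
config = LoraConfig(
    r=16,              # much smaller rank than the r=256 used above
    lora_alpha=32,     # often kept proportional to r
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)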
@AlpinDale Is the effective batch size equal to the value of per_device_train_batch_size?
Effective batch size is equal to per_device_train_batch_size * gradient_accumulation_steps.
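Worked out for the failing two-card run above:

per_device_train_batch_size = 2
gradient_accumulation_steps = 2
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 4, versus the effective batch size of 1 recommended above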
So @AlpinDale you are saying both of those settings should be set to 1?
@Tostino Yes. Keep in mind though that an effective batch size of 1 results in a very slow training time.
What happens if you only lower the LoRA rank? Can you use larger batch sizes then?