litgpt Higher memory use with QLoRA

Changing only one line in the config file, namely

quantize: bnb.nf4

increased the memory usage from 14 GB to 18 GB.

Epoch 5 | iter 965 step 965 | loss train: 1.182, val: 1.057 | iter time: 743.31 ms (step)
Training time: 583.36s
Memory used: 14.49 GB
Saving LoRA weights to 'out/finetune/lora-tiny-llama-1.1b/final/lit_model.pth.lora'

Epoch 5 | iter 965 step 965 | loss train: 1.201, val: 1.078 | iter time: 812.38 ms (step)
Training time: 622.99s
Memory used: 18.15 GB
Saving LoRA weights to 'out/finetune/qlora-tiny-llama-1.1b/final/lit_model.pth.lora'

Is this perhaps expected with smaller models due to some BnB inefficiency?
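
To put numbers on the gap, here is a small sketch that parses the two log tails above and prints the differences. The log strings are copied from the runs above; the helper name `parse_run` is made up for illustration and is not part of litgpt.

```python
import re

# Final log lines copied from the two runs above.
LORA_LOG = """Training time: 583.36s
Memory used: 14.49 GB"""
QLORA_LOG = """Training time: 622.99s
Memory used: 18.15 GB"""


def parse_run(log: str) -> dict:
    """Extract training time (s) and reported memory (GB) from a log tail."""
    time_s = float(re.search(r"Training time: ([\d.]+)s", log).group(1))
    mem_gb = float(re.search(r"Memory used: ([\d.]+) GB", log).group(1))
    return {"time_s": time_s, "mem_gb": mem_gb}


lora, qlora = parse_run(LORA_LOG), parse_run(QLORA_LOG)
mem_delta = qlora["mem_gb"] - lora["mem_gb"]
time_delta = qlora["time_s"] - lora["time_s"]

# Report absolute and relative differences, QLoRA relative to LoRA.
print(f"memory: {mem_delta:+.2f} GB ({mem_delta / lora['mem_gb']:+.1%})")
print(f"time:   {time_delta:+.2f} s ({time_delta / lora['time_s']:+.1%})")
```

So QLoRA uses about 3.66 GB more memory and is a bit slower per run in this setup.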

qlora.yaml:

# The path to the base model's checkpoint directory to load for finetuning. (type: <class 'Path'>, default: checkpoints/stabilityai/stablelm-base-alpha-3b)
checkpoint_dir: checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T

# Directory in which to save checkpoints and logs. (type: <class 'Path'>, default: out/lora)
out_dir: out/finetune/qlora-tiny-llama-1.1b

# The precision to use for finetuning. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
precision: bf16-true

# If set, quantize the model with this algorithm. See ``tutorials/quantize.md`` for more information. (type: Optional[Literal['nf4', 'nf4-dq', 'fp4', 'fp4-dq', 'int8-training']], default: null)
quantize: bnb.nf4

# How many devices/GPUs to use. (type: Union[int, str], default: 1)
devices: 1

# The LoRA rank. (type: int, default: 8)
lora_r: 32

# The LoRA alpha. (type: int, default: 16)
lora_alpha: 16

# The LoRA dropout value. (type: float, default: 0.05)
lora_dropout: 0.05

# Whether to apply LoRA to the query weights in attention. (type: bool, default: True)
lora_query: true

# Whether to apply LoRA to the key weights in attention. (type: bool, default: False)
lora_key: false

# Whether to apply LoRA to the value weights in attention. (type: bool, default: True)
lora_value: true

# Whether to apply LoRA to the output projection in the attention block. (type: bool, default: False)
lora_projection: false

# Whether to apply LoRA to the weights of the MLP in the attention block. (type: bool, default: False)
lora_mlp: false

# Whether to apply LoRA to output head in GPT. (type: bool, default: False)
lora_head: false

# Data-related arguments. If not provided, the default is ``litgpt.data.Alpaca``.
data:
  class_path: litgpt.data.Alpaca2k
  init_args:
    mask_prompt: false
    val_split_fraction: 0.03847
    prompt_style: alpaca
    ignore_index: -100
    seed: 42
    num_workers: 4

# Training-related arguments. See ``litgpt.args.TrainArgs`` for details
train:

  # Number of optimizer steps between saving checkpoints (type: Optional[int], default: 1000)
  save_interval: 800

  # Number of iterations between logging calls (type: int, default: 1)
  log_interval: 1

  # Number of samples between optimizer steps across data-parallel ranks (type: int, default: 128)
  global_batch_size: 8

  # Number of samples per data-parallel rank (type: int, default: 4)
  micro_batch_size: 8

  # Number of iterations with learning rate warmup active (type: int, default: 100)
  lr_warmup_steps: 10

  # Number of epochs to train on (type: Optional[int], default: 5)
  epochs: 4

  # Total number of tokens to train on (type: Optional[int], default: null)
  max_tokens:

  # Limits the number of optimizer steps to run. (type: Optional[int], default: null)
  max_steps:

  # Limits the length of samples. Off by default (type: Optional[int], default: null)
  max_seq_length: 4096

  # Whether to tie the embedding weights with the language modeling head weights. (type: Optional[bool], default: null)
  tie_embeddings:

  #   (type: float, default: 0.0003)
  learning_rate: 0.0002

  #   (type: float, default: 0.02)
  weight_decay: 0.0

  #   (type: float, default: 0.9)
  beta1: 0.9

  #   (type: float, default: 0.95)
  beta2: 0.95

  #   (type: Optional[float], default: null)
  max_norm:

  #   (type: float, default: 6e-05)
  min_lr: 6.0e-05

# Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
eval:

  # Number of optimizer steps between evaluation calls (type: int, default: 100)
  interval: 100

  # Number of tokens to generate (type: Optional[int], default: 100)
  max_new_tokens: 100

  # Number of iterations (type: int, default: 100)
  max_iters: 100

# The name of the logger to send metrics to. (type: Literal['wandb', 'tensorboard', 'csv'], default: csv)
logger_name: csv

# The random seed to use for reproducibility. (type: int, default: 1337)
seed: 1337

The lora.yaml is exactly the same, except that quantize: bnb.nf4 is changed to quantize: (left empty).

Mar 13 '24 22:03 rasbt

Can you share the hparams printed on the command line, in case this is a parsing issue?

Mar 13 '24 22:03 carmocca

Sure, here they are with LoRA:

{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'data': Alpaca2k(mask_prompt=False, val_split_fraction=0.03847, prompt_style=<litgpt.prompts.Alpaca object at 0x7f1e24a89d90>, ignore_index=-100, seed=42, num_workers=4, download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=100, max_new_tokens=100, max_iters=100),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 32,
 'lora_value': True,
 'out_dir': PosixPath('out/finetune/lora-tiny-llama-1.1b'),
 'precision': 'bf16-true',
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=800, log_interval=1, global_batch_size=8, micro_batch_size=8, lr_warmup_steps=10, epochs=4, max_tokens=None, max_steps=None, max_seq_length=512, tie_embeddings=None, learning_rate=0.0002, weight_decay=0.0, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05)}

and here with QLoRA:

{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'data': Alpaca2k(mask_prompt=False, val_split_fraction=0.03847, prompt_style=<litgpt.prompts.Alpaca object at 0x7faef18474f0>, ignore_index=-100, seed=42, num_workers=4, download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=100, max_new_tokens=100, max_iters=100),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 32,
 'lora_value': True,
 'out_dir': PosixPath('out/finetune/qlora-tiny-llama-1.1b'),
 'precision': 'bf16-true',
 'quantize': 'bnb.nf4',
 'seed': 1337,
 'train': TrainArgs(save_interval=800, log_interval=1, global_batch_size=8, micro_batch_size=8, lr_warmup_steps=10, epochs=4, max_tokens=None, max_steps=None, max_seq_length=512, tie_embeddings=None, learning_rate=0.0002, weight_decay=0.0, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05)}
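
To rule out a parsing issue, the two dumps can be compared key by key. Below is a minimal sketch; the dictionaries are a hand-flattened subset of the values printed above, and `diff_hparams` is a hypothetical helper, not a litgpt function.

```python
def diff_hparams(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Values transcribed by hand from the LoRA dump above (subset, flattened).
lora = {
    "precision": "bf16-true",
    "quantize": None,
    "lora_r": 32,
    "lora_alpha": 16,
    "out_dir": "out/finetune/lora-tiny-llama-1.1b",
    "micro_batch_size": 8,
    "max_seq_length": 512,
}
# The QLoRA dump: same values except for quantize and out_dir.
qlora = dict(
    lora,
    quantize="bnb.nf4",
    out_dir="out/finetune/qlora-tiny-llama-1.1b",
)

differences = diff_hparams(lora, qlora)
for key, (lora_value, qlora_value) in differences.items():
    print(f"{key}: {lora_value!r} -> {qlora_value!r}")
```

Only `out_dir` and `quantize` differ, so the two runs are configured identically apart from quantization.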

Mar 13 '24 22:03 rasbt

Hmm, I tried again, and for TinyLlama that's true: memory use goes up with QLoRA. But for Llama 2 7B, memory use with QLoRA was significantly lower.

# Finetune Llama 2 7B (22.45 GB)
# litgpt finetune lora --config configs/llama-2-7b/lora.yaml

# QLoRA (14.5 GB)
# litgpt finetune lora --config configs/llama-2-7b/qlora.yaml


# Finetune TinyLlama (12.1 GB)
# litgpt finetune lora --config configs/tinyllama.yaml

# QLoRA (15 GB)
litgpt finetune lora --config configs/tinyllama.yaml --quantize bnb.nf4

We need to investigate why this is model-specific.
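
As a rough sanity check, here is a back-of-the-envelope estimate of how much weight memory NF4 could save at best for each model. This is only a sketch under assumptions that are not from the runs above: nominal parameter counts of 1.1e9 and 7e9, every parameter quantized (optimistic, since LoRA adapters and some layers stay in higher precision), and NF4 stored as 4-bit codes plus one fp32 scale per block of 64 weights, without double quantization.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


BF16_BITS = 16
# Assumption: 4-bit codes plus one 32-bit scale shared by each block of 64 weights.
NF4_BITS = 4 + 32 / 64

# Assumption: nominal parameter counts, not exact ones.
models = [("TinyLlama 1.1B", 1.1e9), ("Llama 2 7B", 7e9)]

for name, n_params in models:
    bf16 = weight_memory_gb(n_params, BF16_BITS)
    nf4 = weight_memory_gb(n_params, NF4_BITS)
    print(f"{name}: bf16 {bf16:.2f} GB, nf4 {nf4:.2f} GB, best-case saving {bf16 - nf4:.2f} GB")
```

Under these assumptions the best-case saving is about 1.6 GB for TinyLlama versus about 10 GB for the 7B model. For TinyLlama the weights are a small part of the 12 to 14 GB total, so any extra overhead from quantization could plausibly outweigh the saving. This does not explain where the extra memory comes from, though.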

Mar 14 '24 13:03 awaelchli

Maybe the next thing to investigate here is to try bnb 0.42.0 or 0.43.0. I ran this with 0.41.0 above.
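
When comparing across versions, it helps to record the installed version next to each measurement. A minimal sketch using only the standard library (`installed_version` is a made-up helper name; `bitsandbytes` is the package name on PyPI):

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions most relevant to this comparison.
for package in ("bitsandbytes", "torch"):
    print(f"{package}: {installed_version(package)}")
```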

Mar 14 '24 13:03 rasbt