Hyperparameters used in LoRA not consistent with those of Adapter fine-tuning
Hi, below are the hyperparameters defined in finetune/adapter.py:

```python
# Hyperparameters
learning_rate = 3e-3
batch_size = 64 / devices
micro_batch_size = 4
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0
epoch_size = 50000  # train dataset size
num_epochs = 5
max_iters = num_epochs * (epoch_size // micro_batch_size) // devices
weight_decay = 0.02
warmup_steps = 2 * (epoch_size // micro_batch_size) // devices // gradient_accumulation_iters  # 2 epochs
```
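For concreteness, plugging in a single device (devices = 1 is my assumption here, not something adapter.py fixes) gives roughly these numbers:

```python
# Back-of-the-envelope check of the adapter.py formulas, assuming devices = 1
devices = 1
batch_size = 64 / devices                        # 64.0
micro_batch_size = 4
gradient_accumulation_iters = batch_size // micro_batch_size          # 16.0
epoch_size = 50000                               # train dataset size
num_epochs = 5
max_iters = num_epochs * (epoch_size // micro_batch_size) // devices  # 62500 micro-batch iterations
warmup_steps = 2 * (epoch_size // micro_batch_size) // devices // gradient_accumulation_iters  # 1562.0

print(max_iters * micro_batch_size)              # 250000 samples, i.e. 5 passes over the 50k train set
```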
and finetune/lora.py:

```python
# Hyperparameters
learning_rate = 3e-4
batch_size = 128
micro_batch_size = 4
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0
max_iters = 50000  # train dataset size
weight_decay = 0.01
lora_r = 8
lora_alpha = 16
lora_dropout = 0.05
warmup_steps = 100
```
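Doing the same back-of-the-envelope check for lora.py (again assuming devices = 1, a 50k-sample train set, and that one iteration consumes one micro-batch, as in adapter.py):

```python
# Same check for the lora.py values; devices = 1 and epoch_size = 50000 are my assumptions
micro_batch_size = 4
batch_size = 128                                 # note: not divided by devices here
gradient_accumulation_iters = batch_size // micro_batch_size          # 32
max_iters = 50000                                # counted in micro-batch iterations
epoch_size = 50000                               # train dataset size, taken from adapter.py

samples_seen = max_iters * micro_batch_size      # 200000
epochs_seen = samples_seen / epoch_size          # 4.0 epochs vs 5 in adapter.py
optimizer_steps = max_iters // gradient_accumulation_iters            # 1562
print(samples_seen, epochs_seen, optimizer_steps)
```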
Issue
The role of certain hyperparameters used in LoRA isn't very clear. For example:
- max_iters per device in Adapter is a logical calculation based on num_epochs, epoch_size (set to the train dataset size) and micro_batch_size. However, max_iters in LoRA is set directly to the train dataset size.
- batch_size is n / devices in Adapter while it's just n in LoRA.
- similarly for warmup_steps, which is derived from epoch_size, devices and gradient_accumulation_iters in Adapter but hard-coded to 100 in LoRA.
I understand points 2 and 3 are subjective and the per-device calculation may already be assumed, but doesn't it seem like the max_iters set for LoRA is not right?
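For comparison, here is a sketch of what lora.py's hyperparameters would look like if they followed the same convention as adapter.py. This is purely illustrative (devices = 1 assumed), not a proposal for exact values:

```python
# Hypothetical lora.py hyperparameters written with adapter.py's convention (illustrative sketch)
devices = 1                                      # assumption for the example
learning_rate = 3e-4
batch_size = 128 / devices
micro_batch_size = 4
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0
epoch_size = 50000                               # train dataset size
num_epochs = 5
max_iters = num_epochs * (epoch_size // micro_batch_size) // devices  # 62500, not 50000
weight_decay = 0.01
warmup_steps = 2 * (epoch_size // micro_batch_size) // devices // gradient_accumulation_iters  # ~781
```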