Hyperparameters used in LoRA not consistent with those of Adapter fine-tuning
Hi, below are the hyperparameters defined in finetune/adapter.py:

```python
# Hyperparameters
learning_rate = 3e-3
batch_size = 64 / devices
micro_batch_size = 4
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0
epoch_size = 50000  # train dataset size
num_epochs = 5
max_iters = num_epochs * (epoch_size // micro_batch_size) // devices
weight_decay = 0.02
warmup_steps = 2 * (epoch_size // micro_batch_size) // devices // gradient_accumulation_iters  # 2 epochs
```
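For concreteness, plugging in a single device (devices = 1 is my assumption here, not something adapter.py fixes) gives roughly these numbers:

```python
# Back-of-the-envelope check of the adapter.py formulas, assuming devices = 1
devices = 1
batch_size = 64 / devices                        # 64.0
micro_batch_size = 4
gradient_accumulation_iters = batch_size // micro_batch_size          # 16.0
epoch_size = 50000                               # train dataset size
num_epochs = 5
max_iters = num_epochs * (epoch_size // micro_batch_size) // devices  # 62500 micro-batch iterations
warmup_steps = 2 * (epoch_size // micro_batch_size) // devices // gradient_accumulation_iters  # 1562.0

print(max_iters * micro_batch_size)              # 250000 samples, i.e. 5 passes over the 50k train set
```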
and finetune/lora.py:

```python
# Hyperparameters
learning_rate = 3e-4
batch_size = 128
micro_batch_size = 4
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0
max_iters = 50000  # train dataset size
weight_decay = 0.01
lora_r = 8
lora_alpha = 16
lora_dropout = 0.05
warmup_steps = 100
```
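Doing the same back-of-the-envelope check for lora.py (again assuming devices = 1, a 50k-sample train set, and that one iteration consumes one micro-batch, as in adapter.py):

```python
# Same check for the lora.py values; devices = 1 and epoch_size = 50000 are my assumptions
micro_batch_size = 4
batch_size = 128                                 # note: not divided by devices here
gradient_accumulation_iters = batch_size // micro_batch_size          # 32
max_iters = 50000                                # counted in micro-batch iterations
epoch_size = 50000                               # train dataset size, taken from adapter.py

samples_seen = max_iters * micro_batch_size      # 200000
epochs_seen = samples_seen / epoch_size          # 4.0 epochs vs 5 in adapter.py
optimizer_steps = max_iters // gradient_accumulation_iters            # 1562
print(samples_seen, epochs_seen, optimizer_steps)
```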
Issue
The role of certain hyperparameters used in LoRA isn't very clear. For example:
- max_iters per device in Adapter is a logical calculation based on num_epochs, epoch_size (set to the train dataset size) and micro_batch_size. However, max_iters in LoRA is set directly to the train dataset size.
- batch_size is n / devices in Adapter while it's just n in LoRA.
- similarly for warmup_steps, which is derived from epoch_size, devices and gradient_accumulation_iters in Adapter but hard-coded to 100 in LoRA.
I understand points 2 and 3 are subjective and the per-device calculation may already be assumed, but doesn't it seem like the max_iters set for LoRA is not right?
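For comparison, here is a sketch of what lora.py's hyperparameters would look like if they followed the same convention as adapter.py. This is purely illustrative (devices = 1 assumed), not a proposal for exact values:

```python
# Hypothetical lora.py hyperparameters written with adapter.py's convention (illustrative sketch)
devices = 1                                      # assumption for the example
learning_rate = 3e-4
batch_size = 128 / devices
micro_batch_size = 4
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0
epoch_size = 50000                               # train dataset size
num_epochs = 5
max_iters = num_epochs * (epoch_size // micro_batch_size) // devices  # 62500, not 50000
weight_decay = 0.01
warmup_steps = 2 * (epoch_size // micro_batch_size) // devices // gradient_accumulation_iters  # ~781
```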