GaLore
Third-party benchmark
Hello, thank you very much for such excellent work. We have conducted some experiments using LLaMA-Factory, and the results indicate that GaLore can significantly reduce memory usage during full-parameter fine-tuning. We utilized the 8-bit AdamW optimizer and pure bfloat16 training with gradient checkpointing. GaLore requires only 18GB of VRAM to train a Llama-2 7B model, while the standard 8-bit AdamW optimizer requires at least 40GB of VRAM. We provide reproducible scripts for SFT training here: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/extras/galore/galore_adamw_8bit_bf16.sh
| Optimizer | Rank | Retain grad | Memory | Token/s |
|---|---|---|---|---|
| 8-bit AdamW | - | Yes | 40GB | 1434 |
| 8-bit GaLore | 16 | Yes | 28GB | 1532 |
| 8-bit GaLore | 128 | Yes | 29GB | 1532 |
| 16-bit GaLore | 128 | Yes | 30GB | 1615 |
| 16-bit GaLore | 128 | No | 18GB | 1587 |
| 8-bit GaLore | 1024 | Yes | 36GB | 1238 |
* We omitted the time spent computing the SVD for GaLore (done once every `update_proj_gap` steps); it takes around 10 minutes for a 7B model.
- model: LLaMA-2 7B
- device: NVIDIA A100
- token batch size: 512
- activation checkpointing: enabled
- flash attention: disabled
Experiment results last updated: Mar 9th. TODO: add loss convergence results.
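As a rough intuition for why the savings shrink as the rank grows (28GB at rank 16 vs 36GB at rank 1024 above), here is a back-of-the-envelope sketch. The layer shape, the ~1 byte per 8-bit Adam state element, and the bf16 projection matrix are assumptions for illustration, not measurements from this benchmark:

```python
# Back-of-the-envelope, per-layer optimizer-state estimate (assumptions, not
# measurements): an m x n = 4096 x 4096 weight, 8-bit Adam keeping two states
# at ~1 byte per element, and GaLore keeping those states for the projected
# r x n gradient plus an m x r projection matrix in bf16 (2 bytes).

def adam8bit_state_mib(m, n):
    # exp_avg + exp_avg_sq over the full m x n gradient
    return 2 * m * n / 2**20

def galore_adam8bit_state_mib(m, n, r):
    # Adam states over the r x n projected gradient, plus the m x r projector
    return (2 * r * n + 2 * m * r) / 2**20

m = n = 4096
for r in (16, 128, 1024):
    print(f"rank={r:4d}: 8-bit AdamW ~{adam8bit_state_mib(m, n):.1f} MiB/layer"
          f" -> GaLore ~{galore_adam8bit_state_mib(m, n, r):.1f} MiB/layer")
```

At rank 16 or 128 the optimizer state per layer is a small fraction of the full-gradient state, while at rank 1024 roughly half of it comes back, which matches the trend in the table.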
> Hello, thank you very much for such excellent work. We have conducted some experiments using LLaMA-Factory, and the results indicate that GaLore can significantly reduce memory usage during full-parameter fine-tuning. We utilized the 8-bit AdamW optimizer and pure bfloat16 training with gradient checkpointing. GaLore requires only 28GB of VRAM to train a Llama-2 7B model, while the standard 8-bit AdamW optimizer requires at least 42GB of VRAM. GaLore also demonstrates superior training speed, achieving about 130% of the throughput. We provide reproducible scripts for SFT training here: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/extras/galore/galore_adamw_8bit_bf16.sh
>
> | | GaLore 8-bit AdamW | 8-bit AdamW |
> |---|---|---|
> | VRAM | 28GB | 42GB |
> | Speed | 1.59 it/s | 1.14 it/s |
Thank you for sharing! Have you checked accuracy benchmarks too?
@samuelazran nope, but the loss curve is pretty good for me
@hiyouga It would be interesting to benchmark some state-of-the-art LLMs on a few tasks from the LLM leaderboard. The accuracy reported on the GLUE benchmark using pre-trained RoBERTa-Base doesn't seem to increase by a large margin.
GaLore is way better than LoRA and its variants in terms of loss, based on my small-scale experiment, even though this was not a rigorous study and only serves as a preliminary test.
Hi @hiyouga, I am trying out GaLore with this repo. However, I am experiencing very low throughput on an A6000. How did you manage to get >1 it/s? In addition, if I understand correctly, GaLore reduces O(N) operations (element-wise scaling) but adds more O(N^3) operations (SVD and projections) on top of Adam-8bit; how is it faster instead?
@yongchanghao Sorry, we might have missed some experimental details. We used `gradient_accumulation_steps=2` in the above experiments. We advise using a larger batch size with fewer gradient accumulation steps for better throughput. Regarding the complexity issue, the SVD is only performed once every `update_proj_gap` steps, so GaLore performs better in most cases during training.
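For readers wondering what "once every `update_proj_gap` steps" looks like in practice, here is a minimal sketch of the amortization idea. This is a simplification, not galore_torch's actual implementation, and the shapes and hyperparameters are illustrative:

```python
# Minimal sketch (not galore_torch's real code) of how the SVD cost is
# amortized: the projector P is refreshed every `update_proj_gap` steps,
# all other steps only pay a small matmul.
import torch

def galore_project(grad, state, rank=128, update_proj_gap=200):
    # grad: the full (m, n) gradient of one linear layer.
    if state["P"] is None or state["step"] % update_proj_gap == 0:
        # Expensive part, done once every `update_proj_gap` steps:
        # an SVD to refresh the m x r orthonormal projector.
        U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
        state["P"] = U[:, :rank].to(grad.dtype)
    state["step"] += 1
    # Cheap part, done every step: project the gradient down; the matching
    # up-projection of the update is omitted here.
    return state["P"].T @ grad  # (r, n) gradient fed to the inner 8-bit Adam

state = {"step": 0, "P": None}
g = torch.randn(1024, 4096)            # e.g. one weight's gradient
print(galore_project(g, state).shape)  # torch.Size([128, 4096])
```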
> @yongchanghao Sorry, we might have missed some experimental details. We used `gradient_accumulation_steps=4` in the above experiments. Using the provided script (with ga=8) reduces the throughput by half. We advise using a larger batch size with fewer gradient accumulation steps for better throughput. Regarding the complexity issue, the SVD is only performed once every `update_proj_gap` steps, so GaLore performs better in most cases during training.
@hiyouga I'm also confused about why GaLore can improve throughput without increasing batch_size. Actually, the paper mentions that it "induces 17% overhead compared to 8-bit Adam implementation", and Table 8 shows that 8-bit GaLore is slower than Adam8bit.
@pkumc The previous results were somewhat unfair indeed. Now we have adjusted the experimental setup and updated the results. When the rank is small (<128), GaLore still has better throughput. I guess it may be because GaLore has fewer FLOPs in training. Regarding the data reported in the paper, we have discussed it with the author, and it may be due to different hardware with different GEMM performance.
@hiyouga Thanks for the update. I feel the current data make more sense.
For future readers' reference, my preliminary experience aligns well with the data reported in https://github.com/jiaweizzhao/GaLore/issues/3#issuecomment-1985411364
Hello everyone! I'm also experimenting with the GaLore optimizer and the loss curves look great! But I don't see any benefit in memory usage; GaLore even uses a bit more.
I'm comparing a TinyLlama full-parameter finetune with the same batch size and learning rate, FA2 enabled: `GaLoreAdamW8bit` (🟢) versus `adamw_8bit` (🔵). I'm not sure whether the optimizer parameters I picked (`scale=1`, `rank=1024`, and `update_proj_gap=200`) make sense.
Loss looks good with these settings, but:
- shouldn't `GaLoreAdamW8bit` use less memory?
- and shouldn't it also be faster?
Code:
import uuid
import torch
import torch.nn as nn
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, TrainingArguments,
    get_constant_schedule, set_seed,
)
from trl import SFTTrainer
from galore_torch import GaLoreAdamW8bit

use_galore = True
lr = 1e-5
modelpath = "../models/TinyLlama-1.1B-intermediate-step-1431k-3T"

set_seed(42)
run_id = f"use_galore-{use_galore}-{str(uuid.uuid4())}"

model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False)
train_dataset = load_dataset('imdb', split='train')

def load_galore_optimizer(model, target_modules_list=["attn", "mlp"]):
    # Collect the linear weights in the attention/MLP blocks for GaLore.
    galore_params = []
    for module_name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if not any(target_key in module_name for target_key in target_modules_list):
            continue
        galore_params.append(module.weight)
        print(module_name)
    id_galore_params = {id(p) for p in galore_params}
    # Everything else is optimized as usual, without low-rank projection.
    regular_params = [p for p in model.parameters() if id(p) not in id_galore_params]
    param_groups = [
        dict(params=regular_params),
        dict(
            params=galore_params,
            rank=1024,
            update_proj_gap=200,
            scale=1,
            proj_type="std",
        ),
    ]
    optimizer = GaLoreAdamW8bit(param_groups, lr=lr)
    scheduler = get_constant_schedule(optimizer)
    return optimizer, scheduler

args = TrainingArguments(
    output_dir=run_id,
    optim="adamw_8bit",  # only used when use_galore is False
    logging_steps=1,
    max_steps=100,
    per_device_train_batch_size=16,
    learning_rate=lr,
    lr_scheduler_type="constant",
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
    optimizers=load_galore_optimizer(model) if use_galore else (None, None),
    args=args,
)
trainer.train()
What might be the issue here? Any input highly appreciated!
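One way to make the comparison concrete is to log peak allocator statistics after each run. A minimal sketch, reusing the `trainer` built in the snippet above and the same `torch.cuda` counters used in the benchmark script later in this thread (single-GPU assumption):

```python
# Run once with use_galore=True and once with use_galore=False,
# then compare the peak allocator statistics of the two runs.
import torch

torch.cuda.reset_peak_memory_stats()
trainer.train()
print(f"max_memory_allocated: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
print(f"max_memory_reserved:  {torch.cuda.max_memory_reserved() / 2**20:.0f} MiB")
```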
I believe you need to do GaLore layer by layer in order to save memory, as in https://github.com/jiaweizzhao/GaLore/blob/a6bc1650984b1c090a4e108d7c0e3109ee7ad844/torchrun_main.py#L334
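For reference, a condensed sketch of that per-layer pattern, loosely following the linked script: one small optimizer per targeted weight, stepped from a post-accumulate-grad hook so each layer's gradient is freed immediately. The helper name, the target-module filter, and the hyperparameters are illustrative, not the script's exact code:

```python
# Condensed sketch of layer-wise GaLore: per-parameter optimizers stepped from
# post-accumulate-grad hooks, so full gradients for all layers never coexist.
import torch.nn as nn
from galore_torch import GaLoreAdamW8bit

def setup_layerwise_galore(model, lr=1e-5, rank=128, update_proj_gap=200, scale=1.0):
    # Targeted weights: linear layers inside attention/MLP blocks (illustrative filter).
    galore_params = [
        module.weight
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and any(k in name for k in ("attn", "mlp"))
    ]

    # One small optimizer per parameter, holding only that layer's GaLore states.
    optimizer_dict = {
        p: GaLoreAdamW8bit(
            [{"params": [p], "rank": rank, "update_proj_gap": update_proj_gap,
              "scale": scale, "proj_type": "std"}],
            lr=lr,
        )
        for p in galore_params
    }

    def optimizer_hook(p):
        # Fires as soon as p.grad is fully accumulated: update this layer's
        # weight, then drop its gradient right away.
        if p.grad is None:
            return
        optimizer_dict[p].step()
        optimizer_dict[p].zero_grad()

    for p in galore_params:
        p.register_post_accumulate_grad_hook(optimizer_hook)  # PyTorch >= 2.1

    # Note: non-GaLore parameters (embeddings, norms, ...) still need a regular
    # optimizer, and any LR scheduler has to be wired per optimizer; both are
    # omitted here for brevity.
    return optimizer_dict
```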
> GaLore is way better than LoRA and its variants in terms of loss, based on my small-scale experiment, even though this was not a rigorous study and only serves as a preliminary test.

Hello, I'd like to know which data and model you used to achieve this result?
This is not formal research. Although GaLore reduces the amount of memory used, it is undeniable that GaLore increased the training time by a factor of three in my test. That increase in time is not friendly to LLM training.
This is the test code:
'''
#install
conda create --name test python=3.11
conda activate test
export CUDA_HOME=xxxxxxx
export LD_LIBRARY_PATH=$CUDA_HOME"/lib64:$LD_LIBRARY_PATH"
export PATH=$CUDA_HOME"/bin:$PATH"
pip install -U transformers trl datasets
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install galore-torch
HF-supported optimizers:
['adamw_hf', 'adamw_torch', 'adamw_torch_fused', 'adamw_torch_xla', 'adamw_torch_npu_fused', 'adamw_apex_fused', 'adafactor', 'adamw_anyprecision', 'sgd', 'adagrad', 'adamw_bnb_8bit', 'adamw_8bit', 'lion_8bit', 'lion_32bit',
'paged_adamw_32bit', 'paged_adamw_8bit', 'paged_lion_32bit', 'paged_lion_8bit', 'rmsprop', 'rmsprop_bnb', 'rmsprop_bnb_8bit', 'rmsprop_bnb_32bit',
'galore_adamw', 'galore_adamw_8bit', 'galore_adafactor',
'galore_adamw_layerwise', 'galore_adamw_8bit_layerwise', 'galore_adafactor_layerwise']
'''
import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl, time
train_dataset = datasets.load_dataset('imdb', split='train')
args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="adamw_hf",  # for the GaLore run, switch to one of the galore_* optimizers listed above
    optim_target_modules=["attn", "mlp"]
)
model_id = "Qwen/Qwen1.5-0.5B"
#model_id = "Qwen/Qwen1.5-4B"
#model_id = "Qwen/Qwen1.5-7B"
#model_id = "mistralai/Mistral-7B-v0.1"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)
trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)
start_time = time.time()
trainer.train()
train_time = time.time()-start_time
print(f"=====================================================")
print(f"Time Used: {train_time:.2f} s")
print(f"memory_allocated: {torch.cuda.memory_allocated()/1024.0/1024.0:.2f} MB")
print(f"max_memory_allocated: {torch.cuda.max_memory_allocated()/1024.0/1024.0:.2f} MB")
print(f"memory_reserved: {torch.cuda.memory_reserved()/1024.0/1024.0:.2f} MB")
print(f"max_memory_reserved: {torch.cuda.max_memory_reserved()/1024.0/1024.0:.2f} MB")
print(f"free memory: {torch.cuda.mem_get_info()[0]/1024.0/1024.0:.2f} MB")
print(f"=====================================================")
Thanks for providing your results @WangRongsheng. We are working on efficiency optimization and you can expect a big throughput boost in the next version. For train_loss, did you tune the lr for GaLore?
> Thanks for providing your results @WangRongsheng. We are working on efficiency optimization and you can expect a big throughput boost in the next version. For train_loss, did you tune the lr for GaLore?
I will do it.