
Third-party benchmark

Open hiyouga opened this issue 3 months ago • 15 comments

Hello, thank you very much for such excellent work. We have conducted some experiments using LLaMA-Factory, and the results indicate that GaLore can significantly reduce memory usage during full-parameter fine-tuning. We used the 8-bit AdamW optimizer and pure bfloat16 training with gradient checkpointing. GaLore requires only 18GB of VRAM to train a Llama-2 7B model, while the standard 8-bit AdamW optimizer requires at least 40GB of VRAM. We provide reproducible scripts for SFT training here: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/extras/galore/galore_adamw_8bit_bf16.sh

| Optimizer | Rank | Retain grad | Memory | Token/s |
|---|---|---|---|---|
| 8-bit AdamW | - | Yes | 40GB | 1434 |
| 8-bit GaLore | 16 | Yes | 28GB | 1532 |
| 8-bit GaLore | 128 | Yes | 29GB | 1532 |
| 16-bit GaLore | 128 | Yes | 30GB | 1615 |
| 16-bit GaLore | 128 | No | 18GB | 1587 |
| 8-bit GaLore | 1024 | Yes | 36GB | 1238 |

* We omitted the time spent computing the SVD for GaLore (performed every update_proj_gap steps); it takes around 10 minutes for a 7B model.

  • model: LLaMA-2 7B
  • device: NVIDIA A100
  • token batch size: 512
  • activation checkpointing: enabled
  • flash attention: disabled

Experiment results last updated: Mar 9th. TODO: add loss convergence results.
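
For readers outside LLaMA-Factory, a rough sketch of the same setup built directly on the galore_torch package might look like the following. This is illustrative only: the model ID and the rank / update_proj_gap / scale values are placeholders, and the linked script above remains the authoritative recipe.

import torch
from torch import nn
from transformers import AutoModelForCausalLM
from galore_torch import GaLoreAdamW8bit

# Pure bfloat16 weights plus gradient checkpointing, as in the setup above.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)
model.gradient_checkpointing_enable()

# GaLore is applied to the 2D weights of the attention/MLP linear layers;
# every other parameter stays in a regular 8-bit AdamW group.
galore_params = [
    m.weight for name, m in model.named_modules()
    if isinstance(m, nn.Linear) and any(k in name for k in ("attn", "mlp"))
]
galore_ids = {id(p) for p in galore_params}
regular_params = [p for p in model.parameters() if id(p) not in galore_ids]

optimizer = GaLoreAdamW8bit(
    [
        {"params": regular_params},
        # rank / update_proj_gap / scale below are placeholder values
        {"params": galore_params, "rank": 128, "update_proj_gap": 200,
         "scale": 0.25, "proj_type": "std"},
    ],
    lr=1e-5,
)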

hiyouga avatar Mar 07 '24 17:03 hiyouga

Hello, thank you very much for such excellent work. We have conducted some experiments using LLaMA-Factory, and the results indicate that GaLore can significantly reduce memory usage during full-parameter fine-tuning. We used the 8-bit AdamW optimizer and pure bfloat16 training with gradient checkpointing. GaLore requires only 28GB of VRAM to train a Llama-2 7B model, while the standard 8-bit AdamW optimizer requires at least 42GB of VRAM. GaLore also demonstrates superior training speed, achieving about 130% of the throughput. We provide reproducible scripts for SFT training here: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/extras/galore/galore_adamw_8bit_bf16.sh

| | GRAM | Speed |
|---|---|---|
| 8-bit AdamW GaLore | 28GB | 1.14 it/s |
| 8-bit AdamW | 42GB | 1.59 it/s |

Thank you for sharing! Have you checked accuracy benchmarks too?

samuelazran avatar Mar 07 '24 17:03 samuelazran

@samuelazran nope, but the loss curve is pretty good for me

hiyouga avatar Mar 07 '24 17:03 hiyouga

@hiyouga It would be interesting to benchmark some state-of-the-art LLMs on a few tasks from the LLM leaderboard. The accuracy reported on the GLUE benchmark with pre-trained RoBERTa-Base doesn't seem to increase by a large margin.

monk1337 avatar Mar 07 '24 18:03 monk1337

GaLore is way better than LoRA and its variants in terms of loss, based on my small-scale experiment, even though this was not a rigorous study and only serves as a preliminary test.

Larryvrh avatar Mar 08 '24 02:03 Larryvrh

Hi @hiyouga, I am trying out GaLore with this repo. However, I am experiencing very low throughput on an A6000. How did you manage to get >1 it/s? In addition, if I understand correctly, GaLore reduces O(N) operations (element-wise scaling) but adds more O(N^3) operations (SVD and projections) on top of Adam-8bit, so how is it faster instead?

yongchanghao avatar Mar 08 '24 06:03 yongchanghao

@yongchanghao Sorry, we may have missed some experimental details. We used gradient_accumulation_steps=2 in the above experiments. We advise using a larger batch size with fewer gradient accumulation steps for better throughput. Regarding the complexity concern, the SVD is only performed once every update_proj_gap steps, so GaLore performs better in most cases during training.
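
To make the amortization concrete, here is a simplified sketch of the per-step cost for a single m x n gradient (an illustrative paraphrase, not GaLore's actual implementation; the function and state names are made up):

import torch

def galore_style_update(grad, state, rank=128, update_proj_gap=200, scale=1.0):
    # Refresh the projection matrix only once every `update_proj_gap` steps;
    # this is the expensive SVD that the footnote above excludes from its timings.
    step = state.setdefault("step", 0)
    if step % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
        state["proj"] = U[:, :rank].to(grad.dtype)      # (m, r)
    P = state["proj"]

    low_rank_grad = P.T @ grad                          # (r, n): thin projection
    # ... Adam-style moment updates would act on (r, n) states here ...
    low_rank_update = low_rank_grad                     # placeholder for the Adam step
    full_update = scale * (P @ low_rank_update)         # project back to (m, n)

    state["step"] = step + 1
    return full_update

On this reading, the SVD cost is paid only once every update_proj_gap steps, while every other step adds just two thin matrix products and keeps the element-wise optimizer state at r x n instead of m x n.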

hiyouga avatar Mar 08 '24 11:03 hiyouga

@yongchanghao Sorry, we may have missed some experimental details. We used gradient_accumulation_steps=4 in the above experiments; using the provided script (with ga=8) halves the throughput. We advise using a larger batch size with fewer gradient accumulation steps for better throughput. Regarding the complexity concern, the SVD is only performed once every update_proj_gap steps, so GaLore performs better in most cases during training.

@hiyouga I'm also confused about how GaLore can improve throughput without increasing batch_size. Actually, the paper mentions that GaLore "induces 17% overhead compared to 8-bit Adam implementation", and its Table 8 shows that 8-bit GaLore is slower than Adam8bit.

pkumc avatar Mar 08 '24 12:03 pkumc

@pkumc The previous results were indeed somewhat unfair. We have now adjusted the experimental setup and updated the results. When the rank is small (<128), GaLore still has better throughput; I guess this may be because GaLore involves fewer FLOPs during training. Regarding the data reported in the paper, we have discussed it with the author, and the difference may be due to different hardware with varying GEMM performance.

hiyouga avatar Mar 08 '24 18:03 hiyouga

@hiyouga Thanks for the update. I feel the current data make more sense.

For future readers' reference, my preliminary experience aligns well with the data reported in https://github.com/jiaweizzhao/GaLore/issues/3#issuecomment-1985411364

yongchanghao avatar Mar 08 '24 20:03 yongchanghao

Hello everyone! I'm also experimenting with the GaLore optimizer and the loss curves look great! But I don't see any benefit in memory usage; GaLore even uses a bit more.

I compare a TinyLlama full-parameter finetune with the same batch size and learning rate, FA2 enabled: GaLoreAdamW8bit (🟢) versus adamw_8bit (🔵). I'm not sure whether the optimizer parameters I picked (scale=1, rank=1024, update_proj_gap=200) make sense. The loss looks good with these settings, but:

  • GaLoreAdamW8bit should use less memory right?
  • and, should it not also be faster?

Code:

# imports needed to run this snippet standalone
import uuid

import torch
from torch import nn
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, TrainingArguments,
                          get_constant_schedule, set_seed)
from trl import SFTTrainer
from galore_torch import GaLoreAdamW8bit

use_galore = True
lr = 1e-5
modelpath = "../models/TinyLlama-1.1B-intermediate-step-1431k-3T"
set_seed(42)
run_id = f"use_galore-{use_galore}-{str(uuid.uuid4())}"

model = AutoModelForCausalLM.from_pretrained(
    modelpath,    
    torch_dtype=torch.bfloat16,
    attn_implementation = "flash_attention_2",  
    use_cache = False,
)
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast = False)
train_dataset = load_dataset('imdb', split='train')

def load_galore_optimizer(model, target_modules_list=["attn", "mlp"]):
    # collect the weights of attention/MLP Linear layers for GaLore;
    # all remaining parameters go into a regular (non-GaLore) group
    galore_params = []
    for module_name, module in model.named_modules():
        if not isinstance(module, nn.Linear): continue
        if not any(target_key in module_name for target_key in target_modules_list): continue
        galore_params.append(module.weight)
        print(module_name)
    id_galore_params = {id(p) for p in galore_params}
    regular_params = [p for p in model.parameters() if id(p) not in id_galore_params]

    param_groups = [
        dict(params=regular_params),
        dict(
            params=galore_params,
            rank=1024,
            update_proj_gap=200,
            scale=1,
            proj_type="std",
        ),
    ]
    optimizer = GaLoreAdamW8bit(param_groups, lr=lr)
    scheduler = get_constant_schedule(optimizer)
    return optimizer, scheduler

args = TrainingArguments(
    output_dir = run_id,
    optim = "adamw_8bit",
    logging_steps = 1, 
    max_steps = 100,
    per_device_train_batch_size = 16,
    learning_rate = lr,
    lr_scheduler_type = "constant",
    gradient_checkpointing = True,
)

trainer = SFTTrainer(
    model = model, 
    train_dataset = train_dataset,
    dataset_text_field = 'text',
    max_seq_length = 512,
    optimizers = load_galore_optimizer(model) if use_galore else (None, None),
    args = args,
)
trainer.train()

What might be the issue here? Any input highly appreciated!

geronimi73 avatar Mar 13 '24 20:03 geronimi73

@geronimi73 I believe you need to apply GaLore layer by layer in order to save memory, as in https://github.com/jiaweizzhao/GaLore/blob/a6bc1650984b1c090a4e108d7c0e3109ee7ad844/torchrun_main.py#L334
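
For reference, the layer-wise pattern that link points to looks roughly like the sketch below. This is an illustrative paraphrase rather than the repo's exact code; it assumes a PyTorch version that provides Tensor.register_post_accumulate_grad_hook and, purely for brevity, applies GaLore to every 2-D weight instead of only the attention/MLP projections.

import torch
from galore_torch import GaLoreAdamW8bit

def attach_layerwise_galore(model, rank=128, update_proj_gap=200, scale=0.25, lr=1e-5):
    # One small optimizer per parameter: 2-D weights get GaLore, the rest plain AdamW.
    optimizer_dict = {}
    for p in model.parameters():
        if not p.requires_grad:
            continue
        if p.dim() == 2:
            optimizer_dict[p] = GaLoreAdamW8bit(
                [{"params": [p], "rank": rank, "update_proj_gap": update_proj_gap,
                  "scale": scale, "proj_type": "std"}],
                lr=lr,
            )
        else:
            optimizer_dict[p] = torch.optim.AdamW([p], lr=lr)

    def make_hook(param):
        def hook(*_):
            # Runs right after this parameter's gradient has been accumulated:
            # step its private optimizer and free the gradient immediately, so a
            # full-model gradient buffer is never kept alive for a global step().
            optimizer_dict[param].step()
            optimizer_dict[param].zero_grad()
        return hook

    for p in optimizer_dict:
        p.register_post_accumulate_grad_hook(make_hook(p))

With this pattern the training loop only calls loss.backward(); each parameter's optimizer steps inside its hook, which is presumably also what the "Retain grad: No" row in the first table of this thread corresponds to. In addition, with rank=1024 on a model whose hidden size is around 2048, the projected optimizer states are only about half the size of the full ones, so without the layer-wise trick the saving is small to begin with.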

Larryvrh avatar Mar 14 '24 00:03 Larryvrh

GaLore is way better than LoRA and its variants in terms of loss, based on my small-scale experiment, even though this was not a rigorous study and only serves as a preliminary test.

Hello, may I ask what data and model you used to achieve this result?

Leosgp avatar Mar 21 '24 15:03 Leosgp

This is not formal research. Although GaLore reduces memory usage, it is undeniable that it increased the training time by roughly a factor of three in my test, and such an increase is not friendly for LLM training.

This is the test code:

'''
# Install
conda create --name test python=3.11
conda activate test

export CUDA_HOME=xxxxxxx
export LD_LIBRARY_PATH=$CUDA_HOME"/lib64:$LD_LIBRARY_PATH"
export PATH=$CUDA_HOME"/bin:$PATH"
pip install -U transformers trl datasets
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install galore-torch

HF-supported optimizers:
['adamw_hf', 'adamw_torch', 'adamw_torch_fused', 'adamw_torch_xla', 'adamw_torch_npu_fused', 'adamw_apex_fused', 'adafactor', 'adamw_anyprecision', 'sgd', 'adagrad', 'adamw_bnb_8bit', 'adamw_8bit', 'lion_8bit', 'lion_32bit', 
'paged_adamw_32bit', 'paged_adamw_8bit', 'paged_lion_32bit', 'paged_lion_8bit', 'rmsprop', 'rmsprop_bnb', 'rmsprop_bnb_8bit', 'rmsprop_bnb_32bit', 
'galore_adamw', 'galore_adamw_8bit', 'galore_adafactor', 
'galore_adamw_layerwise', 'galore_adamw_8bit_layerwise', 'galore_adafactor_layerwise']

'''
import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl, time

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="adamw_hf",
    optim_target_modules=["attn", "mlp"]
)

model_id = "Qwen/Qwen1.5-0.5B"   
#model_id = "Qwen/Qwen1.5-4B"
#model_id = "Qwen/Qwen1.5-7B"
#model_id = "mistralai/Mistral-7B-v0.1"

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)  # randomly initialized weights: from_config builds the architecture without loading the pretrained checkpoint

trainer = trl.SFTTrainer(
    model=model, 
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

start_time = time.time()
trainer.train()
train_time = time.time()-start_time

print(f"=====================================================")
print(f"Time Used: {train_time:.2f} s")
print(f"memory_allocated: {torch.cuda.memory_allocated()/1024.0/1024.0:.2f} MB")
print(f"max_memory_allocated: {torch.cuda.max_memory_allocated()/1024.0/1024.0:.2f} MB")
print(f"memory_reserved: {torch.cuda.memory_reserved()/1024.0/1024.0:.2f} MB")
print(f"max_memory_reserved: {torch.cuda.max_memory_reserved()/1024.0/1024.0:.2f} MB")
print(f"free memory: {torch.cuda.mem_get_info()[0]/1024.0/1024.0:.2f} MB")
print(f"=====================================================")

WangRongsheng avatar Apr 05 '24 10:04 WangRongsheng

Thanks for providing your results, @WangRongsheng. We are working on efficiency optimization and you can expect a big throughput boost in the next version. For train_loss, did you tune the lr for GaLore?

jiaweizzhao avatar Apr 05 '24 19:04 jiaweizzhao

I will do it.

WangRongsheng avatar Apr 06 '24 12:04 WangRongsheng