
Third-party benchmark

Open hiyouga opened this issue 3 months ago • 15 comments

Hello, thank you very much for such excellent work. We have conducted some experiments using LLaMA-Factory, and the results indicate that GaLore can significantly reduce memory usage during full-parameter fine-tuning. We used the 8-bit AdamW optimizer and pure bfloat16 training with gradient checkpointing. GaLore requires only 18GB of VRAM to train a Llama-2 7B model, while the standard 8-bit AdamW optimizer requires at least 40GB of VRAM. We provide reproducible scripts for SFT training here: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/extras/galore/galore_adamw_8bit_bf16.sh

| Optimizer | Rank | Retain grad | Memory | Token/s |
|---|---|---|---|---|
| 8-bit AdamW | - | Yes | 40GB | 1434 |
| 8-bit GaLore | 16 | Yes | 28GB | 1532 |
| 8-bit GaLore | 128 | Yes | 29GB | 1532 |
| 16-bit GaLore | 128 | Yes | 30GB | 1615 |
| 16-bit GaLore | 128 | No | 18GB | 1587 |
| 8-bit GaLore | 1024 | Yes | 36GB | 1238 |

* We omitted the time spent computing the SVD for GaLore (performed every update_proj_gap steps); it takes around 10 minutes for a 7B model.

  • model: LLaMA-2 7B
  • device: NVIDIA A100
  • token batch size: 512
  • activation checkpointing: enabled
  • flash attention: disabled

Experiment results last updated: Mar 9th. TODO: add loss convergence results.
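
For readers outside LLaMA-Factory, a rough sketch of the same setup built directly on the galore_torch package might look like the following. This is illustrative only: the model ID and the rank / update_proj_gap / scale values are placeholders, and the linked script above remains the authoritative recipe.

import torch
from torch import nn
from transformers import AutoModelForCausalLM
from galore_torch import GaLoreAdamW8bit

# Pure bfloat16 weights plus gradient checkpointing, as in the setup above.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)
model.gradient_checkpointing_enable()

# GaLore is applied to the 2D weights of the attention/MLP linear layers;
# every other parameter stays in a regular 8-bit AdamW group.
galore_params = [
    m.weight for name, m in model.named_modules()
    if isinstance(m, nn.Linear) and any(k in name for k in ("attn", "mlp"))
]
galore_ids = {id(p) for p in galore_params}
regular_params = [p for p in model.parameters() if id(p) not in galore_ids]

optimizer = GaLoreAdamW8bit(
    [
        {"params": regular_params},
        # rank / update_proj_gap / scale below are placeholder values
        {"params": galore_params, "rank": 128, "update_proj_gap": 200,
         "scale": 0.25, "proj_type": "std"},
    ],
    lr=1e-5,
)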

hiyouga avatar Mar 07 '24 17:03 hiyouga

Hello, thank you very much for such excellent work. We have conducted some experiments using LLaMA-Factory, and the results indicate that GaLore can significantly reduce memory usage during full-parameter fine-tuning. We used the 8-bit AdamW optimizer and pure bfloat16 training with gradient checkpointing. GaLore requires only 28GB of VRAM to train a Llama-2 7B model, while the standard 8-bit AdamW optimizer requires at least 42GB of VRAM. GaLore also demonstrates superior training speed, achieving about 130% of the throughput. We provide reproducible scripts for SFT training here: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/extras/galore/galore_adamw_8bit_bf16.sh

| | GRAM | Speed |
|---|---|---|
| 8-bit AdamW GaLore | 28GB | 1.14 it/s |
| 8-bit AdamW | 42GB | 1.59 it/s |

Thank you for sharing! Have you checked accuracy benchmarks too?

samuelazran avatar Mar 07 '24 17:03 samuelazran

@samuelazran nope, but the loss curve is pretty good for me

hiyouga avatar Mar 07 '24 17:03 hiyouga

@hiyouga It would be interesting to benchmark some state-of-the-art LLMs on a few tasks from the LLM leaderboard. The accuracy reported on the GLUE benchmark with pre-trained RoBERTa-Base doesn't seem to increase by a large margin.

monk1337 avatar Mar 07 '24 18:03 monk1337

GaLore is way better than LoRA and its variants in terms of loss, based on my small-scale experiment, even though this was not a rigorous study and only serves as a preliminary test.

Larryvrh avatar Mar 08 '24 02:03 Larryvrh

Hi @hiyouga, I am trying out GaLore with this repo. However, I am experiencing very low throughput on an A6000. How did you manage to get >1 it/s? In addition, if I understand correctly, GaLore reduces O(N) operations (element-wise scaling) but adds more O(N^3) operations (SVD and projections) on top of Adam-8bit, so how is it faster instead?

yongchanghao avatar Mar 08 '24 06:03 yongchanghao

@yongchanghao Sorry, we may have missed some experimental details. We used gradient_accumulation_steps=2 in the above experiments. We advise using a larger batch size with fewer gradient accumulation steps for better throughput. Regarding the complexity concern, the SVD is only performed once every update_proj_gap steps, so GaLore performs better in most cases during training.
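
To make the amortization concrete, here is a simplified sketch of the per-step cost for a single m x n gradient (an illustrative paraphrase, not GaLore's actual implementation; the function and state names are made up):

import torch

def galore_style_update(grad, state, rank=128, update_proj_gap=200, scale=1.0):
    # Refresh the projection matrix only once every `update_proj_gap` steps;
    # this is the expensive SVD that the footnote above excludes from its timings.
    step = state.setdefault("step", 0)
    if step % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
        state["proj"] = U[:, :rank].to(grad.dtype)      # (m, r)
    P = state["proj"]

    low_rank_grad = P.T @ grad                          # (r, n): thin projection
    # ... Adam-style moment updates would act on (r, n) states here ...
    low_rank_update = low_rank_grad                     # placeholder for the Adam step
    full_update = scale * (P @ low_rank_update)         # project back to (m, n)

    state["step"] = step + 1
    return full_update

On this reading, the SVD cost is paid only once every update_proj_gap steps, while every other step adds just two thin matrix products and keeps the element-wise optimizer state at r x n instead of m x n.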

hiyouga avatar Mar 08 '24 11:03 hiyouga

@yongchanghao Sorry, we may have missed some experimental details. We used gradient_accumulation_steps=4 in the above experiments; using the provided script (with ga=8) halves the throughput. We advise using a larger batch size with fewer gradient accumulation steps for better throughput. Regarding the complexity concern, the SVD is only performed once every update_proj_gap steps, so GaLore performs better in most cases during training.

@hiyouga I'm also confused about how GaLore can improve throughput without increasing batch_size. Actually, the paper mentions that GaLore "induces 17% overhead compared to 8-bit Adam implementation", and its Table 8 shows that 8-bit GaLore is slower than Adam8bit.

pkumc avatar Mar 08 '24 12:03 pkumc

@pkumc The previous results were indeed somewhat unfair. We have now adjusted the experimental setup and updated the results. When the rank is small (<128), GaLore still has better throughput; I guess this may be because GaLore involves fewer FLOPs during training. Regarding the data reported in the paper, we have discussed it with the author, and the difference may be due to different hardware with varying GEMM performance.

hiyouga avatar Mar 08 '24 18:03 hiyouga

@hiyouga Thanks for the update. I feel the current data make more sense.

For future readers' reference, my preliminary experience aligns well with the data reported in https://github.com/jiaweizzhao/GaLore/issues/3#issuecomment-1985411364

yongchanghao avatar Mar 08 '24 20:03 yongchanghao

Hello everyone! I'm also experimenting with the GaLore optimizer and the loss curves look great! But I don't see any benefit in memory usage; GaLore even uses a bit more.

I compare a TinyLlama full-parameter finetune with the same batch size and learning rate, FA2 enabled: GaLoreAdamW8bit (🟢) versus adamw_8bit (🔵). I'm not sure whether the optimizer parameters I picked (scale=1, rank=1024, update_proj_gap=200) make sense. The loss looks good with these settings, but:

  • GaLoreAdamW8bit should use less memory right?
  • and, should it not also be faster?

Code:

# imports needed to run this snippet standalone
import uuid

import torch
from torch import nn
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, TrainingArguments,
                          get_constant_schedule, set_seed)
from trl import SFTTrainer
from galore_torch import GaLoreAdamW8bit

use_galore = True
lr = 1e-5
modelpath = "../models/TinyLlama-1.1B-intermediate-step-1431k-3T"
set_seed(42)
run_id = f"use_galore-{use_galore}-{str(uuid.uuid4())}"

model = AutoModelForCausalLM.from_pretrained(
    modelpath,    
    torch_dtype=torch.bfloat16,
    attn_implementation = "flash_attention_2",  
    use_cache = False,
)
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast = False)
train_dataset = load_dataset('imdb', split='train')

def load_galore_optimizer(model, target_modules_list=["attn", "mlp"]):
    # collect the weights of attention/MLP Linear layers for GaLore;
    # all remaining parameters go into a regular (non-GaLore) group
    galore_params = []
    for module_name, module in model.named_modules():
        if not isinstance(module, nn.Linear): continue
        if not any(target_key in module_name for target_key in target_modules_list): continue
        galore_params.append(module.weight)
        print(module_name)
    id_galore_params = {id(p) for p in galore_params}
    regular_params = [p for p in model.parameters() if id(p) not in id_galore_params]

    param_groups = [
        dict(params=regular_params),
        dict(
            params=galore_params,
            rank=1024,
            update_proj_gap=200,
            scale=1,
            proj_type="std",
        ),
    ]
    optimizer = GaLoreAdamW8bit(param_groups, lr=lr)
    scheduler = get_constant_schedule(optimizer)
    return optimizer, scheduler

args = TrainingArguments(
    output_dir = run_id,
    optim = "adamw_8bit",
    logging_steps = 1, 
    max_steps = 100,
    per_device_train_batch_size = 16,
    learning_rate = lr,
    lr_scheduler_type = "constant",
    gradient_checkpointing = True,
)

trainer = SFTTrainer(
    model = model, 
    train_dataset = train_dataset,
    dataset_text_field = 'text',
    max_seq_length = 512,
    optimizers = load_galore_optimizer(model) if use_galore else (None, None),
    args = args,
)
trainer.train()

What might be the issue here? Any input highly appreciated!

geronimi73 avatar Mar 13 '24 20:03 geronimi73

@geronimi73 I believe you need to apply GaLore layer by layer in order to save memory, as in https://github.com/jiaweizzhao/GaLore/blob/a6bc1650984b1c090a4e108d7c0e3109ee7ad844/torchrun_main.py#L334
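
For reference, the layer-wise pattern that link points to looks roughly like the sketch below. This is an illustrative paraphrase rather than the repo's exact code; it assumes a PyTorch version that provides Tensor.register_post_accumulate_grad_hook and, purely for brevity, applies GaLore to every 2-D weight instead of only the attention/MLP projections.

import torch
from galore_torch import GaLoreAdamW8bit

def attach_layerwise_galore(model, rank=128, update_proj_gap=200, scale=0.25, lr=1e-5):
    # One small optimizer per parameter: 2-D weights get GaLore, the rest plain AdamW.
    optimizer_dict = {}
    for p in model.parameters():
        if not p.requires_grad:
            continue
        if p.dim() == 2:
            optimizer_dict[p] = GaLoreAdamW8bit(
                [{"params": [p], "rank": rank, "update_proj_gap": update_proj_gap,
                  "scale": scale, "proj_type": "std"}],
                lr=lr,
            )
        else:
            optimizer_dict[p] = torch.optim.AdamW([p], lr=lr)

    def make_hook(param):
        def hook(*_):
            # Runs right after this parameter's gradient has been accumulated:
            # step its private optimizer and free the gradient immediately, so a
            # full-model gradient buffer is never kept alive for a global step().
            optimizer_dict[param].step()
            optimizer_dict[param].zero_grad()
        return hook

    for p in optimizer_dict:
        p.register_post_accumulate_grad_hook(make_hook(p))

With this pattern the training loop only calls loss.backward(); each parameter's optimizer steps inside its hook, which is presumably also what the "Retain grad: No" row in the first table of this thread corresponds to. In addition, with rank=1024 on a model whose hidden size is around 2048, the projected optimizer states are only about half the size of the full ones, so without the layer-wise trick the saving is small to begin with.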

Larryvrh avatar Mar 14 '24 00:03 Larryvrh

GaLore is way better than LoRA and its variants in terms of loss, based on my small-scale experiment, even though this was not a rigorous study and only serves as a preliminary test.

Hello, may I ask what data and model you used to achieve this result?

Leosgp avatar Mar 21 '24 15:03 Leosgp

This is not formal research. Although GaLore reduces memory usage, it is undeniable that it increased the training time by roughly a factor of three in my test, and such an increase is not friendly for LLM training.

This is the test code:

'''
# Install
conda create --name test python=3.11
conda activate test

export CUDA_HOME=xxxxxxx
export LD_LIBRARY_PATH=$CUDA_HOME"/lib64:$LD_LIBRARY_PATH"
export PATH=$CUDA_HOME"/bin:$PATH"
pip install -U transformers trl datasets
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install galore-torch

HF-supported optimizers:
['adamw_hf', 'adamw_torch', 'adamw_torch_fused', 'adamw_torch_xla', 'adamw_torch_npu_fused', 'adamw_apex_fused', 'adafactor', 'adamw_anyprecision', 'sgd', 'adagrad', 'adamw_bnb_8bit', 'adamw_8bit', 'lion_8bit', 'lion_32bit', 
'paged_adamw_32bit', 'paged_adamw_8bit', 'paged_lion_32bit', 'paged_lion_8bit', 'rmsprop', 'rmsprop_bnb', 'rmsprop_bnb_8bit', 'rmsprop_bnb_32bit', 
'galore_adamw', 'galore_adamw_8bit', 'galore_adafactor', 
'galore_adamw_layerwise', 'galore_adamw_8bit_layerwise', 'galore_adafactor_layerwise']

'''
import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl, time

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="adamw_hf",
    optim_target_modules=["attn", "mlp"]
)

model_id = "Qwen/Qwen1.5-0.5B"   
#model_id = "Qwen/Qwen1.5-4B"
#model_id = "Qwen/Qwen1.5-7B"
#model_id = "mistralai/Mistral-7B-v0.1"

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)  # randomly initialized weights: from_config builds the architecture without loading the pretrained checkpoint

trainer = trl.SFTTrainer(
    model=model, 
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

start_time = time.time()
trainer.train()
train_time = time.time()-start_time

print(f"=====================================================")
print(f"Time Used: {train_time:.2f} s")
print(f"memory_allocated: {torch.cuda.memory_allocated()/1024.0/1024.0:.2f} MB")
print(f"max_memory_allocated: {torch.cuda.max_memory_allocated()/1024.0/1024.0:.2f} MB")
print(f"memory_reserved: {torch.cuda.memory_reserved()/1024.0/1024.0:.2f} MB")
print(f"max_memory_reserved: {torch.cuda.max_memory_reserved()/1024.0/1024.0:.2f} MB")
print(f"free memory: {torch.cuda.mem_get_info()[0]/1024.0/1024.0:.2f} MB")
print(f"=====================================================")

WangRongsheng avatar Apr 05 '24 10:04 WangRongsheng

Thanks for providing your results, @WangRongsheng. We are working on efficiency optimization and you can expect a big throughput boost in the next version. For train_loss, did you tune the lr for GaLore?

jiaweizzhao avatar Apr 05 '24 19:04 jiaweizzhao

I will do it.

WangRongsheng avatar Apr 06 '24 12:04 WangRongsheng