No Significant Improvement Observed in Model Training Speed

Open lianghsun opened this issue 1 year ago • 8 comments

🐛 Describe the bug

I am training the meta-llama/Llama-3.2-1B model using LLaMA-Factory with the following YAML configuration:

### model
model_name_or_path: meta-llama/Llama-3.2-1B

### method
stage: pt
do_train: true
do_eval: true
finetuning_type: full
deepspeed: /path/to/ds_z3_config.json
use_liger_kernel: true
enable_liger_kernel: true

### dataset
dataset: /path/to/dir/
eval_dataset: /path/to/dir/
template: llama3
cutoff_len: 4000
max_samples: 30000000000
overwrite_cache: true
preprocessing_num_workers: 64
preprocessing_batch_size: 60000
tokenized_path: /path/to/dir/

### output
output_dir: /path/to/dir/
logging_steps: 1
save_steps: 5
plot_loss: true
overwrite_output_dir: true
save_total_limit: 8

### train
per_device_train_batch_size: 94
gradient_accumulation_steps: 32
learning_rate: 5.0e-5
num_train_epochs: 10
lr_scheduler_type: cosine
optim: adamw_torch_fused
warmup_ratio: 0.01
weight_decay: 0.1
bf16: true
ddp_timeout: 1080
ddp_find_unused_parameters: false
max_grad_norm: 1.0
seed: 42
dataloader_num_workers: 64
packing: true
flash_attn: auto

### eval
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 10

However, enabling or disabling liger_kernel does not lead to any noticeable reduction in training time; the runtime metrics are nearly identical in both cases. Are there settings in my YAML configuration that might be preventing liger_kernel from taking effect? Thanks :(
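To make the comparison quantitative rather than wall-clock-only, it can help to express both runs as tokens/sec and compute the relative speedup. This is a minimal sketch using the batch settings from the YAML above; the step counts, timings, and GPU count are placeholders, not measurements.

```python
# Sketch: compare two training runs as tokens/sec instead of raw runtime.
# All timings below are placeholders; substitute your own logged values.

def throughput(tokens_per_step: int, steps: int, seconds: float) -> float:
    """Training throughput in tokens per second."""
    return tokens_per_step * steps / seconds

def speedup(baseline_tps: float, liger_tps: float) -> float:
    """Relative speedup of the Liger run over the baseline (1.0 = no change)."""
    return liger_tps / baseline_tps

# per_device_train_batch_size=94, cutoff_len=4000, assumed 8 GPUs,
# gradient_accumulation_steps=32 -> tokens consumed per optimizer step
tokens_per_step = 94 * 4000 * 8 * 32

baseline_tps = throughput(tokens_per_step, steps=10, seconds=1200.0)  # placeholder
liger_tps = throughput(tokens_per_step, steps=10, seconds=1150.0)     # placeholder
print(f"speedup: {speedup(baseline_tps, liger_tps):.3f}x")  # prints "speedup: 1.043x"
```

A speedup near 1.0x with packing enabled and a large batch is plausible when the run is bound by something other than the fused kernels (e.g. data loading or ZeRO-3 communication), so measuring tokens/sec helps locate the bottleneck.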

Reproduce

  1. Use the YAML configuration provided above.
  2. Train the meta-llama/Llama-3.2-1B model with and without liger_kernel:
     llamafactory-cli train /path/to/above/yaml
  3. Compare training times and throughput metrics.
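Before comparing timings, it is worth confirming that the Liger patches were actually applied to the instantiated model. A common check is to look for Liger-patched module classes (e.g. LigerRMSNorm) among the model's modules. The helper below is a plain-Python sketch of that check; the module names used as input are illustrative placeholders, and the assumption is that Liger's patched classes contain "Liger" in their class names.

```python
# Sketch: detect whether Liger-patched modules are present in a model.
# Assumption: Liger's replacement classes carry "Liger" in their names
# (e.g. LigerRMSNorm). The names below are placeholders for illustration.

def liger_modules(module_class_names):
    """Return the sorted subset of class names that look Liger-patched."""
    return sorted({name for name in module_class_names if "Liger" in name})

# In a real run, collect the names from the loaded model:
#   names = {type(m).__name__ for m in model.modules()}
names = {"LlamaAttention", "LigerRMSNorm", "LlamaMLP"}
print(liger_modules(names))  # prints "['LigerRMSNorm']"
```

If this set is empty during training, the flag is not reaching the model-loading path, and no kernel-level speedup should be expected regardless of the other settings.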

Versions

Environment Report:

Operating System: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
Python version: 3.12.7
PyTorch version: 2.5.1+cu124
CUDA version: 12.4
Triton version: 3.1.0
Transformers version: 4.46.1

lianghsun · Nov 27 '24 09:11