Liger-Kernel
No Significant Improvement Observed in Model Training Speed
I am trying to speed up inference and training of a mistralai/Mistral-Small-3.1-24B-Instruct-2503 model.
Simply replacing AutoModelForCausalLM with AutoLigerKernelForCausalLM does not lead to any improvement in sampling speed or memory usage. I am also using DeepSpeed for distributed training.
import torch
from liger_kernel.transformers import AutoLigerKernelForCausalLM

model = AutoLigerKernelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    torch_dtype=torch.float32,
    attn_implementation="sdpa",
)
I have also tried this with the same result:
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_mistral

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    torch_dtype=torch.float32,
    attn_implementation="sdpa",
)
apply_liger_kernel_to_mistral(
    rope=True,
    cross_entropy=False,
    fused_linear_cross_entropy=True,
    rms_norm=True,
    swiglu=True,
    model=model,
)
Am I missing anything? Should I expect to see the speedup and memory savings during autoregressive generation (sampling), in the backward pass, or in both? Thanks for any help.
########################
Python version: 3.12.9
PyTorch version: 2.6.0+cu124
CUDA version: 12.4
Triton version: 3.2.0
Transformers version: 4.51.1
DeepSpeed version: 0.15.4
Hi, you should see the performance gains during training (forward + backward passes).
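For reference, here is a minimal sketch of the kind of training step where those gains should show up; the batch shape, dtype, and the plain loss.backward() call (rather than a DeepSpeed engine step) are assumptions for illustration only. In particular, the fused linear cross entropy kernel only applies when labels are passed so the loss is computed inside the model's forward pass:

import torch

# Dummy batch (shapes are made up for illustration).
input_ids = torch.randint(
    0, model.config.vocab_size, (1, 512), device=model.device
)
labels = input_ids.clone()

model.train()

# Passing labels makes the model compute the loss inside forward(), which is
# where the fused linear cross entropy kernel can reduce peak memory; the
# forward + backward pass of a training step is where the Liger kernels'
# gains are expected to appear, not during generate().
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()

# Optional: compare peak memory for this step with and without the Liger patch.
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")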