Liger-Kernel
No Significant Improvement Observed in Model Training Speed
I am trying to speed up inference and training of a mistralai/Mistral-Small-3.1-24B-Instruct-2503 model.
Simply replacing AutoModelForCausalLM with AutoLigerKernelForCausalLM does not lead to any improvement in sampling speed or memory usage. I am also using DeepSpeed for distributed training.
import torch
from liger_kernel.transformers import AutoLigerKernelForCausalLM

model = AutoLigerKernelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    torch_dtype=torch.float32,
    attn_implementation="sdpa",
)
I have also tried this with the same result:
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_mistral

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    torch_dtype=torch.float32,
    attn_implementation="sdpa",
)
apply_liger_kernel_to_mistral(
    rope=True,
    cross_entropy=False,
    fused_linear_cross_entropy=True,
    rms_norm=True,
    swiglu=True,
    model=model,
)
Am I missing anything? Should I expect to see the speedup and memory savings during autoregressive generation (sampling), in the backward pass, or in both? Thanks for any help.
########################
Python version: 3.12.9
PyTorch version: 2.6.0+cu124
CUDA version: 12.4
Triton version: 3.2.0
Transformers version: 4.51.1
DeepSpeed version: 0.15.4
Hi, you should see the performance gains during training (forward + backward passes).
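For reference, here is a minimal sketch of the kind of training step where those gains should show up; the batch shape, dtype, and the plain loss.backward() call (rather than a DeepSpeed engine step) are assumptions for illustration only. In particular, the fused linear cross entropy kernel only applies when labels are passed so the loss is computed inside the model's forward pass:

import torch

# Dummy batch (shapes are made up for illustration).
input_ids = torch.randint(
    0, model.config.vocab_size, (1, 512), device=model.device
)
labels = input_ids.clone()

model.train()

# Passing labels makes the model compute the loss inside forward(), which is
# where the fused linear cross entropy kernel can reduce peak memory; the
# forward + backward pass of a training step is where the Liger kernels'
# gains are expected to appear, not during generate().
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()

# Optional: compare peak memory for this step with and without the Liger patch.
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")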