
Improve error handling in SGMV kernels

Open tgaddair opened this issue 1 year ago • 4 comments

Any failure in SGMV comes back as Request failed during generation: Server error: No suitable kernel. dtype=Half

From Discord:

I have tried fine-tuning an adapter for llama2-7b. I trained the model on the Predibase page, downloaded the adapter, and placed it at https://huggingface.co/marekk/Lemma-Llama-2-7b-Adapter/tree/main. Now I am trying to load this adapter on llama2-7b, but quantized. My args are: [ "--model-id", "meta-llama/Llama-2-7b-hf", "--quantize", "bitsandbytes-fp4", "--max-batch-prefill-tokens", "1024"]. The model without the adapter works fine, but when I try to use the adapter I get Request failed during generation: Server error: No suitable kernel. dtype=Half. Is there any way to use an adapter on a quantized version of the model?

Sounds like an error in the SGMV kernel that's being swallowed.

tgaddair avatar Mar 12 '24 21:03 tgaddair

Suspect the issue may be hardware or environment related. Haven't been able to repro on A100 yet.

Regardless, we do need more helpful error messages.
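As a hypothetical sketch (not the actual LoRAX dispatch code), the kernel selection fallback could report the parameters that drove dispatch, so users get an actionable message instead of a bare "No suitable kernel". The function name, supported-dtype set, and divisibility constraint below are illustrative assumptions only:

```python
# Illustrative sketch of a more informative SGMV dispatch error.
# SUPPORTED_DTYPES and the multiple-of-16 constraint are assumptions
# for demonstration, not the real kernel requirements.
SUPPORTED_DTYPES = {"Half", "BFloat16"}

def dispatch_sgmv(h_in: int, h_out: int, dtype: str) -> str:
    """Pick a kernel name, raising a descriptive error when none fits."""
    if dtype not in SUPPORTED_DTYPES:
        raise RuntimeError(
            f"No suitable SGMV kernel: dtype={dtype} is not supported "
            f"(supported: {sorted(SUPPORTED_DTYPES)}); "
            f"h_in={h_in} h_out={h_out}"
        )
    if h_in % 16 != 0 or h_out % 16 != 0:
        raise RuntimeError(
            f"No suitable SGMV kernel: h_in={h_in} h_out={h_out} "
            f"must both be multiples of 16 for dtype={dtype}"
        )
    return f"sgmv_{dtype.lower()}"
```

Including `h_in`, `h_out`, and the quantization mode in the message would have made the reports below much easier to triage.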

tgaddair avatar Mar 12 '24 22:03 tgaddair

I am seeing this too when testing a qlora adapter tuned from a quantized model!

import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
[screenshot: same "No suitable kernel" error]

SamComber avatar Apr 06 '24 12:04 SamComber

Same problem when testing Qwen2 with its LoRA adapter. Hope there is a solution.

estuday avatar Aug 08 '24 07:08 estuday

I'm getting the same error during warmup:

2025-03-04T23:42:30.579711Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
    ...
RuntimeError: No suitable kernel. h_in=256 h_out=2048 dtype=BFloat16

I'm trying to start a Lorax docker container on a machine with 4 A100s with this command:

sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $(pwd)/data:/data -e HF_HUB_ENABLE_HF_TRANSFER=1 ghcr.io/predibase/lorax:main --model-id Qwen/Qwen2.5-72B-Instruct --num-shard 4 --quantize bitsandbytes-nf4
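Until the error message improves, a hypothetical pre-flight check can surface the adapter's shape details locally before the server swallows them. This sketch assumes the standard PEFT `adapter_config.json` layout; the helper name `summarize_adapter` is made up for illustration:

```python
# Hypothetical pre-flight check: read the PEFT adapter_config.json and
# summarize the fields most relevant to kernel dispatch (rank, targeted
# modules, base model), so failures can be reported with concrete details.
import json

def summarize_adapter(config_path: str) -> dict:
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "r": cfg.get("r"),
        "lora_alpha": cfg.get("lora_alpha"),
        "target_modules": sorted(cfg.get("target_modules", [])),
        "base_model": cfg.get("base_model_name_or_path"),
    }
```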

Zollerboy1 avatar Mar 04 '25 23:03 Zollerboy1