
Improve error handling in SGMV kernels

Open tgaddair opened this issue 1 year ago • 4 comments

Any failure in SGMV comes back as Request failed during generation: Server error: No suitable kernel. dtype=Half

From Discord:

I have tried fine-tuning an adapter for llama2-7b. I trained the model on the Predibase page, downloaded the adapter, and placed it at https://huggingface.co/marekk/Lemma-Llama-2-7b-Adapter/tree/main. Now I am trying to load this adapter on llama2-7b, but quantized. My args are: [ "--model-id", "meta-llama/Llama-2-7b-hf", "--quantize", "bitsandbytes-fp4", "--max-batch-prefill-tokens", "1024"]. The model without the adapter works fine, but when I try to use the adapter I get Request failed during generation: Server error: No suitable kernel. dtype=Half. Is there any way to use an adapter on a quantized version of the model?

Sounds like an error in the SGMV kernel that's being swallowed.

tgaddair avatar Mar 12 '24 21:03 tgaddair

Suspect the issue may be hardware or environment related. Haven't been able to repro on A100 yet.

Regardless, we do need more helpful error messages.
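As a hypothetical sketch (not the actual LoRAX dispatch code), the kernel selection fallback could report the parameters that drove dispatch, so users get an actionable message instead of a bare "No suitable kernel". The function name, supported-dtype set, and divisibility constraint below are illustrative assumptions only:

```python
# Illustrative sketch of a more informative SGMV dispatch error.
# SUPPORTED_DTYPES and the multiple-of-16 constraint are assumptions
# for demonstration, not the real kernel requirements.
SUPPORTED_DTYPES = {"Half", "BFloat16"}

def dispatch_sgmv(h_in: int, h_out: int, dtype: str) -> str:
    """Pick a kernel name, raising a descriptive error when none fits."""
    if dtype not in SUPPORTED_DTYPES:
        raise RuntimeError(
            f"No suitable SGMV kernel: dtype={dtype} is not supported "
            f"(supported: {sorted(SUPPORTED_DTYPES)}); "
            f"h_in={h_in} h_out={h_out}"
        )
    if h_in % 16 != 0 or h_out % 16 != 0:
        raise RuntimeError(
            f"No suitable SGMV kernel: h_in={h_in} h_out={h_out} "
            f"must both be multiples of 16 for dtype={dtype}"
        )
    return f"sgmv_{dtype.lower()}"
```

Including `h_in`, `h_out`, and the quantization mode in the message would have made the reports below much easier to triage.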

tgaddair avatar Mar 12 '24 22:03 tgaddair

I am seeing this too when testing a qlora adapter tuned from a quantized model!

import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
[screenshot: same "No suitable kernel" error]

SamComber avatar Apr 06 '24 12:04 SamComber

Same problem when testing Qwen2 with its LoRA adapter. Hope there is a solution.

estuday avatar Aug 08 '24 07:08 estuday

I'm getting the same error during warmup:

2025-03-04T23:42:30.579711Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
    ...
RuntimeError: No suitable kernel. h_in=256 h_out=2048 dtype=BFloat16

I'm trying to start a Lorax docker container on a machine with 4 A100s with this command:

sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $(pwd)/data:/data -e HF_HUB_ENABLE_HF_TRANSFER=1 ghcr.io/predibase/lorax:main --model-id Qwen/Qwen2.5-72B-Instruct --num-shard 4 --quantize bitsandbytes-nf4
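Until the error message improves, a hypothetical pre-flight check can surface the adapter's shape details locally before the server swallows them. This sketch assumes the standard PEFT `adapter_config.json` layout; the helper name `summarize_adapter` is made up for illustration:

```python
# Hypothetical pre-flight check: read the PEFT adapter_config.json and
# summarize the fields most relevant to kernel dispatch (rank, targeted
# modules, base model), so failures can be reported with concrete details.
import json

def summarize_adapter(config_path: str) -> dict:
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "r": cfg.get("r"),
        "lora_alpha": cfg.get("lora_alpha"),
        "target_modules": sorted(cfg.get("target_modules", [])),
        "base_model": cfg.get("base_model_name_or_path"),
    }
```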

Zollerboy1 avatar Mar 04 '25 23:03 Zollerboy1