lorax
Improve error handling in SGMV kernels
Any failure in SGMV comes back as `Request failed during generation: Server error: No suitable kernel. dtype=Half`.
From Discord:
I have tried a fine-tuned adapter for llama2-7b. I trained the model on the Predibase page, downloaded the adapter, and uploaded it to https://huggingface.co/marekk/Lemma-Llama-2-7b-Adapter/tree/main. Now I am trying to load this adapter on llama2-7b, but quantized. My args are: `[ "--model-id", "meta-llama/Llama-2-7b-hf", "--quantize", "bitsandbytes-fp4", "--max-batch-prefill-tokens", "1024"]`. The model without the adapter works fine, but when I try to use the adapter I get `Request failed during generation: Server error: No suitable kernel. dtype=Half`. Is there any way to use an adapter on a quantized version of the model?
Sounds like an error in the SGMV kernel that's being swallowed.
Suspect the issue may be hardware or environment related. Haven't been able to repro on A100 yet.
Regardless, we do need more helpful error messages.
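As a sketch of what a more helpful message could look like, the dispatch path could report the shapes and dtype it was asked for alongside what the fused kernels actually cover. The supported set and minimum size below are purely illustrative assumptions, not lorax's real kernel registry; the shape of the error message is the point:

```python
# Illustrative sketch only: SUPPORTED_DTYPES and MIN_HIDDEN are assumptions
# standing in for whatever the fused SGMV kernels were actually compiled for.
SUPPORTED_DTYPES = {"Half", "BFloat16"}
MIN_HIDDEN = 8  # hypothetical minimum hidden size for the fused path


def dispatch_sgmv(h_in: int, h_out: int, dtype: str) -> str:
    """Pick the fused SGMV kernel, or fail with an actionable message."""
    if dtype in SUPPORTED_DTYPES and h_in >= MIN_HIDDEN and h_out >= MIN_HIDDEN:
        return "fused_sgmv"
    # Instead of a bare "No suitable kernel", say what was requested,
    # what is supported, and what the user can check.
    raise RuntimeError(
        f"No suitable SGMV kernel for h_in={h_in} h_out={h_out} dtype={dtype}. "
        f"Fused kernels support dtypes {sorted(SUPPORTED_DTYPES)} and hidden "
        f"sizes >= {MIN_HIDDEN}; check that the adapter dtype matches the base "
        "model and that the GPU architecture is supported."
    )
```

Even if the constraints stay opaque to the user, echoing the requested shapes and dtype (as the warmup traceback below already does) makes reports like the ones in this thread much easier to triage.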
I am seeing this too when testing a QLoRA adapter tuned from a quantized model!
```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # conventional default
    task_type="CAUSAL_LM",
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```
Same problem when testing Qwen2 with its LoRA adapter; hope there is a solution.
I'm getting the same error during warmup:
```
2025-03-04T23:42:30.579711Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
...
RuntimeError: No suitable kernel. h_in=256 h_out=2048 dtype=BFloat16
```
I'm trying to start a LoRAX docker container on a machine with 4 A100s with this command:

```shell
sudo docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $(pwd)/data:/data \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 \
  ghcr.io/predibase/lorax:main \
  --model-id Qwen/Qwen2.5-72B-Instruct \
  --num-shard 4 \
  --quantize bitsandbytes-nf4
```