flashinfer
failed to dispatch head_dim 96
When loading vonjack/Phi-3-mini-4k-instruct-LLaMAfied with sglang using the following command, the error below occurs:

env CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=10 python -m sglang.launch_server --model-path vonjack/Phi-3-mini-4k-instruct-LLaMAfied --port 30000
server_args=ServerArgs(model_path='vonjack/Phi-3-mini-4k-instruct-LLaMAfied', tokenizer_path='vonjack/Phi-3-mini-4k-instruct-LLaMAfied', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='vonjack/Phi-3-mini-4k-instruct-LLaMAfied', chat_template=None, host='127.0.0.1', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.88, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=173762660, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu=0] Init nccl begin.
[gpu=0] Load weight begin. avail mem=94.87 GB
INFO 08-20 05:30:40 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.17it/s]
[gpu=0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=87.66 GB
[gpu=0] Memory pool end. avail mem=11.20 GB
[gpu=0] Capture cuda graph begin. This can take up to several minutes.
Process Process-1:
Initialization failed. controller_init_state: Traceback (most recent call last):
File "/root/miniconda3/envs/base2/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 371, in init_cuda_graphs
self.cuda_graph_runner.capture(batch_size_list)
File "/root/miniconda3/envs/base2/lib/python3.11/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 162, in capture
) = self.capture_one_batch_size(bs, forward)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/base2/lib/python3.11/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 214, in capture_one_batch_size
update_flashinfer_indices(
File "/root/miniconda3/envs/base2/lib/python3.11/site-packages/sglang/srt/model_executor/forward_batch_info.py", line 289, in update_flashinfer_indices
flashinfer_decode_wrapper.begin_forward(
File "/root/projects/fanfiction-go/python/hub/flashinfer/python/flashinfer/decode.py", line 539, in begin_forward
self._wrapper.begin_forward(
RuntimeError: BatchDecodeWithPagedKVCachePyTorchWrapper::BeginForward(at::Tensor, at::Tensor, at::Tensor, at::Tensor, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, float, at::Tensor, at::Tensor)::<lambda()>::<lambda()>::<lambda()> failed to dispatch head_dim 96
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/envs/base2/lib/python3.11/site-packages/sglang/srt/managers/controller_single.py", line 150, in start_controller_process
controller = ControllerSingle(
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/base2/lib/python3.11/site-packages/sglang/srt/managers/controller_single.py", line 84, in __init__
self.tp_server = ModelTpServer(
^^^^^^^^^^^^^^
File "/root/miniconda3/envs/base2/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 99, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/root/miniconda3/envs/base2/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 140, in __init__
self.init_cuda_graphs()
File "/root/miniconda3/envs/base2/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 373, in init_cuda_graphs
raise Exception(
Exception: Capture cuda graph failed: BatchDecodeWithPagedKVCachePyTorchWrapper::BeginForward(at::Tensor, at::Tensor, at::Tensor, at::Tensor, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, float, at::Tensor, at::Tensor)::<lambda()>::<lambda()>::<lambda()> failed to dispatch head_dim 96
@yzh119, I encountered the same issue. Since you are working on this, what remaining work is needed to resolve it?
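As a possible stopgap until head_dim 96 is supported by the flashinfer decode dispatch, note that the ServerArgs dump above includes disable_flashinfer and disable_cuda_graph toggles. The CLI flag names below are an assumption derived from those field names; passing them should route decoding to the non-flashinfer attention fallback and skip CUDA graph capture, avoiding this code path. This is an untested sketch, not a confirmed fix:

```shell
# Untested workaround sketch. Flag names are assumed from the
# disable_flashinfer / disable_cuda_graph fields in the ServerArgs log:
# they should make sglang use its fallback attention kernels instead of
# flashinfer, so the head_dim 96 dispatch is never reached.
env CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=10 \
  python -m sglang.launch_server \
    --model-path vonjack/Phi-3-mini-4k-instruct-LLaMAfied \
    --port 30000 \
    --disable-flashinfer \
    --disable-cuda-graph
```

This trades decode throughput for compatibility, so it is only a workaround while head_dim 96 support lands in flashinfer itself.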