Isotr0py
Can you try to add `--tokenizer=microsoft/Phi-4-mini-instruct` when serving the model? I suspect the Phi-4 tokenizer conversion is broken on the transformers side.
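For reference, a minimal sketch of a serve command with the tokenizer override (the local GGUF path is just a placeholder):

```bash
# Load the GGUF weights but take the tokenizer from the original HF repo,
# bypassing the tokenizer converted from the GGUF metadata.
vllm serve /path/to/phi-4-mini-instruct-q4.gguf \
    --tokenizer microsoft/Phi-4-mini-instruct \
    --max-model-len 4096
```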
Can you try this command for serving? I can generate reasonable outputs on the main branch with the q4 checkpoint in [microsoft/phi-4-gguf](https://huggingface.co/microsoft/phi-4-gguf): ``` vllm serve /tmp/phi-4-q4.gguf --max-model-len 4096 --dtype half --tokenizer microsoft/phi-4...
I also tried [Q6_K](https://huggingface.co/unsloth/phi-4-GGUF/blob/main/phi-4-Q6_K.gguf) but still can't reproduce the CUDA index error. Could you tell me which Q6 checkpoint you are using? ``` vllm serve /tmp/phi-4-Q6_K.gguf --max-model-len 4096...
There were some issues with HF's opt repo yesterday, which should have been fixed by now. I think re-running these CIs should be fine.
Please address pre-commit linting errors as well.
> Somehow the token is not handled properly during the profiling phase of vLLM. Can you point me in the right direction on how multimodal processing is done in vLLM? Because...
I would like to work on this model. However, it seems that `persimmon`, which `Fuyu-8B` uses as its language model, hasn't been supported yet. Maybe we can support it first.
We haven't supported `gguf` quantization on the CPU backend yet. You can try installing vLLM with the GPU backend.
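A minimal sketch, assuming a CUDA-capable environment (the default PyPI wheel targets the GPU backend):

```bash
# Install the default vLLM wheel, which is built for the CUDA (GPU) backend.
pip install vllm
```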
The released 0.5.4 version doesn't include GGUF support yet. You can build from source or install the latest nightly wheel: ```bash export VLLM_VERSION=0.5.4 # vLLM's main branch version is...
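If building from source instead, a minimal sketch following the standard vLLM source install (exact steps may vary with your environment):

```bash
# Build and install vLLM from the main branch, which includes GGUF support.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```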
@AlpinDale Thanks! I'm glad to push this forward by adding the quant kernels! I'm not familiar with the quantization in `ggml`, and it's difficult for me to implement the `mmq`/`mmvq` ops....