Isotr0py
It's because the quantized models are missing the `"tie_word_embeddings": false` field in `llm_config`; you can add it manually in the model's `config.json`: https://huggingface.co/OpenGVLab/InternVL3-38B/blob/main/config.json#L93
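For reference, a minimal sketch of patching the file in a local checkout (the path here is a placeholder):

```python
import json

# Placeholder path to a local checkout of the quantized model.
config_path = "InternVL3-38B-quantized/config.json"

with open(config_path) as f:
    config = json.load(f)

# Restore the field that the quantization export dropped.
config.setdefault("llm_config", {})["tie_word_embeddings"] = False

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```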
> Is that something that I should dig into from an llm-compressor side, or is it just expected that quantized models may need some config fields copied over from the...
> transformers doesn't serialize config values that are equal to the default values

> I'm wondering how do we end up with different values for tie_word_embeddings in vLLM and in...
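A quick way to see the quoted behavior: `PretrainedConfig.to_diff_dict()`, which `save_pretrained()` uses when writing `config.json`, drops any value equal to the class default:

```python
from transformers import PretrainedConfig

# The base default for tie_word_embeddings is True, so an explicit True is
# omitted from the serialized diff while an explicit False is kept.
print("tie_word_embeddings" in PretrainedConfig(tie_word_embeddings=True).to_diff_dict())   # False
print("tie_word_embeddings" in PretrainedConfig(tie_word_embeddings=False).to_diff_dict())  # True
```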
> Maybe we can do with AutoConfig and reassign values for text config?

Sounds great! We have had to keep lots of modified custom configs in `vllm/transformers_utils/configs` for similar reasons,...
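One way to read that suggestion, as a rough sketch (only the `llm_config` field comes from the thread above; the rest is an assumption):

```python
from transformers import AutoConfig

# Load the composite config with AutoConfig, then reassign the text-config
# values that the serialized config.json lost.
config = AutoConfig.from_pretrained("OpenGVLab/InternVL3-38B", trust_remote_code=True)
config.llm_config.tie_word_embeddings = False
```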
@ywang96 OK, I will decouple them tonight. (Sorry that I don't have bandwidth during the day.)
Seems setting `temperature=0` can generate reasonable grounding outputs:

```
vllm serve OpenGVLab/InternVL3-2B --max-model-len 4096
```

```
$ python examples/online_serving/openai_chat_completion_client_for_multimodal.py
INFO 06-13 17:23:23 [__init__.py:244] Automatically detected platform cuda.
Chat completion output...
```
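In case it helps reproduce this, a minimal sketch of the client-side request with `temperature=0`, assuming a local server started with the command above (the image URL and prompt are placeholders):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the API key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL3-2B",
    temperature=0,  # greedy decoding keeps the grounding coordinates stable
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Locate the dog in the image with a bounding box."},
            {"type": "image_url", "image_url": {"url": "https://example.com/dog.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```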
This app is a little bit complicated and difficult to debug. Can you check whether this simple script generates correct grounding results on your served model?

```python3
import ast...
```
I'm afraid not, because GGUF support in vLLM depends on the GGUF interoperability in `transformers` (we rely on it to extract the hf_config from GGUF), and Deepseek and its...
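For context, this is the `transformers` interoperability in question: the HF config is reconstructed from GGUF metadata, which only works for architectures `transformers` knows how to map (a sketch; the repo and filename are placeholders):

```python
from transformers import AutoConfig

# Reconstruct an hf_config from GGUF metadata; architectures without this
# mapping in transformers (e.g. Deepseek at the time of this comment) fail here.
config = AutoConfig.from_pretrained(
    "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    gguf_file="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
)
print(config.model_type)
```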
I see, you can refer to the output of the CI here: https://github.com/vllm-project/vllm/actions/runs/11354659097/job/31582376192?pr=9013#step:5:18
@SunMarc `gguf` [v0.10.0](https://pypi.org/project/gguf/0.10.0/) has been released! I think we can start reviewing this PR!