Isotr0py
@AlpinDale I agree that we can directly port the [Aphrodite Engine](https://github.com/PygmalionAI/aphrodite-engine)'s dequantization kernels to vLLM. But I think we can also keep the transformers-integration dequantization path for the CPU backend until...
@AlpinDale I think we can discuss this further on Discord. How can I reach you there?
Currently, installing `gguf` from PyPI only gets `gguf==0.6.0`, which is an old release from months ago. However, imatrix quantization requires the newest version, which needs to be installed from...
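For reference, a version guard along these lines could catch the old release early (a rough sketch; the exact minimum version needed for imatrix support is an assumption on my part):

```python
# Rough sketch: fail fast if the gguf package installed from PyPI is the old
# 0.6.0 release. The exact minimum version needed for imatrix quantization is
# an assumption here, not a verified requirement.
from importlib.metadata import version

from packaging.version import Version

if Version(version("gguf")) <= Version("0.6.0"):
    raise RuntimeError(
        "The installed gguf package is too old for imatrix-quantized models; "
        "please install the latest gguf-py from source.")
```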
Nice! I will check it out and add tests for qwen2 and imatrix!
@mgoin Could you please take a look at this once again? The way to handle `get_quant_method` for the vocab embedding is confusing me. Could you give some suggestions about this? Thanks!
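For context, the dispatch I have in mind looks roughly like this (a sketch only; `GGUFLinearMethod` and `GGUFEmbeddingMethod` are placeholder names for whatever the quant methods end up being called):

```python
# Rough sketch of how GGUFConfig.get_quant_method could dispatch on the layer
# type. GGUFLinearMethod / GGUFEmbeddingMethod are illustrative placeholders.
from vllm.model_executor.layers.linear import LinearBase
from vllm.model_executor.layers.vocab_parallel_embedding import (
    VocabParallelEmbedding)


def get_quant_method(self, layer, prefix: str):
    if isinstance(layer, LinearBase):
        return GGUFLinearMethod(self)
    # The confusing part: the vocab embedding also carries GGUF-quantized
    # weights, so it cannot fall back to the default unquantized embedding
    # path and needs its own method.
    if isinstance(layer, VocabParallelEmbedding):
        return GGUFEmbeddingMethod(self)
    return None
```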
Tensor parallelism doesn't work yet, because we haven't considered the distributed case with `tp_size` and `tp_rank` when modifying the `weight_loader` for GGUF quantization. I will try to fix the tensor...
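To make the missing piece concrete, the loader would need to slice each tensor by rank, roughly like this (a simplified sketch that ignores the GGUF block packing, which is exactly what makes the real fix non-trivial):

```python
# Simplified sketch of sharding a loaded weight along its output dimension by
# tensor-parallel rank. Real GGUF tensors are block-packed, so the shard
# boundaries must also be aligned to quantization blocks, which this ignores.
import torch

from vllm.distributed import (get_tensor_model_parallel_rank,
                              get_tensor_model_parallel_world_size)


def shard_for_tp(loaded_weight: torch.Tensor,
                 output_dim: int = 0) -> torch.Tensor:
    tp_rank = get_tensor_model_parallel_rank()
    tp_size = get_tensor_model_parallel_world_size()
    shard_size = loaded_weight.shape[output_dim] // tp_size
    return loaded_weight.narrow(output_dim, tp_rank * shard_size, shard_size)
```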
OK, I have added a check to raise an exception for `tp_size > 1` when initializing `GGUFConfig`.
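The guard is essentially this (a sketch; where exactly it lives inside `GGUFConfig` may differ in the final diff):

```python
# Sketch of the guard added during GGUFConfig initialization: GGUF
# quantization only supports tensor_parallel_size == 1 for now.
from vllm.distributed import get_tensor_model_parallel_world_size

if get_tensor_model_parallel_world_size() > 1:
    raise ValueError("GGUF quantization doesn't support tensor parallelism "
                     "yet, please use tensor_parallel_size=1.")
```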
@vbiral Thanks for reporting! It seems that `gguf_to_hf_name_map` didn't handle `rope_freqs` correctly. I will take a look and fix it.
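My current guess at the shape of the fix (purely hypothetical until I've confirmed the culprit; `rope_freqs.weight` being the offending GGUF-only tensor is an assumption):

```python
# Hypothetical fix: drop GGUF-side tensors that have no HF counterpart (such
# as rope_freqs.weight) from the name map instead of letting them be mapped
# onto model weights.
def filter_gguf_only_tensors(name_map: dict[str, str]) -> dict[str, str]:
    gguf_only_tensors = {"rope_freqs.weight"}
    return {
        gguf_name: hf_name
        for gguf_name, hf_name in name_map.items()
        if gguf_name not in gguf_only_tensors
    }
```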
I think we should also add an audio example for phi-4-mm, since it supports audio inputs as well.
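Something along these lines, mirroring the existing audio examples, should work (a sketch; I haven't verified the exact prompt placeholder Phi-4-multimodal expects, so treat the template below as an assumption):

```python
# Sketch of an offline audio example for Phi-4-multimodal, mirroring the
# existing audio_language examples. The prompt placeholder and sampling
# parameters are assumptions, not the verified template for this model.
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(model="microsoft/Phi-4-multimodal-instruct",
          trust_remote_code=True,
          max_model_len=4096,
          limit_mm_per_prompt={"audio": 1})

audio_and_sr = AudioAsset("mary_had_lamb").audio_and_sample_rate

prompt = "<|user|><|audio_1|>Transcribe this audio clip.<|end|><|assistant|>"
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": [audio_and_sr]},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```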
I haven't had a machine to test the 38B model yet. Can you check whether smaller models like 8B/14B also have this issue?