AlpinDale

Results: 170 comments by AlpinDale

Great job @Isotr0py, sorry I was away for a while. Would it be better to directly port the dequantization kernels to vLLM instead of relying on the transformers integration? They...

Yeah, the kernels are CUDA-only (and they don't work with ROCm for now). It'd be great if this PR could be merged with the proper dequant kernels, so I...

My username is `alpindale`

As discussed privately with @Isotr0py, it may be best if we shipped a custom, up-to-date GGUF utility library. Currently, Aphrodite directly [bundles the code](https://github.com/PygmalionAI/aphrodite-engine/tree/main/aphrodite%2Fquantization%2Fgguf_utils); we can get away with it...

For the sentencepiece error, removing the mistral tokenizer mode flag seems to resolve it. As discussed earlier, I will be separating the Windows and Linux codepaths for the Marlin kernels...

Can you share your Docker command? Aphrodite should not be using Ray unless you launch the engine with `--worker-use-ray` or `--distributed-executor-backend=ray`.

Can you add `--distributed-executor-backend=mp` to the launch flags?
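For reference, a launch command with that flag might look like the sketch below. The entry-point module, model name, and tensor-parallel size are placeholders, not taken from the thread; adjust them to your setup:

```shell
# Hypothetical example: force the multiprocessing executor instead of Ray.
# The module path and --model value are assumptions; substitute your own.
python -m aphrodite.endpoints.openai.api_server \
    --model <your-model> \
    --tensor-parallel-size 2 \
    --distributed-executor-backend=mp
```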

As of #755, it's recommended to use `-q fpX` instead, where `X` is a number between 2 and 7.
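This is not Aphrodite's actual fpX kernel, but a toy sketch of what keeping only a few mantissa bits does to a weight's value. The function name is hypothetical; it only illustrates the precision loss as X shrinks:

```python
import math

def quantize_mantissa(x: float, mantissa_bits: int) -> float:
    """Toy illustration only: round x to the nearest float that keeps
    `mantissa_bits` fractional mantissa bits. NOT the real fpX kernel."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    m, e = math.frexp(abs(x))           # abs(x) == m * 2**e, 0.5 <= m < 1
    scale = 2.0 ** (mantissa_bits + 1)  # +1 for the implicit leading bit
    m_rounded = math.floor(m * scale + 0.5) / scale  # round to nearest
    return sign * math.ldexp(m_rounded, e)

# Fewer mantissa bits -> coarser representable values:
print(quantize_mantissa(0.3, 2))   # 0.3125
print(quantize_mantissa(0.3, 10))  # ~0.30005
```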

What FlashInfer version do you have installed?
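One way to check is via package metadata. The distribution names queried here are assumptions (FlashInfer has shipped under more than one name on PyPI), so treat this as a sketch:

```python
import importlib.metadata

# Report the installed FlashInfer version, if any.
# The package names below are assumptions, not confirmed in the thread.
for name in ("flashinfer", "flashinfer-python"):
    try:
        print(name, importlib.metadata.version(name))
        break
    except importlib.metadata.PackageNotFoundError:
        continue
else:
    print("FlashInfer is not installed")
```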

FlashInfer currently doesn't work with fine-tuned k/v scales. You'll need to use a checkpoint without KV cache quantization.