AlpinDale

Results: 170 comments by AlpinDale

Great job @Isotr0py, sorry I was away for a while. Would it be better to directly port the dequantization kernels to vLLM instead of relying on the transformers integration? They...

Yeah, the kernels are CUDA-only (and they don't work with ROCm for now). It'd be great if this PR could be merged with the proper dequant kernels, so I...

My username is `alpindale`

As discussed privately with @Isotr0py, it may be best if we shipped a custom, up-to-date GGUF utility library. Currently, Aphrodite directly [bundles the code](https://github.com/PygmalionAI/aphrodite-engine/tree/main/aphrodite%2Fquantization%2Fgguf_utils); we can get away with it...

For the sentencepiece error, removing the mistral tokenizer mode flag seems to resolve it. As discussed earlier, I will be separating the Windows and Linux codepaths for the Marlin kernels...

Can you share your Docker command? Aphrodite should not be using Ray unless you launch the engine with `--worker-use-ray` or `--distributed-executor-backend=ray`.

Can you add `--distributed-executor-backend=mp` to the launch flags?
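For reference, a launch command with that flag might look like the sketch below. The entry-point module, model name, and tensor-parallel size are placeholders, not taken from the thread; adjust them to your setup:

```shell
# Hypothetical example: force the multiprocessing executor instead of Ray.
# The module path and --model value are assumptions; substitute your own.
python -m aphrodite.endpoints.openai.api_server \
    --model <your-model> \
    --tensor-parallel-size 2 \
    --distributed-executor-backend=mp
```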

As of #755, it's recommended to use `-q fpX` instead, where `X` is a number between 2 and 7.
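This is not Aphrodite's actual fpX kernel, but a toy sketch of what keeping only a few mantissa bits does to a weight's value. The function name is hypothetical; it only illustrates the precision loss as X shrinks:

```python
import math

def quantize_mantissa(x: float, mantissa_bits: int) -> float:
    """Toy illustration only: round x to the nearest float that keeps
    `mantissa_bits` fractional mantissa bits. NOT the real fpX kernel."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    m, e = math.frexp(abs(x))           # abs(x) == m * 2**e, 0.5 <= m < 1
    scale = 2.0 ** (mantissa_bits + 1)  # +1 for the implicit leading bit
    m_rounded = math.floor(m * scale + 0.5) / scale  # round to nearest
    return sign * math.ldexp(m_rounded, e)

# Fewer mantissa bits -> coarser representable values:
print(quantize_mantissa(0.3, 2))   # 0.3125
print(quantize_mantissa(0.3, 10))  # ~0.30005
```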

What FlashInfer version do you have installed?
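One way to check is via package metadata. The distribution names queried here are assumptions (FlashInfer has shipped under more than one name on PyPI), so treat this as a sketch:

```python
import importlib.metadata

# Report the installed FlashInfer version, if any.
# The package names below are assumptions, not confirmed in the thread.
for name in ("flashinfer", "flashinfer-python"):
    try:
        print(name, importlib.metadata.version(name))
        break
    except importlib.metadata.PackageNotFoundError:
        continue
else:
    print("FlashInfer is not installed")
```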

FlashInfer currently doesn't work with fine-tuned k/v scales. You'll need to use a checkpoint without KV cache quantization.