AlpinDale
At the moment, FP8 doesn't work with chunked prefill/context shifting. There's work underway in [this branch](https://github.com/PygmalionAI/aphrodite-engine/tree/feat/fp8-chunked) to address this.
Can you test with the latest release?
I will take a closer look, but FYI, exl2 quants do not work with multi-GPU setups. It's the only quant format with that limitation.
That would be the `-tp 2` flag in your command. Please see [here](https://github.com/PygmalionAI/aphrodite-engine/wiki/3.-Engine-Options) for a full list of the engine options and what they do.
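For reference, a minimal launch sketch (the model path is a placeholder, and flag spellings may differ slightly between releases):

```sh
# Launch the OpenAI-compatible server, splitting the model across 2 GPUs
# via tensor parallelism (-tp is shorthand for --tensor-parallel-size).
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  -tp 2
```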
You can probably remove the `modeling_sparsetral` references from the model's config.json; it may work then, but it'll skip all the MoE stuff. I imagine the same is happening with that exl2 quant,...
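For illustration, sparsetral-style uploads usually wire their custom code in through an `auto_map` block in config.json roughly like the one below (the exact keys and class names may differ per upload); deleting those entries makes the loader fall back to the plain base-model implementation, without the MoE adapters:

```json
{
  "auto_map": {
    "AutoConfig": "configuration_sparsetral.SparsetralConfig",
    "AutoModelForCausalLM": "modeling_sparsetral.SparsetralForCausalLM"
  }
}
```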
Works fine with the FP16 model. Can you link me to the gguf if it's public?
Ah, I see what the issue is. We're using a custom GGUF model parser in aphrodite, which means support has to be hand-written and implemented for every model arch....
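To illustrate the kind of work involved (this is not Aphrodite's actual code): each architecture needs a hand-written table translating the standard GGUF tensor names into the names the HF-style model code expects, roughly like this Llama-flavoured sketch:

```python
# Illustrative only: per-architecture GGUF -> HF tensor-name mapping.
# Standard GGUF names on the left, Llama-style HF names on the right.
LLAMA_GGUF_TO_HF = {
    "token_embd.weight": "model.embed_tokens.weight",
    "output_norm.weight": "model.norm.weight",
    "output.weight": "lm_head.weight",
    # Per-layer tensors; {i} is the layer index.
    "blk.{i}.attn_q.weight": "model.layers.{i}.self_attn.q_proj.weight",
    "blk.{i}.attn_k.weight": "model.layers.{i}.self_attn.k_proj.weight",
    "blk.{i}.attn_v.weight": "model.layers.{i}.self_attn.v_proj.weight",
    "blk.{i}.ffn_gate.weight": "model.layers.{i}.mlp.gate_proj.weight",
}

def hf_name(gguf_name: str) -> str:
    """Map a concrete GGUF name like 'blk.3.attn_q.weight' to its HF
    equivalent; raises KeyError for unmapped tensors, which is why
    every new arch needs its own table."""
    parts = gguf_name.split(".")
    if parts[0] == "blk":
        template = f"blk.{{i}}.{'.'.join(parts[2:])}"
        return LLAMA_GGUF_TO_HF[template].format(i=parts[1])
    return LLAMA_GGUF_TO_HF[gguf_name]
```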
@bash99 we have a PR open to fix this and support arbitrary GGUF models.
Unfortunately, the install condition for the punica and hadamard kernels used a comparison sign facing the wrong way. Fixed in the latest commit to dev.
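In other words, the gate was inverted, so the kernels were skipped exactly when they should have been built. A hypothetical sketch of the bug class (the function name and version bound are made up for illustration):

```python
from packaging.version import Version

MIN_CUDA_FOR_KERNELS = Version("11.8")  # hypothetical bound

def should_build_kernels(cuda_version: str) -> bool:
    # Buggy: the comparison faced the wrong way, disabling the
    # punica/hadamard kernels on every supported CUDA version:
    #   return Version(cuda_version) < MIN_CUDA_FOR_KERNELS
    # Fixed:
    return Version(cuda_version) >= MIN_CUDA_FOR_KERNELS
```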
@sgsdxzy seems to me like an issue with parallelizing the lm_head. Does your PR fix this?