AlpinDale
At the moment, FP8 doesn't work with chunked prefill/context shifting. There's work underway in [this branch](https://github.com/PygmalionAI/aphrodite-engine/tree/feat/fp8-chunked) to address this.
Can you test with the latest release?
I will take a closer look, but FYI, exl2 quants do not work with multi-GPU setups. It's the only quant format with that limitation.
That would be the `-tp 2` flag in your command. Please see [here](https://github.com/PygmalionAI/aphrodite-engine/wiki/3.-Engine-Options) for a full list of the engine options and what they do.
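For reference, a minimal launch sketch (the model path is a placeholder, and flag spellings may differ slightly between releases):

```sh
# Launch the OpenAI-compatible server, splitting the model across 2 GPUs
# via tensor parallelism (-tp is shorthand for --tensor-parallel-size).
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  -tp 2
```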
You can probably remove the `modeling_sparsetral` references from the model's config.json; it may work then, but it'll skip all the MoE stuff. I imagine the same is happening with that exl2 quant,...
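For illustration, sparsetral-style uploads usually wire their custom code in through an `auto_map` block in config.json roughly like the one below (the exact keys and class names may differ per upload); deleting those entries makes the loader fall back to the plain base-model implementation, without the MoE adapters:

```json
{
  "auto_map": {
    "AutoConfig": "configuration_sparsetral.SparsetralConfig",
    "AutoModelForCausalLM": "modeling_sparsetral.SparsetralForCausalLM"
  }
}
```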
Works fine with the FP16 model. Can you link me to the gguf if it's public?
Ah, I see what the issue is. We're using a custom GGUF model parser in aphrodite, which means support has to be hand-written and implemented for every model arch....
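To illustrate the kind of work involved (this is not Aphrodite's actual code): each architecture needs a hand-written table translating the standard GGUF tensor names into the names the HF-style model code expects, roughly like this Llama-flavoured sketch:

```python
# Illustrative only: per-architecture GGUF -> HF tensor-name mapping.
# Standard GGUF names on the left, Llama-style HF names on the right.
LLAMA_GGUF_TO_HF = {
    "token_embd.weight": "model.embed_tokens.weight",
    "output_norm.weight": "model.norm.weight",
    "output.weight": "lm_head.weight",
    # Per-layer tensors; {i} is the layer index.
    "blk.{i}.attn_q.weight": "model.layers.{i}.self_attn.q_proj.weight",
    "blk.{i}.attn_k.weight": "model.layers.{i}.self_attn.k_proj.weight",
    "blk.{i}.attn_v.weight": "model.layers.{i}.self_attn.v_proj.weight",
    "blk.{i}.ffn_gate.weight": "model.layers.{i}.mlp.gate_proj.weight",
}

def hf_name(gguf_name: str) -> str:
    """Map a concrete GGUF name like 'blk.3.attn_q.weight' to its HF
    equivalent; raises KeyError for unmapped tensors, which is why
    every new arch needs its own table."""
    parts = gguf_name.split(".")
    if parts[0] == "blk":
        template = f"blk.{{i}}.{'.'.join(parts[2:])}"
        return LLAMA_GGUF_TO_HF[template].format(i=parts[1])
    return LLAMA_GGUF_TO_HF[gguf_name]
```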
@bash99 we have a PR open to fix this and support arbitrary GGUF models.
Unfortunately, the install condition for the punica and hadamard kernels used a comparison sign facing the wrong way. Fixed in the latest commit to dev.
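In other words, the gate was inverted, so the kernels were skipped exactly when they should have been built. A hypothetical sketch of the bug class (the function name and version bound are made up for illustration):

```python
from packaging.version import Version

MIN_CUDA_FOR_KERNELS = Version("11.8")  # hypothetical bound

def should_build_kernels(cuda_version: str) -> bool:
    # Buggy: the comparison faced the wrong way, disabling the
    # punica/hadamard kernels on every supported CUDA version:
    #   return Version(cuda_version) < MIN_CUDA_FOR_KERNELS
    # Fixed:
    return Version(cuda_version) >= MIN_CUDA_FOR_KERNELS
```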
@sgsdxzy seems to me like an issue with parallelizing the lm_head. Does your PR fix this?