Woosuk Kwon
@robertgshaw2-neuralmagic We haven't used the `CustomOp` interface for the quantization-related ops, since they usually only support NVIDIA or AMD GPUs. Do you want to apply the interface to the quant...
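For reference, a rough sketch of what wrapping a quant op in the `CustomOp` interface could look like. The import path, the op name, and the kernel bodies below are placeholders for illustration, not the actual vLLM code:

```python
import torch

# Assumed import path; the real location of CustomOp may differ.
from vllm.model_executor.custom_op import CustomOp


class ScaledInt8Quant(CustomOp):
    """Hypothetical per-tensor INT8 quantization op with per-backend hooks."""

    def forward_native(self, x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Pure-PyTorch fallback that runs on any device.
        return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

    def forward_cuda(self, x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Placeholder for a dedicated CUDA kernel; reuses the native path here.
        return self.forward_native(x, scale)

    def forward_hip(self, x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Placeholder for a ROCm kernel.
        return self.forward_native(x, scale)
```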
This PR seems to break Mixtral. Let me check the reason.
@comaniac Could you please take a look? The PR removes a few lines of code in model loader that you marked as `FIXME`.
@comaniac Thanks for the confirmation! It works well.
~~For this PR, I will merge it after getting reviews. :)~~ The changes outside the TPU backend were reviewed in #6812 and #6813.
Hi @Isotr0py, thanks for sharing the information.

> I think a compromise about this deprecation is only allowing user to specify VLLM_ATTENTION_BACKEND to enable this Triton backend fallback. So that...
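A rough sketch of what that kind of gating could look like (the function name and the backend string below are placeholders, not the actual vLLM selection logic):

```python
import os


def triton_fallback_allowed() -> bool:
    """Hypothetical guard: only fall back to the Triton attention backend
    when the user explicitly opted in via VLLM_ATTENTION_BACKEND."""
    # "TRITON_ATTN" is a placeholder; the real backend name may differ.
    return os.environ.get("VLLM_ATTENTION_BACKEND") == "TRITON_ATTN"
```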
Thanks for the PR! I will take a look tomorrow (Tue).
Hi @devops724, thanks for reporting the bug. This line: `export VLLM_ATTENTION_BACKEND=FLASH_ATTN` causes the bug. Please do not set the env variable, or set it to `FLASH_ATTN_VLLM_V1` instead.
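For example, either of these (plain `os.environ` here, set before vLLM is imported) should avoid the issue:

```python
import os

# Option 1: leave the variable unset so vLLM picks the backend itself.
os.environ.pop("VLLM_ATTENTION_BACKEND", None)

# Option 2: pin the V1 flash-attention backend explicitly.
# os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN_VLLM_V1"
```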