Woosuk Kwon
@robertgshaw2-neuralmagic We haven't used the `CustomOp` interface for the quantization-related ops, since they usually only support NVIDIA or AMD GPUs. Do you want to apply the interface to the quant...
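For reference, a rough sketch of what wrapping a quant op in the `CustomOp` interface could look like. The import path, the op name, and the kernel bodies below are placeholders for illustration, not the actual vLLM code:

```python
import torch

# Assumed import path; the real location of CustomOp may differ.
from vllm.model_executor.custom_op import CustomOp


class ScaledInt8Quant(CustomOp):
    """Hypothetical per-tensor INT8 quantization op with per-backend hooks."""

    def forward_native(self, x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Pure-PyTorch fallback that runs on any device.
        return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

    def forward_cuda(self, x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Placeholder for a dedicated CUDA kernel; reuses the native path here.
        return self.forward_native(x, scale)

    def forward_hip(self, x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Placeholder for a ROCm kernel.
        return self.forward_native(x, scale)
```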
This PR seems to break Mixtral. Let me check the reason.
@comaniac Could you please take a look? The PR removes a few lines of code in model loader that you marked as `FIXME`.
@comaniac Thanks for the confirmation! It works well.
~~For this PR, I will merge it after getting reviews. :)~~ The changes outside the TPU backend were reviewed in #6812 and #6813.
Hi @Isotr0py, thanks for sharing the information.

> I think a compromise about this deprecation is only allowing user to specify VLLM_ATTENTION_BACKEND to enable this Triton backend fallback. So that...
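A rough sketch of what that kind of gating could look like (the function name and the backend string below are placeholders, not the actual vLLM selection logic):

```python
import os


def triton_fallback_allowed() -> bool:
    """Hypothetical guard: only fall back to the Triton attention backend
    when the user explicitly opted in via VLLM_ATTENTION_BACKEND."""
    # "TRITON_ATTN" is a placeholder; the real backend name may differ.
    return os.environ.get("VLLM_ATTENTION_BACKEND") == "TRITON_ATTN"
```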
Thanks for the PR! I will take a look tomorrow (Tue).
Hi @devops724, thanks for reporting the bug. This line: `export VLLM_ATTENTION_BACKEND=FLASH_ATTN` causes the bug. Please do not set the env variable, or set it to `FLASH_ATTN_VLLM_V1` instead.
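For example, either of these (plain `os.environ` here, set before vLLM is imported) should avoid the issue:

```python
import os

# Option 1: leave the variable unset so vLLM picks the backend itself.
os.environ.pop("VLLM_ATTENTION_BACKEND", None)

# Option 2: pin the V1 flash-attention backend explicitly.
# os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN_VLLM_V1"
```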