Michael Goin comments

Results: 270 comments of Michael Goin

[Model] [Quantization] Support deepseek v3/r1 w8a8 int8 block-wise quantization

@Tmn07 @schung-amd would either of you want to revive this PR? Sorry for losing track of this.

[Bug]: `gemma-2-27b-it-GGUF`: `Architecture gemma2 not supported`

This seems like it might be an issue with transformers GGUF support, since the error is in `transformers/modeling_gguf_pytorch_utils.py`. Do you have an idea, @Isotr0py? Per this dictionary in transformers,...

[Bug]: `gemma-2-27b-it-GGUF`: `Architecture gemma2 not supported`

Okay, thank you for clarifying! @alllexx88, I would recommend opening an issue on the transformers repo to resolve this: https://github.com/huggingface/transformers/issues?q=is%3Aissue+is%3Aopen+gguf

error: fp8e4nv data type is not supported on CUDA arch < 89

Could Triton support conversions between fp8 and fp16? I understand the lack of compute support, but it would be nice to be able to cast and work with the type,...

[Kernel] Integrate DeepGEMM dense block fp8

@houseroad it seems worse at small M but better at large M compared to our CUTLASS kernels; however, this is only true for specific shapes. I need to do more...

[V1][TPU] TPU multimodal model support for ragged attention

> I wonder whether the next effort could be on pre-compiling just the fixed-sized encoders

Yes exactly, this is the stated plan. I just wanted to pull this complexity into...

Use 88 as the line length to be compatible with Black

Personally, I would appreciate a tad more line width given we already have imports that are longer than 80, but this decision should come down to broad consensus. Maybe a...

[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS

@LeiWang1999 thanks for the WIP, very cool interface with bitblas as a package. Can you explain if the GPTQ benchmarking results in vLLM were run with the base "gptq" kernels...

[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS

Thanks for all the work @LeiWang1999! I have a few high-level thoughts first on how to make landing this more straightforward: 1. Make bitblas an optional dependency and remove from...

[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS

@LeiWang1999 thanks for the ping and updates, excited to review!