Isotr0py
Hmm, at least they still kept the FMA fallback when deprecating MMAv1 on pre-Ampere GPUs (https://github.com/triton-lang/triton/pull/5066). I also ran the prefix_prefill tests and they passed on dual T4 GPUs (with...
@WoosukKwon According to the triton team's response (https://github.com/triton-lang/triton/pull/5066#issuecomment-2695159799), the FMA code path for old platforms is still maintained on the main branch, though MMA support is deprecated for those platforms. So...
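For reference, a rough sketch of the kind of capability check that would gate such a fallback (hypothetical, not vLLM's actual dispatch code):

```python3
import torch

# Hypothetical guard: Volta (sm_70) and Turing (sm_75) lost Triton MMAv1
# support upstream, while the FMA (non-tensor-core) path is still maintained,
# so anything below Ampere (sm_80) would take the FMA fallback.
major, minor = torch.cuda.get_device_capability()
use_fma_fallback = (major, minor) < (8, 0)
```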
@taoluo I think that's because the new `triton` version has removed MMAv1 support for Volta and Turing, though I'm not sure whether it's actually a bug in triton as well. You can...
Oh, I didn't notice that there was already a duplicate PR, my bad! 😅
@ywang96 I have migrated all datasets from `benchmark_serving.py` to `datatset_sample_func.py`, PTAL!
AWQ doesn't support bfloat16; you need to add `dtype="float16"` when initializing `LLM` to cast the dtype.
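For example, something like this (the model name here is just an illustrative AWQ checkpoint):

```python3
from vllm import LLM

# AWQ kernels only run in float16, so cast explicitly instead of
# inheriting the checkpoint's bfloat16 default.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",
    dtype="float16",
)
```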
@zhengy001 Can you run a benchmark for this PR? LGTM once the performance is confirmed to still be reasonable. (I don't have an idle device to run the FA2 backend right...
You can sync this PR branch with the main branch to re-run the CI from the new commit.
But I think `DiffusionPipeline` will still initialize the HF-style text encoder when calling `from_pretrained`?

```python3
pipeline = DiffusionPipeline.from_pretrained(
    **dit_config,  # "Qwen/Qwen-Image"
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
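If we want to skip that, I think passing `text_encoder=None` (and `tokenizer=None`) as a component override should work, roughly like this (untested sketch):

```python3
import torch
from diffusers import DiffusionPipeline

# Passing text_encoder=None / tokenizer=None as component overrides should
# make from_pretrained skip loading the HF-style text encoder entirely.
pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",
    text_encoder=None,
    tokenizer=None,
    torch_dtype=torch.bfloat16,
)
```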
Hmmm, we can set fp32 for `gte-Qwen2-1.5B`, but for `ssmits/Qwen2-7B-Instruct-embed-base`, it seems the CI machine won't have enough VRAM to run it with fp32.
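Something like this for the smaller model should be fine, though (rough sketch, assuming a recent vLLM where embedding models run with `task="embed"`; the repo id is my guess for the full name):

```python3
from vllm import LLM

# Run only the 1.5B embedding model in full precision; the 7B variant
# would likely OOM on the CI machine in fp32.
llm = LLM(
    model="Alibaba-NLP/gte-Qwen2-1.5B-instruct",  # assumed full repo id
    task="embed",
    dtype="float32",
)
outputs = llm.encode(["sample text"])
```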