Isotr0py
Hmm, at least they still kept the FMA fallback when deprecating MMAv1 on pre-Ampere GPUs (https://github.com/triton-lang/triton/pull/5066). I also ran the prefix_prefill tests and they passed on dual T4 GPUs (with...
@WoosukKwon According to the triton team's response (https://github.com/triton-lang/triton/pull/5066#issuecomment-2695159799), the FMA code path for old platforms is still maintained on the main branch, though MMA support is deprecated for those platforms. So...
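For reference, a rough sketch of the kind of capability check that would gate such a fallback (hypothetical, not vLLM's actual dispatch code):

```python3
import torch

# Hypothetical guard: Volta (sm_70) and Turing (sm_75) lost Triton MMAv1
# support upstream, while the FMA (non-tensor-core) path is still maintained,
# so anything below Ampere (sm_80) would take the FMA fallback.
major, minor = torch.cuda.get_device_capability()
use_fma_fallback = (major, minor) < (8, 0)
```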
@taoluo I think that's because the new `triton` version has removed MMAv1 support for Volta and Turing, though I'm not sure whether it's actually a bug in triton as well. You can...
Oh, I didn't notice that there was already a duplicate PR, my bad! 😅
@ywang96 I have migrated all datasets from `benchmark_serving.py` to `datatset_sample_func.py`, PTAL!
AWQ doesn't support bfloat16; you need to add `dtype="float16"` when initializing `LLM` to cast the dtype.
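For example, something like this (the model name here is just an illustrative AWQ checkpoint):

```python3
from vllm import LLM

# AWQ kernels only run in float16, so cast explicitly instead of
# inheriting the checkpoint's bfloat16 default.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",
    dtype="float16",
)
```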
@zhengy001 Can you run a benchmark for this PR? LGTM once the performance is confirmed to still be reasonable. (I don't have an idle device to run the FA2 backend right...
You can sync this PR branch with the main branch to re-run the CI from the new commit.
But I think `DiffusionPipeline` will still initialize the HF-style text encoder when calling `from_pretrained`?

```python3
pipeline = DiffusionPipeline.from_pretrained(
    **dit_config,  # "Qwen/Qwen-Image"
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
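If we want to skip that, I think passing `text_encoder=None` (and `tokenizer=None`) as a component override should work, roughly like this (untested sketch):

```python3
import torch
from diffusers import DiffusionPipeline

# Passing text_encoder=None / tokenizer=None as component overrides should
# make from_pretrained skip loading the HF-style text encoder entirely.
pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",
    text_encoder=None,
    tokenizer=None,
    torch_dtype=torch.bfloat16,
)
```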
Hmmm, we can set fp32 for `gte-Qwen2-1.5B`, but for `ssmits/Qwen2-7B-Instruct-embed-base`, it seems the CI machine won't have enough VRAM to run it with fp32.
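Something like this for the smaller model should be fine, though (rough sketch, assuming a recent vLLM where embedding models run with `task="embed"`; the repo id is my guess for the full name):

```python3
from vllm import LLM

# Run only the 1.5B embedding model in full precision; the 7B variant
# would likely OOM on the CI machine in fp32.
llm = LLM(
    model="Alibaba-NLP/gte-Qwen2-1.5B-instruct",  # assumed full repo id
    task="embed",
    dtype="float32",
)
outputs = llm.encode(["sample text"])
```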