Kero Liang
This PR continues the idea of #17617. Thanks @vadiklyutiy. Could you please take a look, @ywang96?
Does V1 support FP8 (W8A8) quantization? I tried [nm-testing/Qwen2-VL-7B-Instruct-FP8-dynamic](https://huggingface.co/nm-testing/Qwen2-VL-7B-Instruct-FP8-dynamic) on the v0.7.1 V1 arch: no error was thrown, but the output was gibberish. The same code and model work properly on the v0.7.1 V0 arch.
If `len(logit_bias)` is large, maybe we can keep a copy of `logit_bias["index"]` and `logit_bias["value"]` in device memory ahead of time (or on the first sample step) and reuse it on subsequent steps, as in the sketch below.
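A minimal sketch of the idea, assuming PyTorch; the `LogitBiasCache` helper and its constructor/`apply` signatures are hypothetical illustrations, not vLLM's actual API:

```python
# Sketch (not vLLM's real implementation): pay the host->device copy of
# the logit_bias indices/values once, then reuse the cached tensors on
# every sampling step instead of re-copying them each time.
import torch


class LogitBiasCache:
    """Hypothetical helper that caches logit_bias tensors on the device."""

    def __init__(self, logit_bias: dict[int, float], device: torch.device):
        # One-time host->device transfer, amortized over all sample steps.
        self.index = torch.tensor(
            list(logit_bias.keys()), dtype=torch.long, device=device)
        self.value = torch.tensor(
            list(logit_bias.values()), dtype=torch.float32, device=device)

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        # In-place add of the cached bias values at the cached token ids.
        logits.index_add_(-1, self.index, self.value.to(logits.dtype))
        return logits


# Usage: build on the first sample step, reuse on every later step.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cache = LogitBiasCache({42: 5.0, 7: -100.0}, device)
logits = torch.randn(32000, device=device)
logits = cache.apply(logits)
```

The design choice is just trading a small amount of device memory for avoiding a per-step host-to-device transfer, which should matter only when the bias dictionary is large.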