Lil2J
**Describe the bug** I use deepspeed.init_inference to accelerate inference of the Qwen model. Compared with running without deepspeed.init_inference, I see no speedup. Then...
When I use nvfp4 + kvcached + sglang, the following error occurs:

  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/attention/base_attn_backend.py", line 91, in forward
    return self.forward_decode(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 815, in forward_decode
    forward_batch.token_to_kv_pool.set_kv_buffer(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/mem_cache/memory_pool.py",...
I’m currently using an image-based setup to start kvcached + sglang, and now I want to use the Qwen3 FP8 model. The inference framework can be successfully deployed, but as...