Robert Shaw
Closed because this functionality was completed in https://github.com/vllm-project/vllm/commit/fb6af8bc086328ca6659e72d11ffd4309ce4de22
> Can you try some of the test cases in #5846, #5872, w/ and w/o chunked prefill?
>
> Additionally, you should be able to mark #4904, #4772, #5334, #5872...
> sampler test broke

Yup, it's due to chunked_prefill. I'm fixing it. @simon-mo @njhill @Yard1 will need to re-review
> Can we have a regression test? Also I have the impression the current fix won't work with chunked prefill (mainly because the second chunk won't have None for the first prompt...
Okay, chunked prefill needed more fixes than I expected. I had to back my changes out of the sampler because it required poking around too much in the sequence_data to detect...
👀
fp8 not yet supported for Qwen. WIP PR: https://github.com/vllm-project/vllm/pull/6088
Fp8 is now supported for Qwen, but MoE Fp8 requires compute_capability == 9.0 (i.e. Hopper GPUs). Our MoE kernels are currently implemented in Triton, which requires triton==3.0 for Fp8 on...
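As a minimal sketch of the hardware gate described above (this is illustrative, not vLLM's actual code): the capability tuple would come from `torch.cuda.get_device_capability()` at runtime, and the check simply requires Hopper or newer.

```python
def fp8_moe_supported(compute_capability: tuple[int, int]) -> bool:
    """Hypothetical helper: MoE Fp8 kernels need compute capability
    >= 9.0 (Hopper), per the comment above. The (major, minor) tuple
    would typically come from torch.cuda.get_device_capability().
    """
    return compute_capability >= (9, 0)


# Hopper (9.0) passes; Ada (8.9) and Ampere (8.0) do not.
print(fp8_moe_supported((9, 0)))  # True
print(fp8_moe_supported((8, 9)))  # False
```

Tuple comparison is lexicographic, so `(8, 9) < (9, 0)` correctly rejects Ada-class GPUs despite the higher minor version.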
The `AssertionError: expected running sequences` is caused by `abort` not yet being supported with `multi-step` scheduling. `multi-step` scheduling is a new feature we are still working on -...
> Two minutes later the next error:
>
> ```
> │     return self._call_impl(*args, **kwargs) │
> │   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl │
> │     return forward_call(*args, **kwargs)...
> ```