zifeitong
Saw the same issue. I tested and the culprit commit should be #3015, though it's not clear to me what the root cause is.
So the problem is that https://github.com/vllm-project/vllm/blob/97b030005c7f5cde7c1b97c718a8841db7d6220b/vllm/engine/async_llm_engine.py#L509 triggered https://github.com/python/cpython/issues/86296. The bug is fixed in Python 3.12 but will not be backported. It's very easy to work around in Python 3.11 with...
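A minimal sketch of one possible workaround, assuming the bug is the `asyncio.wait_for` cancellation race that the Python 3.12 reimplementation fixes, is to use `asyncio.timeout()` (available since 3.11, and what `wait_for` is built on in 3.12) around the awaited step; the names and timeout value below are illustrative, not the actual vLLM code:

```python
import asyncio

ENGINE_ITERATION_TIMEOUT_S = 60  # illustrative value, not the real constant


async def engine_step() -> None:
    """Placeholder standing in for the real per-iteration engine work."""
    await asyncio.sleep(0)


async def run_one_iteration() -> None:
    # asyncio.timeout() (3.11+) raises TimeoutError if the block overruns,
    # without the wait_for cancellation race described in the linked issue.
    async with asyncio.timeout(ENGINE_ITERATION_TIMEOUT_S):
        await engine_step()


asyncio.run(run_one_iteration())
```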
Changed the URL to w3.org
> Hey! I am not sure whether skipping the first token would fix #5334. Have you tested the case in the issue? I think it is something specific to llama-2...
> This is great, thanks! May I ask what you think causes the issue? Just not skipping the first/bos token? But if so, why does llama-3 not have this...
Closing this PR in favor of https://github.com/vllm-project/vllm/pull/6223
Can you try some of the test cases in https://github.com/vllm-project/vllm/pull/5846 and https://github.com/vllm-project/vllm/issues/5872, w/ and w/o chunked prefill? Additionally, you should be able to mark #4904, #4772, #5334, #5872 as fixed.
Thanks for the fix! About the CI OOM issue, I am not sure if it's the best workaround, but [`wait_for_gpu_memory_to_clear`](https://github.com/vllm-project/vllm/blob/08c5bdecae5c5186c39a1d1ff444c3bf01f7fa0e/tests/utils.py#L192) has been helpful.
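For reference, a sketch of how the helper is typically invoked before a GPU-heavy test, assuming the signature at the linked commit (`devices`, `threshold_bytes`, `timeout_s`); the import path and numbers below are illustrative and may differ per test directory:

```python
from tests.utils import wait_for_gpu_memory_to_clear  # path may vary

# Block until the listed GPUs drop below the threshold, or time out,
# so leftover memory from a previous test doesn't cause an OOM here.
wait_for_gpu_memory_to_clear(
    devices=[0],                 # GPU ids the test is about to use
    threshold_bytes=2 * 2**30,   # wait until usage is under ~2 GiB
    timeout_s=60,                # give prior tests time to release memory
)
```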
Sorry for the delay. I still need to get the tests passing. I'll let you know once it's working.