Hu Dong
Just saw a related pending PR: https://github.com/Azure/mmlspark/pull/912
Could we get this PR merged? I ran into the bug recently, and it makes qpid proton pretty much unusable.
Might be related to:
- https://github.com/vllm-project/vllm/issues/3839
- https://github.com/vllm-project/vllm/issues/4135
- https://github.com/vllm-project/vllm/issues/4293
- https://github.com/vllm-project/vllm/issues/6254 (which is fixed by https://github.com/vllm-project/vllm/pull/6255)
> How easy is it to reproduce the issue?

It's about 1/10 I think. It seemed to be very random, at least not directly caused by request concurrency, nor prompt...
> Also, Is it possible to reproduce it with CUDA_LAUNCH_BLOCKING=1 and show us the line?

We just tried. Here's the stacktrace with the env variable:

```
ERROR 04-30 11:35:13 async_llm_engine.py:499]...
```
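In case anyone wants to try the same thing: a minimal sketch of how the variable can be passed into a containerized deployment (image tag, model, and port here are placeholders, not our exact setup):

```bash
# CUDA_LAUNCH_BLOCKING=1 serializes kernel launches, so the failing
# kernel is reported at its call site instead of asynchronously.
docker run --gpus all -p 8000:8000 \
  -e CUDA_LAUNCH_BLOCKING=1 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4
```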
FYI, we actually deployed several instances. They're running on different envs. The following instances have been running for more than 5 days without any problem:

1. vLLM 0.4.0post1, tp=4 (70B...
> Can you also share the stacktrace of workers that are not stuck? (or is all workers stuck at the same line?)

Not sure whether the following is what we...
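In case it helps anyone else gather the same data: one common way to capture a running worker's Python stack is py-spy; I'm assuming something like this is how such dumps are taken (the PID below is just an example):

```bash
# Attach to a running vLLM worker process and print its current
# Python stack without stopping the process.
py-spy dump --pid 7065
```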
> Also, is there code I can try reproducing it in our env?

We were sending requests directly to the vllm container using `curl`, without any in-house code. The container...
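Roughly the shape of the requests (endpoint, model name, and payload are illustrative, not our exact traffic):

```bash
# Plain completion request against the OpenAI-compatible endpoint
# that vLLM exposes; no client library involved.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 64
      }'
```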
> Hmm it is actually interesting PID 7065 is running nothing. It might be the root cause of hanging. Since around that logit access code, all the workers need to...
> also one interesting thing is you use `--enable-prefix-caching`. Does it still hang without this flag? (can you just check)?

I can try reproducing it on my end in...
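For the record, that check is just relaunching with the flag dropped and everything else identical; a sketch assuming the same containerized setup as above:

```bash
# Same deployment as before, but prefix caching is left at its
# default (off) by omitting --enable-prefix-caching.
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4
```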