shyringo
I've also discovered this issue. Wondering if anyone is interested in solving it. It would only take a few judgement calls and a few lines of code.
> I find that we need to explicitly run "del llm.llm_engine.driver_worker" to release it when using a single worker. Can anybody explain why this is the case?

I tried the...
> I tried the above code block and also this line "del llm.llm_engine.driver_worker". Both failed for me.
>
> But I managed, with the following code, to terminate the vllm.LLM(),...
> Tried this including `ray.shutdown()` but the memory is not released on my end, any other suggestion?

You could try "del llm.llm_engine.model_executor" in the following code instead:

update: the...
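Since the snippet referred to above is cut off in the preview, here is a rough sketch (not the original code) of the kind of cleanup sequence being discussed, assuming a vLLM version where the engine exposes `model_executor` (older versions exposed `driver_worker` instead); the model name is just a placeholder:

```python
import gc

import ray
import torch

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))

# Drop the engine's reference to the component holding the model weights.
# Newer vLLM: model_executor; older vLLM: driver_worker.
del llm.llm_engine.model_executor
del llm

# Collect the now-unreferenced objects, then return cached CUDA blocks to the driver.
gc.collect()
torch.cuda.empty_cache()

# If Ray workers were started (e.g. for tensor parallelism), shut them down too.
ray.shutdown()
```

Whether this actually returns all of the GPU memory seems to depend on the vLLM version, as the replies below suggest.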
> did that as well, still no change in gpu memory allocation. Not sure how to go further

Then I do not have a clue either. Meanwhile, I should add...
> > this issue makes vllm impossible for production use
>
> At present, we have found a workaround and set the swap space directly to 0. This way, we...
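For anyone who wants to try that workaround: `swap_space` is an engine argument (in GiB) that can be passed to `LLM()`. A minimal sketch, with a placeholder model name:

```python
from vllm import LLM

# Sketch of the workaround mentioned above: set the CPU swap space to 0 GiB so
# no host memory is reserved for swapped-out KV-cache blocks.
llm = LLM(model="facebook/opt-125m", swap_space=0)
```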
Met the same issue in Offline Batched Inference: the program got stuck at the `LLM()` line and would not continue. GPU memory was occupied, but GPU utilization stayed at 0%.
#1908 might be related, but in 'Offline Batched Inference' mode.