Why does using the LLM class to load models require much more memory than using the Hugging Face from_pretrained method?
I tried the code llm = LLM(model="facebook/opt-125m")
on a single T4 and found the memory cost exceeded 11GB, while the Hugging Face code model = AutoModel.from_pretrained("facebook/opt-125m").cuda()
only cost about 1GB of memory. How much memory do I need to reserve, at a minimum, to use vLLM?
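For reference, a minimal sketch of how the two footprints can be compared on a single GPU (illustrative only; the gpu_memory_used_gib helper is my own assumption, not part of either library, and the exact numbers will vary):

```python
import torch
from transformers import AutoModel
from vllm import LLM

def gpu_memory_used_gib():
    # Total minus free memory on the current CUDA device, in GiB.
    free, total = torch.cuda.mem_get_info()
    return (total - free) / 1024**3

# Hugging Face: essentially only the model weights are moved to the GPU.
model = AutoModel.from_pretrained("facebook/opt-125m").cuda()
print(f"HF load: ~{gpu_memory_used_gib():.1f} GiB used")

del model
torch.cuda.empty_cache()

# vLLM: the engine also pre-allocates memory for the KV cache up to its budget.
llm = LLM(model="facebook/opt-125m")
print(f"vLLM load: ~{gpu_memory_used_gib():.1f} GiB used")
```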
Hi @canghongjian, thanks for trying out vLLM! vLLM runs a simple memory profiling pass and pre-allocates 90% of the total GPU memory for its weights and activations. You can configure this ratio by passing gpu_memory_utilization= when initializing LLM.
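For example, a minimal sketch (the 0.5 ratio is only an illustration; pick whatever fits your GPU):

```python
from vllm import LLM

# Cap vLLM's pre-allocation at 50% of total GPU memory instead of the default 90%.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)
```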
Got it.
Thanks for your kind reply @WoosukKwon. I'd also like to know whether vLLM can help the model process longer input texts at the same memory cost. Your report seems to focus on inference speed.
Not really. vLLM does not reduce the actual memory usage for a single request; it only reduces memory waste. Reducing the memory needed for one request requires machine-learning modifications or techniques like swapping. Feel free to re-open the issue if you have any further questions.