Why does using the LLM class to load models require much more memory than using the Hugging Face from_pretrained method?
I tried the code llm = LLM(model="facebook/opt-125m")
on a single T4 and found the memory cost exceeded 11GB, while the Hugging Face code model = AutoModel.from_pretrained("facebook/opt-125m").cuda()
only cost about 1GB of memory. How much memory do I need to reserve, at a minimum, to use vLLM?
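For reference, a minimal sketch of how the two footprints can be compared on a single GPU (illustrative only; the gpu_memory_used_gib helper is my own assumption, not part of either library, and the exact numbers will vary):

```python
import torch
from transformers import AutoModel
from vllm import LLM

def gpu_memory_used_gib():
    # Total minus free memory on the current CUDA device, in GiB.
    free, total = torch.cuda.mem_get_info()
    return (total - free) / 1024**3

# Hugging Face: essentially only the model weights are moved to the GPU.
model = AutoModel.from_pretrained("facebook/opt-125m").cuda()
print(f"HF load: ~{gpu_memory_used_gib():.1f} GiB used")

del model
torch.cuda.empty_cache()

# vLLM: the engine also pre-allocates memory for the KV cache up to its budget.
llm = LLM(model="facebook/opt-125m")
print(f"vLLM load: ~{gpu_memory_used_gib():.1f} GiB used")
```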
Hi @canghongjian, thanks for trying out vLLM! vLLM runs a simple memory profiling pass and pre-allocates 90% of the total GPU memory for its weights and activations. You can configure this ratio by passing gpu_memory_utilization= when initializing LLM.
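For example, a minimal sketch (the 0.5 ratio is only an illustration; pick whatever fits your GPU):

```python
from vllm import LLM

# Cap vLLM's pre-allocation at 50% of total GPU memory instead of the default 90%.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)
```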
Got it.
Thanks for your kind reply @WoosukKwon. I'd also like to know whether vLLM can help the model process longer input texts at the same memory cost. Your report seems to focus on inference speed.
Not really. vLLM does not reduce the actual memory usage for a single request; it only reduces memory waste. Reducing the memory needed for one request requires machine-learning modifications or techniques like swapping. Feel free to re-open the issue if you have any further questions.