
A potential problem with GPU memory allocation

Open slwang-ustc opened this issue 4 months ago • 1 comment

When using 2 GPUs for tensor-parallel inference, the size of num_blocks in BlockManager is determined by the remaining memory on GPU0 alone. If GPU0 has more free memory than GPU1 (e.g. GPU0 has 40GB free while other processes occupy part of GPU1, leaving only 20GB for nano-vllm), then the number of blocks GPU1 can actually allocate is smaller than num_blocks in BlockManager. Will this cause a problem?
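For context, the per-GPU block budget is typically derived from that GPU's free memory, roughly as in this sketch (`local_num_blocks`, `block_bytes`, and the memory-budget formula here are illustrative, not nano-vllm's actual code):

```python
import torch

def local_num_blocks(block_bytes: int, gpu_memory_utilization: float = 0.9) -> int:
    # Free/total memory as seen by *this* process's current GPU.
    free, total = torch.cuda.mem_get_info()
    # Budget a fraction of total memory, minus what is already in use.
    usable = int(total * gpu_memory_utilization) - (total - free)
    # Ranks can disagree: other processes may occupy different amounts of
    # memory on each GPU, so each rank can compute a different count.
    return max(usable // block_bytes, 0)
```

If only rank 0's result is used to size BlockManager, the other ranks inherit a num_blocks they may not be able to back, which is exactly the scenario described above.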

slwang-ustc avatar Jul 22 '25 08:07 slwang-ustc

Yes, this could indeed cause problems. A more robust approach would be to:

  1. Calculate the maximum number of blocks each GPU can allocate independently based on its own available memory.

  2. Perform an all_reduce operation with the MIN operator across all GPUs to determine a common num_blocks (see the sketch below).
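A minimal sketch of that two-step fix, assuming PyTorch with an NCCL process group already initialized (`agree_on_num_blocks` and `local_num_blocks` are hypothetical names, not nano-vllm's API):

```python
import torch
import torch.distributed as dist

def agree_on_num_blocks(block_bytes: int) -> int:
    # Step 1: each rank computes its own limit from its own free memory
    # (local_num_blocks is the hypothetical helper sketched above).
    local_blocks = local_num_blocks(block_bytes)
    # Step 2: take the minimum across all tensor-parallel ranks, so that
    # every GPU can actually back the shared num_blocks in BlockManager.
    t = torch.tensor([local_blocks], dtype=torch.int64, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return int(t.item())
```

Using MIN rather than rank 0's value guarantees that no rank over-allocates, at the cost of leaving some memory idle on the roomier GPUs.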

Thank you for your consideration!

GeeeekExplorer avatar Nov 03 '25 17:11 GeeeekExplorer