nano-vllm
A potential problem with GPU memory allocation
When using 2 GPUs for tensor-parallel inference, the size of num_blocks in BlockManager is determined by the remaining memory on GPU 0. When allocating blocks, if GPU 0 has more free memory (e.g. 40 GB) than GPU 1 (e.g. other processes occupy part of GPU 1's memory, leaving only 20 GB for nano-vllm), the actual number of blocks GPU 1 can allocate is smaller than num_blocks in BlockManager. Will this cause a problem?
Yes, this could indeed cause problems. A more robust approach would be to:
- Calculate the maximum number of blocks each GPU can allocate independently, based on its own available memory.
- Perform an `all_reduce` operation with the MIN operator across all GPUs to determine a common `num_blocks` (see the sketch below).
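A minimal sketch of that idea, assuming PyTorch distributed is already initialized for the tensor-parallel group; the function name, `block_size_bytes` parameter, and the `memory_utilization` fraction are illustrative, not nano-vllm's actual API:

```python
import torch
import torch.distributed as dist

def compute_common_num_blocks(block_size_bytes: int, memory_utilization: float = 0.9) -> int:
    """Return a num_blocks value that every rank can actually satisfy."""
    # 1. Each rank measures its own free GPU memory and derives a local block count.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    local_num_blocks = int(free_bytes * memory_utilization) // block_size_bytes

    # 2. All-reduce with MIN so every rank agrees on the smallest count,
    #    i.e. the number of blocks even the most memory-constrained GPU can allocate.
    num_blocks_t = torch.tensor(local_num_blocks, dtype=torch.int64, device="cuda")
    dist.all_reduce(num_blocks_t, op=dist.ReduceOp.MIN)
    return int(num_blocks_t.item())
```

With this, every rank builds its BlockManager from the same, safe `num_blocks`, regardless of which GPU happens to have the least free memory.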
Thank you for your consideration!