
Issue with Multi-GPU Inference in Xinference Using vLLM for Model Loading

Open Bc-Aqr opened this issue 8 months ago • 0 comments

I am facing an issue when trying to use multiple GPUs simultaneously for inference with the vLLM engine in Xinference. The setup works correctly when a single GPU is used with a smaller model, but it fails when running multi-GPU inference for larger models. Below are the details of the problem and my environment setup.
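For context, this is roughly the kind of launch call involved. It is only a minimal sketch using the Xinference Python client: the endpoint, model name, and size are placeholders, and the exact parameter names (e.g. model_engine, n_gpu) may differ between Xinference versions.

```python
# Minimal sketch (not my exact command): launching a vLLM-backed model across
# both GPUs through the Xinference Python client. Endpoint, model name, and
# size are placeholders; parameter names may vary by Xinference version.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")   # assumed default Xinference endpoint

model_uid = client.launch_model(
    model_name="qwen2-instruct",           # placeholder "larger model"
    model_engine="vllm",                   # ask Xinference to load it with vLLM
    model_size_in_billions=72,             # large enough to need both cards
    n_gpu=2,                               # request both RTX 4090 D GPUs
)
print("launched model uid:", model_uid)
```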

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off |   00000000:3B:00.0 Off |                  Off |
|  0%   35C    P8             20W /  425W |      13MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      Off |   00000000:86:00.0 Off |                  Off |
| 30%   30C    P8              8W /  425W |      13MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
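For comparison, this is a sketch of what an equivalent tensor-parallel load looks like when calling vLLM directly on these two GPUs. The model name is a placeholder; it is shown only to illustrate the tensor_parallel_size=2 sharding that Xinference is expected to set up.

```python
# Sketch of a direct vLLM tensor-parallel load on the same two GPUs
# (placeholder model; shown only to illustrate the intended sharding).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",   # placeholder model id
    tensor_parallel_size=2,           # shard weights across both RTX 4090 D cards
    gpu_memory_utilization=0.90,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```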

xinference.log

Bc-Aqr · Mar 05 '25 05:03