exllama
GPU Usage Keeps High Even Without Inference Load
Configuration:
AMD W7900 + ROCm 5.6
Running the model on oobabooga/text-generation-webui, GPU usage stays high even after the model is unloaded. Model: TheBloke/Llama-2-70B-GPTQ:gptq-4bit-32g-actorder_True
Running meta-llama/Llama-2-7b-chat-hf without quantization does not have this issue.
Is this expected behavior?
It's not expected, no. I have no explanation for it. Are you able to generate anything while the GPU is in this state?
After I unload the model, about 1% of VRAM is still in use. In this state I am able to load another model and run inference, but GPU usage stays high. Only killing the WebUI process fixes the issue.
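For reference, one quick way to check whether the Python process itself still holds device memory after unloading (on ROCm builds of PyTorch the torch.cuda APIs map to HIP):

```python
import gc
import torch

# After unloading the model: does this process still hold device memory,
# and does emptying the allocator cache release it?
gc.collect()
torch.cuda.empty_cache()
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")
```

If `reserved` is near zero but the driver still reports high utilization, the activity is coming from kernel launches rather than from leaked allocations.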
It must be a ROCm issue of some sort, because there's nothing running in the background, no threads or anything. There's the asynchronous device queue, but the host code synchronizes at multiple points and shouldn't be able to run at all while there are kernels still running.
Does that management interface provide any sort of additional insight into what might be running on the GPU? I.e. is it a rogue kernel, the runtime stuck in a loop trying to clean up corrupted memory, or... idk. I would strongly suspect it's ROCm specific.
It seems like something keeps launching kernels during the loading process.
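One way to see which operations keep launching kernels is to profile the loading path with the PyTorch profiler. A rough sketch; `load_model()` is a placeholder for whatever TGW actually calls to load the model, and CUDA-activity profiling on ROCm depends on how the build was configured:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile the loading path to surface the ops that keep launching device kernels.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model = load_model()  # placeholder for the actual loader call

# Sort by total device time to see which kernels dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```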
Well, it's a Torch kernel (elementwise_kernel) which unfortunately is called all the time for any sort of element-wise operation, so it's anyone's guess what it's doing.
But it's definitely a Torch operation that keeps firing in the background. And since ExLlama is single-threaded I can't imagine a way it could keep launching kernels like this unless it was stuck in a loop. But if it were stuck in a loop you wouldn't be able to load another model and use it.
So it sounds like this is an issue with TGW. Maybe @oobabooga has some idea what might be going on? Is ExLlama launched in a separate process or a separate thread?
In text-generation-webui the generation runs on a separate thread, yes. It is done using the Iteratorize class here: https://github.com/oobabooga/text-generation-webui/blob/main/modules/callbacks.py#L30
But I have never experienced any idle GPU usage.
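For context, the Iteratorize pattern works roughly like this (an illustrative sketch, not TGW's actual code): the blocking generate function runs on a background thread and pushes tokens through a queue that the caller iterates over.

```python
import threading
from queue import Queue

class GeneratorFromCallback:
    """Turn a callback-based generate function into an iterator by running
    it on a background thread. Names here are illustrative only."""

    def __init__(self, func, *args, **kwargs):
        self.queue = Queue()
        self.sentinel = object()

        def _callback(token):
            self.queue.put(token)

        def _run():
            try:
                func(*args, callback=_callback, **kwargs)
            finally:
                self.queue.put(self.sentinel)

        threading.Thread(target=_run, daemon=True).start()

    def __iter__(self):
        while True:
            item = self.queue.get()
            if item is self.sentinel:
                break
            yield item
```

The key point for this issue is that the CUDA/HIP calls themselves still happen sequentially on that one worker thread, so it shouldn't by itself keep launching kernels after generation finishes.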
By adding logging in model.py, I was able to narrow the issue down to the cuda_ext.exllama_ext.prepare_buffers function.
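A sketch of the kind of instrumentation that can narrow this down. The exact signature of prepare_buffers isn't reproduced here, so the wrapper just forwards its arguments and forces a device sync so the timing reflects any kernels the call launched; it assumes the call site looks the function up through the module attribute, and the patch must be applied before the model is loaded:

```python
import time
import torch
import cuda_ext  # ExLlama's Python wrapper around the compiled extension

_orig_prepare_buffers = cuda_ext.exllama_ext.prepare_buffers

def _logged_prepare_buffers(*args, **kwargs):
    t0 = time.perf_counter()
    result = _orig_prepare_buffers(*args, **kwargs)
    torch.cuda.synchronize()  # wait for any kernels this call launched
    print(f"prepare_buffers: {time.perf_counter() - t0:.4f}s")
    return result

cuda_ext.exllama_ext.prepare_buffers = _logged_prepare_buffers
```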