exllama
GPU Usage Keeps High Even Without Inference Load
Configuration:
AMD W7900 + ROCm 5.6
Running the model on oobabooga/text-generation-webui, GPU usage stays high even after the model is unloaded. Model: TheBloke/Llama-2-70B-GPTQ:gptq-4bit-32g-actorder_True
Running meta-llama/Llama-2-7b-chat-hf without quantization does not have this issue.
Is this expected behavior?
It's not expected, no. I have no explanation for it. Are you able to generate anything while the GPU is in this state?
After I unload the model, about 1% of VRAM is still in use. In this state I am able to load another model and run inference, but GPU usage stays high. Only killing the WebUI process fixes the issue.
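For reference, one quick way to check whether the Python process itself still holds device memory after unloading (on ROCm builds of PyTorch the torch.cuda APIs map to HIP):

```python
import gc
import torch

# After unloading the model: does this process still hold device memory,
# and does emptying the allocator cache release it?
gc.collect()
torch.cuda.empty_cache()
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")
```

If `reserved` is near zero but the driver still reports high utilization, the activity is coming from kernel launches rather than from leaked allocations.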
It must be a ROCm issue of some sort, because there's nothing running in the background, no threads or anything. There's the asynchronous device queue, but the host code synchronizes at multiple points and shouldn't be able to run at all while there are kernels still running.
Does that management interface provide any sort of additional insight into what might be running on the GPU? I.e. is it a rogue kernel, the runtime stuck in a loop trying to clean up corrupted memory, or... idk. I would strongly suspect it's ROCm specific.
It seems like something keeps launching kernels during the loading process.
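One way to see which operations keep launching kernels is to profile the loading path with the PyTorch profiler. A rough sketch; `load_model()` is a placeholder for whatever TGW actually calls to load the model, and CUDA-activity profiling on ROCm depends on how the build was configured:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile the loading path to surface the ops that keep launching device kernels.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model = load_model()  # placeholder for the actual loader call

# Sort by total device time to see which kernels dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```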
Well, it's a Torch kernel (elementwise_kernel) which unfortunately is called all the time for any sort of element-wise operation, so it's anyone's guess what it's doing.
But it's definitely a Torch operation that keeps firing in the background. And since ExLlama is single-threaded I can't imagine a way it could keep launching kernels like this unless it was stuck in a loop. But if it were stuck in a loop you wouldn't be able to load another model and use it.
So it sounds like this is an issue with TGW. Maybe @oobabooga has some idea what might be going on? Is ExLlama launched in a separate process or a separate thread?
In text-generation-webui the generation runs on a separate thread, yes. It is done using the Iteratorize class here: https://github.com/oobabooga/text-generation-webui/blob/main/modules/callbacks.py#L30
But I have never experienced any idle GPU usage.
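For context, the Iteratorize pattern works roughly like this (an illustrative sketch, not TGW's actual code): the blocking generate function runs on a background thread and pushes tokens through a queue that the caller iterates over.

```python
import threading
from queue import Queue

class GeneratorFromCallback:
    """Turn a callback-based generate function into an iterator by running
    it on a background thread. Names here are illustrative only."""

    def __init__(self, func, *args, **kwargs):
        self.queue = Queue()
        self.sentinel = object()

        def _callback(token):
            self.queue.put(token)

        def _run():
            try:
                func(*args, callback=_callback, **kwargs)
            finally:
                self.queue.put(self.sentinel)

        threading.Thread(target=_run, daemon=True).start()

    def __iter__(self):
        while True:
            item = self.queue.get()
            if item is self.sentinel:
                break
            yield item
```

The key point for this issue is that the CUDA/HIP calls themselves still happen sequentially on that one worker thread, so it shouldn't by itself keep launching kernels after generation finishes.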
By adding logging in model.py, I was able to narrow the issue down to the cuda_ext.exllama_ext.prepare_buffers function.
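A sketch of the kind of instrumentation that can narrow this down. The exact signature of prepare_buffers isn't reproduced here, so the wrapper just forwards its arguments and forces a device sync so the timing reflects any kernels the call launched; it assumes the call site looks the function up through the module attribute, and the patch must be applied before the model is loaded:

```python
import time
import torch
import cuda_ext  # ExLlama's Python wrapper around the compiled extension

_orig_prepare_buffers = cuda_ext.exllama_ext.prepare_buffers

def _logged_prepare_buffers(*args, **kwargs):
    t0 = time.perf_counter()
    result = _orig_prepare_buffers(*args, **kwargs)
    torch.cuda.synchronize()  # wait for any kernels this call launched
    print(f"prepare_buffers: {time.perf_counter() - t0:.4f}s")
    return result

cuda_ext.exllama_ext.prepare_buffers = _logged_prepare_buffers
```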