exllama
Why is there a huge lag between reading the prompt and starting to generate output?
I have the same problem both in ooba and when cloning this repo directly. I'm not sure what is causing this.
The prompt seems to be read quickly, but there is a big delay before the output starts. I can't say it definitively, but it seems the longer the prompt, the longer the lag gets.
I have a 1080 Ti and a 1080 on Ubuntu Linux. I reduced the power limit a bit, but that shouldn't cause such a huge lag. The times below are for a 13B model, Chronos-13B-SuperHOT-8K-GPTQ, but I've had this issue for weeks with multiple models. Fitting everything on one GPU doesn't improve it either. Is it because they are Pascal cards?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 46%   64C    P2    78W / 220W | 10715MiB / 11264MiB  |      20%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   44C    P8     9W / 200W |      9MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Times with ooba (these include everything: prompt + waiting time + output):
Output generated in 13.96 seconds (0.57 tokens/s, 8 tokens, context 39, seed 1543145771)
Output generated in 12.50 seconds (0.72 tokens/s, 9 tokens, context 39, seed 192406504)
Times with this repo (unfortunately these don't include the time from when the prompt finishes until the output starts, but they do show that prompt processing is fast):
Prompt processed in 0.94 seconds, 62 new tokens, 65.72 tokens/second:
Chatbort: Test received! What would you like me to do with this message? Perhaps you have an interesting puzzle or challenge for me to tackle? Or maybe there's something specific you want me to learn more about? Whatever it is, I'm eager to assist in any way possible!
Response generated in 2.8e+01 seconds, 66 tokens, 2.38 tokens/second:
I'll have a look later. It may be that there's some delay there I just never noticed, because prompt processing is apparently a hundred times faster on the 4090. But just to be clear: using ExLlama's web UI, you're seeing a noticeable delay between when it shows "prompt processed" in the console and when it starts streaming to the UI?
thanks for taking a look, much appreciated!!
Yep, the prompt is finished, but the waiting time is pretty brutal.
This GIF shows Konsole + the browser.
Okay, that is quite a delay there. I had a look and there's no processing happening between when it prints the prompt speed and when it creates the frame to write the response into. I can only conclude that the prompt hasn't actually finished processing by the time the server outputs "Prompt processed in..." and it actually still has a long CUDA queue to finish in the background. Then it stalls when running the forward pass on "\nChatbort:" waiting for the processing to actually finish.
You can confirm by inserting this on line 616 of webui/session.py, right before end_time = time.time():
torch.cuda.synchronize()
This should give a more accurate reading of the prompt speed. I hadn't really paid much attention to this because the prompt processing speed is usually several thousand tokens per second on my setup. I guess it really isn't optimized for Pascal GPUs.
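(To illustrate why the synchronize matters, here's a minimal standalone sketch, not from the repo: CUDA kernel launches are asynchronous, so a timer read right after the launch only measures how long it took to queue the work, not how long the GPU actually takes.)
import time
import torch

x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

start = time.time()
y = x @ x                        # the matmul is only queued; this returns almost immediately
queued = time.time() - start     # misleadingly small

torch.cuda.synchronize()         # block until the queued kernels have actually finished
finished = time.time() - start   # now reflects the real GPU time

print(f"after launch: {queued:.4f}s, after synchronize: {finished:.4f}s")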
One thing you could try is, at the very end of exllama_ext/cuda_func/q4_matmul.cu, switching to the SgemmEx variant of the cuBLAS matmul. It's already there, just commented out. You want it to look like this:
// const half alpha = __float2half(1.0f);
// const half beta = no_zero ? __float2half(1.0f) : __float2half(0.0f);
// cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, width, height, dim, &alpha, buffers->temp_dq, width, x_mapped, dim, &beta, out, width);
const float alpha = 1.0f;
const float beta = no_zero ? 1.0f : 0.0f;
cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, width, height, dim, &alpha, buffers->temp_dq, CUDA_R_16F, width,
              x_mapped, CUDA_R_16F, dim, &beta, out, CUDA_R_16F, width);
This may give you somewhat better performance on Pascal, at least for the prompt. Will use up a bit more memory though.
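(If that helps, it's presumably because consumer Pascal cards have very limited FP16 arithmetic throughput, so cublasHgemm runs slowly on them, while cublasSgemmEx still takes the FP16 inputs but does the accumulation in FP32. That's my reading, anyway; I haven't profiled it on Pascal.)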
That did it!!
Here are the two timings with the .synchronize() line you gave me. This did indeed show the prompt times properly.
Before the fix:
Prompt processed in 20.48 seconds, 91 new tokens, 4.44 tokens/second:
Response generated in 1.1e+01 seconds, 64 tokens, 5.77 tokens/second:
After the fix:
Prompt processed in 0.84 seconds, 91 new tokens, 108.21 tokens/second:
Response generated in 9.3 seconds, 54 tokens, 5.78 tokens/second:
The response starts instantly now. Very cool. Thanks so much for helping me out so quickly.