
Lag/delay when clicking buttons in the gradio UI

Open LoganDark opened this issue 2 years ago • 97 comments

Describe the bug

When clicking buttons in the gradio web interface, there's a noticeable delay before the button press is actually received by the Python server. I'm not sure whether this delay is inherent to gradio or solvable at all, but it severely hampers one use case: repeatedly generating with max tokens = 1 in order to closely supervise and collaborate with the model. If there is any way to reduce or eliminate this delay, that would be really nice.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Have your terminal on-screen, and click a button in the interface that usually results in console output (e.g. the Generate button, saving a preset). Not only will the web interface take a second to show the orange outline, it will also take a second for anything to show up in the console.

Devtools shows that the server takes around 400ms to respond to even a simple "stop" request when it isn't even generating anything. Maybe a Python profiler should be pointed at this.
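To quantify this outside the browser, here's a minimal timing sketch; the URL is just the webui's default local address (an assumption), and to time a specific button you'd replay the exact request shown in the network tab instead. For attributing the time server-side, a sampling profiler like py-spy can attach to the running process.

    # minimal sketch: time round trips to a locally running server.
    # assumption: the default address http://127.0.0.1:7860; adjust as needed.
    import time

    import requests  # pip install requests

    URL = "http://127.0.0.1:7860/"

    samples = []
    for _ in range(20):
        start = time.perf_counter()
        requests.get(URL)
        samples.append((time.perf_counter() - start) * 1000)

    samples.sort()
    print(f"min {samples[0]:.1f} ms / median {samples[len(samples) // 2]:.1f} ms / max {samples[-1]:.1f} ms")

On localhost, a request that does no real work should come back in single-digit milliseconds; anything in the hundreds points at server-side overhead.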

Screenshot

No response

Logs

N/A

System Info

Operating System: Windows 11 Enterprise 64-bit (10.0, Build 22621)
Language: English (Regional Setting: English)
System Manufacturer: Micro-Star International Co., Ltd.
System Model: MS-7D97
BIOS: 1.20
Processor: 12th Gen Intel(R) Core(TM) i5-12400F (12 CPUs), ~5.2GHz
Memory: 16384MB RAM
Page file: 45178MB used, 3833MB available

Name: NVIDIA GeForce RTX 3060
Manufacturer: NVIDIA
Chip Type: NVIDIA GeForce RTX 3060
DAC Type: Integrated RAMDAC
Device Type: Full Display Device
Approx. Total Memory: 20250 MB
Display Memory (VRAM): 12129 MB
Shared Memory: 8121 MB
Current Display Mode: 3840 x 2160 (32 bit) (60Hz)
Monitor: Generic PnP Monitor
HDR: Supported

LoganDark avatar Sep 30 '23 23:09 LoganDark

Previously: #3621, #3202, both closed by the overzealous stale bot.

LoganDark avatar Sep 30 '23 23:09 LoganDark

Not stale.

LoganDark avatar Oct 11 '23 07:10 LoganDark

Not stale.

LoganDark avatar Oct 18 '23 15:10 LoganDark

Not stale.

LoganDark avatar Oct 25 '23 11:10 LoganDark

Not stale.

LoganDark avatar Nov 01 '23 17:11 LoganDark

Not stale.

LoganDark avatar Nov 09 '23 03:11 LoganDark

Not stale.

LoganDark avatar Nov 15 '23 21:11 LoganDark

Not stale.

LoganDark avatar Nov 22 '23 20:11 LoganDark

Not stale.

LoganDark avatar Nov 29 '23 17:11 LoganDark

Not stale.

LoganDark avatar Dec 06 '23 16:12 LoganDark

Not stale.

LoganDark avatar Dec 14 '23 09:12 LoganDark

Not stale.

LoganDark avatar Dec 20 '23 21:12 LoganDark

Same here.

gonjay avatar Dec 24 '23 07:12 gonjay

When I use an older version of text-generation-webui, the problem doesn't seem as obvious, and I have a feeling the logic of gradio's front-end JS processing might be part of the problem here.

gonjay avatar Dec 24 '23 07:12 gonjay

When I use an older version of text-generation-webui, the problem doesn't seem as obvious, and I have a feeling the logic of gradio's front-end JS processing might be part of the problem here.

It's the time it takes the server to respond to requests; look in the network tab of the developer tools and you'll see. It's hundreds of milliseconds for a server running on localhost, which is insane. I know Python is slow, but it can't possibly be this slow. There's something wrong with gradio.

LoganDark avatar Dec 24 '23 07:12 LoganDark

The correct way to handle this would be to change the state of the UI first, then send the network request and show a hint that it's in progress, just like every other chat application (see the sketch at the end of this comment).

When I use an older version of text-generation-webui, the problem doesn't seem as obvious, and I have a feeling the logic of gradio's front-end JS processing might be part of the problem here.

It's the time it takes the server to respond to requests; look in the network tab of the developer tools and you'll see. It's hundreds of milliseconds for a server running on localhost, which is insane. I know Python is slow, but it can't possibly be this slow. There's something wrong with gradio.
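Something like this, as a minimal gradio sketch (not webui's actual code; slow_work is just a hypothetical stand-in for generation):

    # minimal sketch: give instant visual feedback in a cheap first step,
    # then run the slow work in a chained step.
    import time

    import gradio as gr

    def slow_work(text):
        time.sleep(2)  # stand-in for generation
        return f"done: {text}"

    with gr.Blocks() as demo:
        box = gr.Textbox(label="input")
        status = gr.Markdown("idle")
        out = gr.Textbox(label="output")
        btn = gr.Button("Generate")

        btn.click(lambda: "working...", None, status, show_progress=False).then(
            slow_work, box, out).then(
            lambda: "idle", None, status, show_progress=False)

    demo.launch()

Note that the first step is still its own request, so if every round trip costs 400ms this masks the delay rather than removing it.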

gonjay avatar Dec 24 '23 07:12 gonjay

The correct way to handle this would be to change the state of the UI first

The issue is the delay in the server processing requests, not a lack of client-side progress bars or spinners (which are indeed present when you run the model). Simply making the server respond faster would remove the need for any further mitigations.

LoganDark avatar Dec 24 '23 07:12 LoganDark

Gradio is really fast for building small demos, but as the app becomes more and more functional, it seems to struggle under heavier workloads.

gonjay avatar Dec 24 '23 07:12 gonjay

Yeah, I don't think the server has any business doing 400ms of processing before it even begins to serve a request.

LoganDark avatar Dec 24 '23 07:12 LoganDark

I found the Generate code in modules/ui_chat.py:

    # each chained .then() below is a separate event/request to the server
    shared.gradio['Generate'].click(
        # 1. snapshot all input widgets into the interface state
        ui.gather_interface_values, gradio(shared.input_elements), gradio('interface_state')).then(
        # 2. copy the textbox into 'Chat input' and clear it
        lambda x: (x, ''), gradio('textbox'), gradio('Chat input', 'textbox'), show_progress=False).then(
        # 3. generate the reply and stream it into the chat display
        chat.generate_chat_reply_wrapper, gradio(inputs), gradio('display', 'history'), show_progress=False).then(
        # 4. re-gather interface state, then persist the chat history
        ui.gather_interface_values, gradio(shared.input_elements), gradio('interface_state')).then(
        chat.save_history, gradio('history', 'unique_id', 'character_menu', 'mode'), None).then(
        # 5. play the browser-side audio notification
        lambda: None, None, None, _js=f'() => {{{ui.audio_notification_js}}}')

gonjay avatar Dec 24 '23 07:12 gonjay

I found the Generate code in modules/ui_chat.py:

    # each chained .then() below is a separate event/request to the server
    shared.gradio['Generate'].click(
        # 1. snapshot all input widgets into the interface state
        ui.gather_interface_values, gradio(shared.input_elements), gradio('interface_state')).then(
        # 2. copy the textbox into 'Chat input' and clear it
        lambda x: (x, ''), gradio('textbox'), gradio('Chat input', 'textbox'), show_progress=False).then(
        # 3. generate the reply and stream it into the chat display
        chat.generate_chat_reply_wrapper, gradio(inputs), gradio('display', 'history'), show_progress=False).then(
        # 4. re-gather interface state, then persist the chat history
        ui.gather_interface_values, gradio(shared.input_elements), gradio('interface_state')).then(
        chat.save_history, gradio('history', 'unique_id', 'character_menu', 'mode'), None).then(
        # 5. play the browser-side audio notification
        lambda: None, None, None, _js=f'() => {{{ui.audio_notification_js}}}')

Generate isn't a good benchmark because it's doing other work, like starting the model. In the original post I used the Stop button while the model wasn't running. That should do absolutely nothing, because there is no actual work to do, but it takes 400ms for the server to even respond to the request.

LoganDark avatar Dec 24 '23 07:12 LoganDark

Eventually I got too annoyed and moved to LM Studio instead, which is way easier to run, faster, and has a better UI... it just doesn't do training, LoRAs, or non-GGUF models.

LoganDark avatar Dec 24 '23 07:12 LoganDark

I've also used LM Studio, but I'd rather have a chat UI with user-friendly interactions that I can run in a browser, since I have various devices that need to run LLMs. I've also tried the various chat UIs listed here: [LLM webui](https://www.reddit.com/r/LocalLLaMA/comments/1847qt6/llm_webui_recommendations/), but none of them are good enough for my needs.

gonjay avatar Dec 24 '23 07:12 gonjay

I've also used LM Studio, but I'd rather have a chat UI with user-friendly interactions that I can run in a browser, since I have various devices that need to run LLMs. I've also tried the various chat UIs listed here: [LLM webui](https://www.reddit.com/r/LocalLLaMA/comments/1847qt6/llm_webui_recommendations/), but none of them are good enough for my needs.

I'm working on my own UI too, so maybe one day that will be an option. But it's definitely some months out, unfortunately.

LoganDark avatar Dec 24 '23 07:12 LoganDark

That's amazing to hear! Keep up the fantastic work, and I look forward to seeing the incredible results in the near future!

gonjay avatar Dec 24 '23 07:12 gonjay

You might want to try ChatGPT-Next. Then we just need to build an LLM JSON API with a format consistent with OpenAI's APIs; a rough sketch follows.
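Something like this could be the shape of it — the route and response format mirror OpenAI's chat completions, and run_local_model is a hypothetical stub, not part of any existing API:

    # minimal sketch of an OpenAI-style chat endpoint.
    # run_local_model is a hypothetical stand-in for real inference.
    from typing import Optional

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Message(BaseModel):
        role: str
        content: str

    class ChatRequest(BaseModel):
        model: str
        messages: list[Message]
        max_tokens: Optional[int] = None

    def run_local_model(messages: list[Message]) -> str:
        # stub: echo the last user message; swap in real generation here
        return f"echo: {messages[-1].content}"

    @app.post("/v1/chat/completions")
    def chat_completions(req: ChatRequest):
        return {
            "object": "chat.completion",
            "model": req.model,
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": run_local_model(req.messages)},
                "finish_reason": "stop",
            }],
        }

Any OpenAI-compatible client could then be pointed at this by overriding its base URL.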

gonjay avatar Dec 24 '23 09:12 gonjay

I've been looking at my devtools and server responses very rarely take more than 2ms no matter what button I click. Console output is always instantaneous after pressing a button.

It would be great to see some actual results/benchmarks and/or video of the problem being reproduced or proven in some way.

TheLounger avatar Dec 24 '23 23:12 TheLounger

It was easy for me to reproduce this when I reported the issue. I'll record a video when I'm back home in a couple of days. Stupid Christmas trip...

LoganDark avatar Dec 24 '23 23:12 LoganDark

I just cloned the repo fresh and tried again and I get mixed results.

Here's deleting a chat from the UI taking about half a second:

[screenshot: devtools network timing]

However, hitting Stop seems to be fast now, at least when no model is being loaded:

[screenshot: devtools network timing]

However, once the model is generating, the latency starts to come back:

[screenshot: devtools network timing]

(this is discounting the fact that the generation took a couple seconds to actually stop)

I don't think it's my CPU being overloaded, since I'm testing with a 2-bit Phi-2 GGUF, which is positively tiny, and llama.cpp also uses only 4 cores by default. I haven't observed any difference in latency with or without my firewall running, so it's not that.

Some of the issue seems to have been fixed, but not all.

LoganDark avatar Dec 28 '23 21:12 LoganDark

I don't know if you are still experiencing issues, but I've found that different browsers behave very differently. Opera works best for me on mobile devices; perhaps trying a different browser will resolve things.

RandomInternetPreson avatar Dec 31 '23 19:12 RandomInternetPreson