Add api_streaming extension and update api-example-stream to use it

Open andysalerno opened this issue 1 year ago • 10 comments

I'm back with another possible enhancement to the streaming api :)

This time, I duplicated the "api" extension, but instead of an HTTP server I implemented a simple websocket server.

Then I updated the api-example-stream.py to use it.

I've been using this a lot locally, and it's been working very well. It's especially nice to not rely on hitting the raw Gradio API :D

Not a fan of gifs, but apparently GitHub doesn't like videos:

[image: output]

andysalerno avatar Apr 10 '23 05:04 andysalerno

quick note, I just realized in the recording it is called "streaming_api," but before making the final PR I renamed it to "api_streaming" so it shows up next to "api" in alphanumeric order

andysalerno avatar Apr 10 '23 05:04 andysalerno

Following the discussion in https://github.com/oobabooga/text-generation-webui/issues/1268#event-9017892604, I think that the goal now should be, if possible, to merge this streaming API extension and the non-streaming API extension into a single one with two endpoints.
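For anyone wondering what "a single one with two endpoints" could look like, here is a rough sketch, not the code in this PR. The idea is that both endpoints consume the same generator: the blocking one waits for the final text, the streaming one forwards each new piece. `generate_reply`, both endpoint functions, and the message field names (`event`, `message_num`, `text`, `results`) are all hypothetical placeholders.

```python
from typing import Iterator


def generate_reply(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for the model: yields the reply so far, growing each step."""
    reply = ""
    for token in [" It", " works", "."]:
        reply += token
        yield reply


def blocking_endpoint(prompt: str) -> dict:
    # Drain the generator and return only the final text.
    final = ""
    for partial in generate_reply(prompt):
        final = partial
    return {"results": [{"text": final}]}


def streaming_endpoint(prompt: str) -> Iterator[dict]:
    # Forward only the newly generated part of each partial reply.
    num = 0
    sent = 0
    for partial in generate_reply(prompt):
        yield {"event": "text_stream", "message_num": num, "text": partial[sent:]}
        sent = len(partial)
        num += 1
    # Tell the client that nothing more is coming.
    yield {"event": "stream_end", "message_num": num}
```

In a real extension the first function would sit behind an HTTP handler and the second behind a websocket handler; the point is only that the generation logic is shared.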

oobabooga avatar Apr 16 '23 18:04 oobabooga

Stop words and max new tokens don't work here

CyberTimon avatar Apr 21 '23 11:04 CyberTimon

Thanks for making this a working extension. The max new tokens can be fixed by replacing 'max_length' with 'max_new_tokens' on line 35 of script.py, but the stop words look pretty broken right now. It's not clear whether they should be passed as a custom_stopping_strings string or as a stopping_strings list.

xanthousm avatar Apr 22 '23 04:04 xanthousm

I'll be taking a look at cleaning this up tomorrow if I have time: updating the parameters, and combining with the non-streaming API as suggested.

andysalerno avatar Apr 22 '23 05:04 andysalerno

I'll give you a trophy 🏆 if you manage to combine the two APIs @andysalerno. For reference, the current gradio api enabled by default is this one:

https://github.com/oobabooga/text-generation-webui/blob/main/modules/api.py

I'd like to make this gradio api optional, and turn the "universal" api with two endpoints into the official one for this repository.

oobabooga avatar Apr 22 '23 06:04 oobabooga

I made some progress today, which you can see in the latest commits.

I see there are conflicts - later this evening I'll try to resolve.

But there's still one tricky problem: when using --shared, the cloudflared reverse proxy does not seem to play nice with websockets. If I drop --shared, the websocket server is perfectly usable. It's also usable remotely if I manually proxy via ngrok (I had never heard of cloudflared before, but it seems to just be a reverse proxy like ngrok).

But one more thought. I'm not sure if it's a problem with cloudflared as a service (like maybe it doesn't handle the chatter of websockets well) or if it's a problem with how the streaming_api.py configures the service. Possibly there's a better way we can configure cloudflared, and suddenly it will start working. I'll give this a try sometime this weekend.

andysalerno avatar Apr 22 '23 20:04 andysalerno

(somehow the PR got closed while I was fixing merge conflicts, so I reopened)

Ok, I think I've solved the problem with streaming and cloudflared.

The short version is, when I spawn cloudflared from a new thread, suddenly streaming via the proxy works as expected. I'm not sure why this is the case, since internally the cloudflared code is spawning a new process with Popen anyway. So I don't know why it matters that we spawn a new thread just to spawn a new process 🤷
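To make the pattern concrete, here is a minimal sketch of "spawn a thread whose only job is to spawn a process". It is not the code from this PR and does not call cloudflared; `launch_in_thread` and `on_line` are hypothetical names, and the child process is just a Python one-liner so the example runs anywhere. It shows the shape of the workaround only, it does not explain why the workaround helps.

```python
import subprocess
import sys
import threading


def launch_in_thread(command, on_line):
    """Start `command` from a background thread and pass each stdout line to `on_line`."""

    def worker():
        # The process is created inside the thread, not on the caller's thread.
        proc = subprocess.Popen(command, stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            on_line(line.rstrip("\n"))
        proc.wait()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


# Usage: a stand-in child process that prints one line and exits.
lines = []
thread = launch_in_thread([sys.executable, "-c", "print('tunnel ready')"], lines.append)
thread.join(timeout=30)
```

The caller gets the thread back immediately, so the server startup is not held up while the child process is still producing output.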

At this point it's working perfectly for my own use. But I could use a second opinion to validate :)

  • I only have my desktop running Linux, so it'd be good to have someone verify on Windows/Mac. I think the cloudflared stuff is platform-specific because internally it's downloading an executable to run, so that needs a check.
  • I didn't spend much time validating the old non-streaming api, since I haven't used it much. So that could also use an extra validation.

andysalerno avatar Apr 23 '23 00:04 andysalerno

First of all, this is looking amazing! I have tested both the streaming and the blocking API locally on Linux and they worked perfectly.

I have made the following minor changes:

  • Add an --api flag that activates this extension automatically and a --public-api flag that is used in the start_server call in place of share.args, so that a gradio share link is not generated every time the user wants to launch a public API.
  • Use the default parameter names in the web UI by default, look for the alternative names as a fallback.
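As an illustration of the second point, the fallback could look roughly like the sketch below. This is not the extension's actual code: `normalize_params` is a hypothetical helper, the default of 200 tokens is made up, and the assumption that `custom_stopping_strings` is a comma-separated run of quoted strings is mine.

```python
import ast


def normalize_params(body: dict) -> dict:
    """Prefer the web UI's parameter names and fall back to the alternative ones."""
    # 'max_new_tokens' wins; 'max_length' is only consulted when it is absent.
    max_new_tokens = body.get("max_new_tokens", body.get("max_length", 200))

    stopping = body.get("stopping_strings")
    if stopping is None:
        # Assumed format: '"\nYou:", "###"' (quoted strings separated by commas).
        raw = body.get("custom_stopping_strings", "")
        stopping = ast.literal_eval(f"[{raw}]") if raw.strip() else []

    return {
        "max_new_tokens": int(max_new_tokens),
        "stopping_strings": list(stopping),
    }
```

With this shape, a client that still sends `max_length` keeps working, and a client that sends both names gets the web UI name honored.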

The public API url seems to have a small bug where it's only generated for the streaming server:

~$ python server.py --public-api
Starting API at None/api
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Starting streaming server at public url wss://....trycloudflare.com/api/v1/stream
Starting streaming server at public url https://....trycloudflare.com/api

I want to test it on Google Colab to see if it's worth it keeping the Gradio API defined in modules/api.py at all. This is the code for launching a test on Google Colab (pasting it here for me to reuse later):

!git clone https://github.com/andysalerno/text-generation-webui -b improved-async-streaming
%cd text-generation-webui
!pip install -r requirements.txt
!pip install -r extensions/api/requirements.txt
!python download-model.py facebook/galactica-125m
!python server.py --public-api

oobabooga avatar Apr 23 '23 15:04 oobabooga

The public API url seems to have a small bug where it's only generated for the streaming server.

Just pushed two commits fixing the above. The line "Starting API at None/api" needed to be removed. And the very bottom line "Starting streaming server at public url https..." is actually the non-streaming URL. I updated the message to say "non-streaming".

andysalerno avatar Apr 23 '23 16:04 andysalerno

I confirm that the blocking API is working as expected now over the cloudflare share link. I also confirmed that it works over Google Colab, so I removed the old Gradio API as it has no use case anymore.

The only thing left to do is test this on Windows as you suggested. I'll do that now.

oobabooga avatar Apr 23 '23 18:04 oobabooga

Working fine on Windows.

As promised, 🏆🏆🏆🏆🏆🏆🏆🏆🏆🏆🏆🏆🏆 to @andysalerno for the achievement! This is a major improvement to the web UI, as it did not have a proper and robust API like this one until now. Thanks a lot for this PR.

oobabooga avatar Apr 23 '23 18:04 oobabooga

"Why does this API stuff not work?" Because i was a 24 hours behind the bleeding edge of updates. Now it runs fine, good stuff!

Smileynator avatar Apr 23 '23 20:04 Smileynator

The websocket server sends only the first message separately, then sends all the other messages batched (after the whole response is generated). At first I thought it was my client, then checked wireshark:

[image: Wireshark capture]

The "await websocket.send" in streaming_api.py seems to be called multiple times, but all the messages from messagenum2 to stream_end are sent together at the very end.

MajdajkD avatar Apr 24 '23 06:04 MajdajkD

Found a solution here: https://websockets.readthedocs.io/en/stable/faq/server.html

Putting "await asyncio.sleep(0)" in the sending loop solves the problem.
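Here is a small standalone sketch of why that helps, using only asyncio and no real websocket. `RecordingSocket`, `stream_reply`, and `demo` are hypothetical names, and the fake `send` completes immediately, which is my understanding of what happens when the real send has buffer room. Without the `sleep(0)`, the sending loop never hands control back to the event loop, so nothing else (including, presumably, the actual network write) gets a turn until the loop is done.

```python
import asyncio


class RecordingSocket:
    """Hypothetical stand-in for a websocket connection; it only records sends."""

    def __init__(self):
        self.sent = []

    async def send(self, message):
        # Completes immediately, so awaiting it does not pause the caller.
        self.sent.append(message)


async def stream_reply(socket, tokens, yield_to_loop=True):
    for token in tokens:
        await socket.send(token)
        if yield_to_loop:
            # Hand control back to the event loop so other tasks can run.
            await asyncio.sleep(0)
    await socket.send("stream_end")


async def demo(yield_to_loop):
    socket = RecordingSocket()
    seen = []  # number of messages sent each time the observer got a turn

    async def observer():
        while "stream_end" not in socket.sent:
            seen.append(len(socket.sent))
            await asyncio.sleep(0)

    task = asyncio.create_task(observer())
    await stream_reply(socket, ["Hello", ",", " world"], yield_to_loop)
    await task
    return seen
```

Calling `asyncio.run(demo(True))` lets the observer run between sends, while `asyncio.run(demo(False))` only lets it run after everything, including `stream_end`, has already been sent.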

MajdajkD avatar Apr 24 '23 06:04 MajdajkD

@MajdajkD can you submit the exact change that you made in a PR?

oobabooga avatar Apr 24 '23 06:04 oobabooga