
When will the stream feature be added to the API?

Open ashinwz opened this issue 2 years ago • 12 comments

ashinwz avatar Apr 23 '23 15:04 ashinwz

what do you mean by stream feature? aren't our current CLI and web interface both streaming?

zhisbug avatar Apr 25 '23 08:04 zhisbug

I mean the API response should support streaming, and there should be a POST parameter to turn streaming on/off (true/false).

For reference, from the OpenAI API docs:

stream (boolean)

If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message. See the OpenAI Cookbook for example code.
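For context, on the wire that data-only SSE stream looks roughly like this (abbreviated; the chunk payloads are shown in full later in this thread):

data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": "Hel"}, "index": 0, "finish_reason": null}], ...}

data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": "lo"}, "index": 0, "finish_reason": null}], ...}

data: [DONE]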

ashinwz avatar Apr 25 '23 09:04 ashinwz

That would be great, I'm joining the request as well.

Hazoom avatar Apr 25 '23 09:04 Hazoom

Joining the request. We need streaming in the API.

kaust2018 avatar Apr 26 '23 06:04 kaust2018

I need this as well!

real-limitless avatar Apr 29 '23 04:04 real-limitless

I was looking into this, and to make this work the API server has to be adjusted to accept a "stream" parameter.

JSON (cURL):

{
    "model": "vicuna-13b",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "stream": true
}

Or

Python (API client):

completion = client.ChatCompletion.create(
    model="vicuna-13b",
    messages=[
        {"role": "user", "content": content}
    ],
    stream=True
)

In the API server this feature shouldn't be too difficult, because the web server that hosts the chat UI already has text streaming built in; see the snippet below, and the sketch of how its output could be consumed right after it.

fastChat/fastchat/serve/gradio_web_server.py

    try:
        # Stream output
        response = requests.post(
            worker_addr + "/worker_generate_stream",
            headers=headers,
            json=gen_params,
            stream=True,
            timeout=20,
        )
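For completeness, here is a rough sketch of how that streamed response could then be consumed. The null-byte delimiter and the "text"/"error_code" fields are assumptions about the worker's streaming format, not a verbatim copy of the repo code; it also assumes json is imported at module level.

        # Rough sketch: iterate over the worker's streamed chunks.
        # Assumption: chunks are JSON objects separated by null bytes,
        # each carrying "text" (partial output) and "error_code".
        for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
            if not chunk:
                continue
            data = json.loads(chunk.decode())
            if data.get("error_code", 0) != 0:
                break
            partial_text = data["text"]  # growing partial model output
            # ...update the UI, or forward the delta to the API client here...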

A modification in the file FastChat/fastchat/serve/api.py, inside the async function chat_completion, would be necessary to stream out chunks exactly the way OpenAI does. We'd just have to emulate OpenAI's delta chunks to get native API support for other applications being built that require streaming, like voice chat bots, etc. A rough server-side sketch follows the example chunks below.

{
  "choices": [
    {
      "delta": {
        "content": "1"            <<< This is the letters/words.
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1680380941,
  "id": "chatcmpl-70c8LVUSYoSbdQTyONgJfcVU542wO",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}

# ... lots more here ...

{
  "choices": [
    {
      "delta": {
        "content": "ina"           <<< This is the letters/words.
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1680380941,
  "id": "chatcmpl-70c8LVUSYoSbdQTyONgJfcVU542wO",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
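To make that concrete, here is a minimal sketch (not the actual FastChat implementation) of what the streaming branch of chat_completion could look like. It assumes a FastAPI app like the existing server and a hypothetical generate_text_stream() helper that yields partial text from the model worker.

# Minimal sketch only, not the actual FastChat implementation.
import json
import time
import uuid

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def generate_text_stream(payload):
    # Hypothetical placeholder: in reality this would stream from the model worker.
    for piece in ["Ahoy", ",", " matey", "!"]:
        yield piece


@app.post("/v1/chat/completions")
async def chat_completion(payload: dict):
    if not payload.get("stream", False):
        # The existing non-streaming path would go here.
        return {"detail": "non-streaming path omitted in this sketch"}

    completion_id = "chatcmpl-" + uuid.uuid4().hex
    created = int(time.time())
    model = payload.get("model", "vicuna-13b")

    async def event_stream():
        async for piece in generate_text_stream(payload):
            chunk = {
                "id": completion_id,
                "object": "chat.completion.chunk",
                "created": created,
                "model": model,
                "choices": [{"delta": {"content": piece}, "index": 0, "finish_reason": None}],
            }
            yield "data: " + json.dumps(chunk) + "\n\n"
        # Final chunk with finish_reason, then the OpenAI-style terminator.
        final = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": model,
            "choices": [{"delta": {}, "index": 0, "finish_reason": "stop"}],
        }
        yield "data: " + json.dumps(final) + "\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")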

A modification in FastChat/fastchat/client/api.py, somewhere in the ChatCompletionClient class and the ChatCompletion class, would also be needed to get native streaming just like the OpenAI SDK (a rough client-side sketch follows the OpenAI example below):

async for chunk in await openai.ChatCompletion.acreate(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Generate a list of 20 great names for sentient cheesecakes that teach SQL"
    }],
    stream=True,
):
    content = chunk["choices"][0].get("delta", {}).get("content")
    if content is not None:
        print(content, end='')
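On the FastChat side, the client could expose something similar by parsing the SSE lines itself. A minimal sketch, assuming httpx (not necessarily what fastchat/client/api.py actually uses) and a hypothetical helper name acreate_chat_completion_stream:

import json
import httpx


async def acreate_chat_completion_stream(base_url, model, messages):
    # Hypothetical helper, not the real fastchat.client API: yields
    # OpenAI-style delta chunks parsed from the server-sent event stream.
    payload = {"model": model, "messages": messages, "stream": True}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", base_url + "/v1/chat/completions", json=payload
        ) as response:
            async for line in response.aiter_lines():
                if not line.startswith("data: "):
                    continue
                data = line[len("data: "):]
                if data.strip() == "[DONE]":
                    break
                yield json.loads(data)

Consuming it would then mirror the OpenAI loop above: async for chunk in acreate_chat_completion_stream(...), then read chunk["choices"][0].get("delta", {}).get("content").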

This page walks through streaming delta chunks from the OpenAI API in detail:

https://til.simonwillison.net/gpt3/python-chatgpt-streaming-api

I think this is a good high-level look at what changes are needed. If anyone else has any other ideas, feel free to chime in as well.

This does not look too difficult to do.

real-limitless avatar Apr 29 '23 14:04 real-limitless

I need this too.

critejon avatar May 01 '23 18:05 critejon

Hi, I've started to work on this issue.

baradm100 avatar May 03 '23 03:05 baradm100

I need this too.

Good0007 avatar May 04 '23 02:05 Good0007

The PR is ready and tested: https://github.com/lm-sys/FastChat/pull/858

Feel free to review!

baradm100 avatar May 06 '23 16:05 baradm100

@baradm100 where's your tip jar?

real-limitless avatar May 06 '23 17:05 real-limitless

welp, i stupidly implemented this myself also in #873 without having checked other PRs first... lol

actually, looking at the two candidate PRs, i shouldn't call myself stupid. my version appears to make far fewer edits to achieve the same purpose.

i guess maintainers have some options now, so this feature should land upstream shortly!

kfatehi avatar May 06 '23 21:05 kfatehi

This is a highly requested feature, and thanks to all of you for your contributions!

I will try to merge #873 and #858

merrymercy avatar May 08 '23 10:05 merrymercy

@merrymercy @baradm100

Hey, I was able to test this.

I was able to get delta token updates in my responses. However, there is some weirdness when the API is used with LangChain with streaming turned on.

I get an empty response when streaming is turned on. Should a bug be filed here or on LangChain?

import os 

import streamlit as st 
from langchain.chat_models import ChatOpenAI as OpenAIChat
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, SequentialChain 
from langchain.memory import ConversationBufferMemory
from langchain.utilities import WikipediaAPIWrapper,BingSearchAPIWrapper
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate, LLMChain
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)


os.environ['OPENAI_API_BASE'] = "http://localhost:8000/v1"


llm = OpenAIChat(openai_api_base="http://localhost:8000/v1",callbacks=[StreamingStdOutCallbackHandler()],  model_name="vicuna-13B",streaming=False,verbose=True)




template="You are a helpful assistant that translates english to pirate."
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
example_human = HumanMessagePromptTemplate.from_template("Hi")
example_ai = AIMessagePromptTemplate.from_template("Argh me mateys")
human_template="{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)


chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, example_human, example_ai, human_message_prompt])
chain = LLMChain(llm=llm, prompt=chat_prompt)


resp = chain.run("I love Red Hat!")
print(resp)


####  RESP
####  Aye, Red Hat be a fine operating system, arrr!


#### With Streaming

llm = OpenAIChat(openai_api_base="http://localhost:8000/v1", callbacks=[StreamingStdOutCallbackHandler()], model_name="vicuna-13B", streaming=True, verbose=True)




template="You are a helpful assistant that translates english to pirate."
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
example_human = HumanMessagePromptTemplate.from_template("Hi")
example_ai = AIMessagePromptTemplate.from_template("Argh me mateys")
human_template="{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)





chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, example_human, example_ai, human_message_prompt])
chain = LLMChain(llm=llm, prompt=chat_prompt)


resp = chain.run("I love Red Hat!")
print(resp)
#### RESP
#### [Empty Line]

real-limitless avatar May 08 '23 21:05 real-limitless