[Feature Request] support stream chat

Open dblate opened this issue 1 year ago • 8 comments

Prerequisites

  • [X] I have searched existing issues and reviewed documentation.

Problem Description

In chat applications, the generate API needs to support a streaming mode to improve the user experience.

Proposed Solution

Following the OpenAI chat API, add stream: true to support streaming responses via SSE, for example:

curl http://127.0.0.1:8343/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
        "model": "facebook/opt-1.3b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is your name?"}
        ],
        "stream": true
    }'
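
With stream: true, the server would then deliver the reply as Server-Sent Events, roughly in the OpenAI chat.completion.chunk style (the exact field set ServerlessLLM would use is an assumption here), e.g.:

data: {"object": "chat.completion.chunk", "model": "facebook/opt-1.3b", "choices": [{"index": 0, "delta": {"content": "My"}, "finish_reason": null}]}

data: {"object": "chat.completion.chunk", "model": "facebook/opt-1.3b", "choices": [{"index": 0, "delta": {"content": " name"}, "finish_reason": null}]}

data: [DONE]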

Alternatives Considered

No response

Additional Context

No response

Importance

Important

Usage Statistics (Optional)

No response

dblate avatar Dec 03 '24 02:12 dblate

This is a great feature to have! If anyone is interested in working on it, please leave a comment here, and we can assign it to you. We can also use this thread to discuss the implementation details.

future-xy avatar Dec 06 '24 17:12 future-xy

I'd like to implement this feature, please assign it to me.

dblate avatar Dec 10 '24 01:12 dblate

I encountered a problem when implementing this feature:

  • For the server side, I can return StreamingResponse(generator, media_type="text/event-stream") for HTTP streaming.
  • For the backend side, I can call model.generate(input_ids, streamer=streamer) and read tokens from the streamer to produce output in a streaming manner.

But how can I pass the streamer or the generator from the backend to the server, given that Ray does not support passing an ObjectRefGenerator to another actor (as far as I know)? Right now the generator would have to travel transformer_backend -> request_router -> http_server.
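
Roughly, the two pieces look like this (a minimal sketch assuming FastAPI on the server side and transformers' TextIteratorStreamer on the backend side; handler and function names are illustrative):

from threading import Thread

from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer

# Server side (http_server): wrap a token generator in an SSE response.
async def chat_completions(request):
    generator = ...  # this is the part that has to cross the Ray actor boundary
    return StreamingResponse(generator, media_type="text/event-stream")

# Backend side (transformer_backend): produce tokens incrementally.
def generate_stream(model, tokenizer, input_ids):
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    thread = Thread(target=model.generate,
                    kwargs={"input_ids": input_ids, "streamer": streamer})
    thread.start()
    for text_piece in streamer:  # yields decoded text as generation progresses
        yield text_piece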

I'd appreciate it if anyone could provide some suggestions or solutions.

dblate avatar Dec 23 '24 07:12 dblate

By the way, I have found that a ray.util.queue.Queue can be passed between Ray actors, so I will try that.
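
A rough sketch of that idea (actor and method names are illustrative, not the actual ServerlessLLM code):

import ray
from ray.util.queue import Queue

@ray.remote
class TransformerBackend:
    def generate_stream(self, prompt, queue):
        # In the real backend this would push pieces read from the streamer.
        for piece in ["Hello", " world"]:
            queue.put(piece)
        queue.put(None)  # sentinel: generation finished

# On the http_server side:
queue = Queue()  # a ray.util.queue.Queue can be passed to actors as an argument
backend = TransformerBackend.remote()
backend.generate_stream.remote("What is your name?", queue)

def sse_generator():
    while True:
        piece = queue.get()  # blocks until the backend produces the next piece
        if piece is None:
            break
        yield f"data: {piece}\n\n"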

dblate avatar Dec 23 '24 09:12 dblate

Hi! That's a great question—thank you for providing the detailed context!

It seems that Ray Actors can work with streaming responses, similar to Python iterators using yield. Have you considered implementing this using Ray Generators? They might help bypass the limitations of passing ObjectRefGenerator between actors.
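
For example, something along these lines might work (a sketch assuming Ray's streaming generators, available in recent Ray releases; the actor and method names are illustrative):

import ray

@ray.remote
class TransformerBackend:
    def generate_stream(self, prompt):
        # In the real backend this would yield pieces from the model streamer.
        for piece in ["Hello", " world"]:
            yield piece

backend = TransformerBackend.remote()
# Calling a generator method returns an ObjectRefGenerator; each iteration
# yields an object ref for the next item as soon as it is produced.
for ref in backend.generate_stream.remote("What is your name?"):
    print(ray.get(ref))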

Alternatively, would it simplify the setup if Ray wasn't used in ServerlessLLM? Instead, we could route requests (streaming or non-streaming) between regular HTTP servers.
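
For the HTTP-only alternative, relaying a stream between services could look roughly like this (a sketch assuming FastAPI plus httpx; the backend URL and endpoint are hypothetical):

import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
BACKEND_URL = "http://backend:8000/generate"  # hypothetical backend endpoint

@app.post("/v1/chat/completions")
async def proxy_chat(body: dict):
    async def relay():
        async with httpx.AsyncClient(timeout=None) as client:
            # Stream the backend's SSE response and forward each chunk as-is.
            async with client.stream("POST", BACKEND_URL, json=body) as resp:
                async for chunk in resp.aiter_bytes():
                    yield chunk
    return StreamingResponse(relay(), media_type="text/event-stream")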

Looking forward to your thoughts!

future-xy avatar Dec 24 '24 19:12 future-xy

Yes, I have tried Ray Generators, and I ran into some weird problems/bugs, especially when Ray Generators are used between actors. Fortunately, I have managed to work around them, and a first version will be committed soon.

In addition, I strongly agree that Ray introduces a certain amount of complexity into this project. But without Ray, can we still achieve cluster auto-scaling and other related features? I am not very clear on that.

As far as I know, FastChat uses HTTP communication between its servers, but it does not seem to support auto-scaling.

Looking forward to your thoughts about other architectures.

dblate avatar Dec 25 '24 12:12 dblate

I'm here to update the status. I'm still working on this feature. I was too busy at the end of last year, so the progress was delayed.

dblate avatar Feb 07 '25 08:02 dblate

No worries! By the way, the latest main branch includes optimizations to the build process, which you might find useful to merge into your development branch.

future-xy avatar Feb 13 '25 21:02 future-xy