[Feature Request] Support streaming chat
Prerequisites
- [X] I have searched existing issues and reviewed documentation.
Problem Description
In chat applications, the generate API needs to support streaming mode to improve the user experience.
Proposed Solution
Following the OpenAI chat API, add `stream: true` to support streaming responses via SSE, for example:
```bash
curl http://127.0.0.1:8343/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-1.3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ],
    "stream": true
  }'
```
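For reference, with `stream: true` an OpenAI-compatible server replies with SSE chunks rather than a single JSON body. A sketch of the wire format (field values are illustrative):

```text
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","model":"facebook/opt-1.3b","choices":[{"index":0,"delta":{"content":"My"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","model":"facebook/opt-1.3b","choices":[{"index":0,"delta":{"content":" name"},"finish_reason":null}]}

data: [DONE]
```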
Alternatives Considered
No response
Additional Context
No response
Importance
Important
Usage Statistics (Optional)
No response
This is a great feature to have! If anyone is interested in working on it, please leave a comment here, and we can assign it to you. We can also use this thread to discuss the implementation details.
I'd like to implement this feature. Please assign it to me.
I encountered a problem when implementing this feature:
- For the server side, I can return `StreamingResponse(generator, media_type="text/event-stream")` for HTTP streaming.
- For the backend side, I can use `model.generate(input_ids, streamer=streamer)` and read tokens from the streamer to produce output incrementally (a single-process sketch of both pieces follows below).
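A minimal single-process sketch of how these two pieces fit together, assuming FastAPI and Hugging Face transformers' `TextIteratorStreamer` (the endpoint shape and model are illustrative, not the actual ServerlessLLM code):

```python
from threading import Thread

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

@app.post("/v1/chat/completions")
def chat_completions(prompt: str):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    # model.generate() blocks until generation finishes, so run it in a
    # background thread and let the response consume the streamer lazily.
    Thread(
        target=model.generate,
        kwargs={"input_ids": input_ids, "streamer": streamer, "max_new_tokens": 128},
    ).start()

    def event_stream():
        for text in streamer:  # yields decoded text piece by piece
            yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```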
But how can I pass the streamer or the generator from the backend to the server, since Ray does not support passing an `ObjectRefGenerator` to another actor (as far as I know)? At the moment the generator delivery path is transformer_backend -> request_router -> http_server.
I'd appreciate it if anyone could offer suggestions or solutions.
By the way, I have found that `ray.util.queue.Queue` can be passed between Ray actors, so I will try that.
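A hedged sketch of that idea: the queue is backed by its own actor, so a handle to it can be shared across actors and drained as a generator on the server side (the actor and method names below are illustrative, not the real ServerlessLLM ones):

```python
import ray
from ray.util.queue import Queue

ray.init()

@ray.remote
class TransformerBackend:
    def generate_stream(self, prompt: str, queue: Queue) -> None:
        # Stand-in for the real loop over a transformers streamer.
        for token in prompt.split():
            queue.put(token)
        queue.put(None)  # sentinel: generation finished

backend = TransformerBackend.remote()
queue = Queue()  # the handle can be passed through request_router etc.
backend.generate_stream.remote("hello streaming world", queue)

def token_generator():
    # The HTTP server drains the queue lazily, token by token.
    while (token := queue.get()) is not None:
        yield token

print(list(token_generator()))  # ['hello', 'streaming', 'world']
```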
Hi! That's a great question—thank you for providing the detailed context!
It seems that Ray actors can stream responses much like Python generators do with `yield`. Have you considered implementing this using Ray Generators? They might help bypass the limitation on passing an `ObjectRefGenerator` between actors.
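For illustration, a minimal sketch of what that could look like, assuming the streaming-generator semantics of recent Ray releases:

```python
import ray

ray.init()

@ray.remote
class Backend:
    def stream_tokens(self, prompt: str):
        # Each yield is shipped to the caller as soon as it is produced.
        for token in prompt.split():
            yield token

backend = Backend.remote()
gen = backend.stream_tokens.remote("hello streaming world")
for ref in gen:          # iterating yields ObjectRefs as chunks arrive
    print(ray.get(ref))  # resolve each chunk incrementally
```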
Alternatively, would it simplify the setup if Ray wasn't used in ServerlessLLM? Instead, we could route requests (streaming or non-streaming) between regular HTTP servers.
Looking forward to your thoughts!
Yes, I have tried Ray Generators, and I ran into some strange problems/bugs, especially when Ray Generators are used between actors. Fortunately, I have managed to work around them, and a first version will be committed soon.
In addition, I strongly agree that Ray introduces a certain amount of complexity into this project. But without Ray, could we still achieve cluster auto-scaling and related features? I am not entirely clear on that.
As far as I know, FastChat uses HTTP communication between its servers, but it does not seem to support auto-scaling.
Looking forward to your thoughts on other architectures.
I'm here to update the status. I'm still working on this feature. I was too busy at the end of last year, so the progress was delayed.
No worries! By the way, the latest main branch includes optimizations to the build process, which you might find useful to merge into your development branch.