[Question]: How to write the stream_complete function for my custom LLM (used for Gradio)?
Question Validation
- [X] I have searched both the documentation and discord for an answer.
Question
```python
# Imports for llama-index >= 0.10; adjust the paths if you are on the legacy layout.
from typing import Any

from llama_index.core.llms import (
    CompletionResponse,
    CompletionResponseGen,
    CustomLLM,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback

# model, tokenizer and model_name are assumed to be defined elsewhere in the script.


class HuatuoLLM(CustomLLM):
    dummy_response: str = "My response"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=2048,
            # num_output=256,
            model_name=model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # only return newly generated tokens
        messages = []
        messages.append({"role": "user", "content": prompt})
        text = model.HuatuoChat(tokenizer, messages)
        # text, _ = model.chat(tokenizer, prompt, history=[])
        return CompletionResponse(text=text)

    @llm_completion_callback()
    # def stream_complete(
    #     self, prompt: str, **kwargs: Any
    # ) -> CompletionResponseGen:
    #     messages = []
    #     messages.append({"role": "user", "content": prompt})
    #     response = model.HuatuoChat(tokenizer, messages, stream=True)
    #     return response
    def stream_complete(
        self, prompt: str, **kwargs: Any
    ) -> CompletionResponseGen:
        response = ""
        for token in self.dummy_response:
            response += token
            yield CompletionResponse(text=response, delta=token)
```
What should I do in the `stream_complete` function? And how do I get the streamed output? Thanks for the help!
The `stream_complete` function in your custom LLM class is a generator that yields a `CompletionResponse` for each character of the `dummy_response` string. It is decorated with `llm_completion_callback()`, which handles event dispatching and callback management for LLM completion tasks. In the context of `stream_complete`, the decorator does roughly the following (a conceptual sketch follows the list):
- Dispatches an `LLMCompletionStartEvent` before the wrapped function runs.
- Registers the start of the event with the callback manager.
- Executes `stream_complete`, which yields `CompletionResponse` objects, and passes each yielded response through to the caller.
- Once the generator is exhausted, dispatches an `LLMCompletionEndEvent` and registers the event's end with the callback manager.
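For intuition, here is a rough, hypothetical sketch of the wrapping pattern such a decorator follows on the streaming path. This is not the actual LlamaIndex implementation (that lives in llama-index-core/llama_index/core/llms/callbacks.py); the `print` calls merely stand in for the real event dispatch and callback bookkeeping:

```python
import functools
from typing import Any, Callable


def sketch_completion_callback() -> Callable:
    """Hypothetical stand-in for llm_completion_callback(); streaming path only."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(self: Any, prompt: str, **kwargs: Any):
            # "start" side: dispatch LLMCompletionStartEvent and notify the callback manager
            print("completion started")
            gen = fn(self, prompt, **kwargs)

            def wrapped_gen():
                for item in gen:
                    # pass each yielded CompletionResponse straight through to the caller
                    yield item
                # "end" side: once the generator is exhausted, dispatch
                # LLMCompletionEndEvent and close the callback event
                print("completion ended")

            return wrapped_gen()

        return wrapper

    return decorator
```

The important point is that your `stream_complete` only has to be a generator of `CompletionResponse` objects; the decorator takes care of the event plumbing around it.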
The `stream_complete` method itself yields a `CompletionResponse` for each character of the `dummy_response` string. Each `CompletionResponse` carries the accumulated output so far (`text=response`) and the newly added piece (`delta=token`). This lets the function provide a stream of responses, each one reflecting the addition of a new token.
Here's a simplified walkthrough of how the placeholder `stream_complete` works:
- It initializes an empty string `response`.
- It iterates over each character of `dummy_response`.
- For each character, it appends the character to `response` and yields a `CompletionResponse` with the current `response` as `text` and the character as `delta`.
This design lets callers consume the output incrementally: iterate over the generator returned by `stream_complete` and read each `delta` (or the accumulated `text`) as it arrives, as shown in the sketch below.
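To get the stream output, instantiate the class and loop over the generator that `stream_complete` returns. A minimal consumption sketch, assuming the `HuatuoLLM` class from the question is in scope together with its `model`, `tokenizer`, and `model_name` globals (the prompt string is only an illustration):

```python
llm = HuatuoLLM()

# stream_complete returns a generator of CompletionResponse objects
for chunk in llm.stream_complete("Describe the common symptoms of a cold."):
    # chunk.delta is the newly generated piece, chunk.text the accumulated output
    print(chunk.delta, end="", flush=True)
print()
```

For Gradio, the same pattern works for streaming UIs: the function you bind to the interface can loop over `llm.stream_complete(...)` and `yield chunk.text` so the textbox updates as new tokens arrive.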
However, please note that the actual implementation of `stream_complete` depends on the specifics of your LLM and on how you want to handle streaming completions. The code above is a simplified example and may not cover all the complexities of a real-world LLM.
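If `model.HuatuoChat` really supports `stream=True`, as the commented-out attempt suggests, `stream_complete` can wrap that streaming output instead of the dummy string. The sketch below is a hypothetical adaptation, not tested against Huatuo: it assumes the streaming call yields the accumulated response text at each step (if it yields deltas instead, accumulate them yourself before building each `CompletionResponse`), and that `model` and `tokenizer` are the same globals used in `complete`:

```python
from typing import Any

from llama_index.core.llms import (
    CompletionResponse,
    CompletionResponseGen,
    CustomLLM,
)
from llama_index.core.llms.callbacks import llm_completion_callback


class HuatuoLLM(CustomLLM):
    # ... metadata and complete() as in the question ...

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        messages = [{"role": "user", "content": prompt}]
        previous = ""
        # ASSUMPTION: each streamed item is the full response text generated so far.
        for accumulated in model.HuatuoChat(tokenizer, messages, stream=True):
            delta = accumulated[len(previous):]  # keep only the newly generated piece
            previous = accumulated
            yield CompletionResponse(text=accumulated, delta=delta)
```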
For more information on the `llm_completion_callback()` decorator and how it works, see `llama-index-core/llama_index/core/llms/callbacks.py` in the LlamaIndex repository.
Sources
- llama-index-legacy/llama_index/legacy/langchain_helpers/streaming.py
- llama-index-core/llama_index/core/llms/callbacks.py
About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.