LLMAssistantContextAggregator: always aggregate TTSTextFrame
This is the assistant aggregator, so we should add everything that is being spoken. Because of that, we should use TTSTextFrame, since those are the frames that are actually spoken. We should push the aggregation as soon as the bot stops speaking.
So, we don't need to handle LLMFullResponseStartFrame and LLMFullResponseEndFrame anymore.
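For illustration, here is a minimal sketch of the idea (the class name and the `_push_aggregation` placeholder are made up for this example; the real aggregator also has to write into the LLM context and deal with interruptions):

```python
from pipecat.frames.frames import BotStoppedSpeakingFrame, Frame, TTSTextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class AssistantTTSAggregatorSketch(FrameProcessor):
    """Sketch only: collect TTSTextFrame text and flush when the bot stops speaking."""

    def __init__(self):
        super().__init__()
        self._aggregation = ""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TTSTextFrame):
            # Only text that is actually spoken reaches us as TTSTextFrame.
            self._aggregation += f"{frame.text} "
        elif isinstance(frame, BotStoppedSpeakingFrame) and self._aggregation:
            # Push the aggregation as soon as the bot stops speaking.
            await self._push_aggregation(self._aggregation.strip())
            self._aggregation = ""

        await self.push_frame(frame, direction)

    async def _push_aggregation(self, text: str):
        # Placeholder: the actual aggregator appends an assistant message to
        # the LLM context here instead of just printing.
        print(f"assistant: {text}")
```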
Converting this to draft for now. I'm not 100% sure this is how it should be implemented. Even though it seems to make sense (because we only add TTSTextFrame to the context), we need to give it more thought.
One good thing about this approach is that whatever we push through TTSSpeakFrame will always be part of the context, creating a unified behavior.
Because with the current approach, it is included only if sent during a function call (between LLMFullResponseStartFrame and LLMFullResponseEndFrame), which feels like a bug: sometimes it is included and sometimes it is not.
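For example, a TTSSpeakFrame queued outside of a function call (say, a greeting pushed on the task; `task` here is assumed to be a running PipelineTask) is spoken but never makes it into the context today, whereas with the TTSTextFrame-based approach it would, since the TTS service still emits TTSTextFrames for it:

```python
from pipecat.frames.frames import TTSSpeakFrame


async def greet(task):
    # `task` is assumed to be a running PipelineTask. The greeting is spoken,
    # but the current aggregator only records it if it happens to be pushed
    # between LLMFullResponseStartFrame and LLMFullResponseEndFrame.
    await task.queue_frame(TTSSpeakFrame("Hi! How can I help you today?"))
```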
Yes, but another problem to consider is that currently we can support pipelines without audio, for example, text-based applications. With this change we are making it so the assistant aggregator is only useful for audio cases.
True, I had not considered that. Agreed, we need to give this more thought so we don’t miss any use cases. 👍
One option is to pass a text_only_pipeline: bool to PipelineParams. This would be used when the aggregator is initialized and would result in using the current LLMFullResponseStartFrame/LLMFullResponseEndFrame behavior.
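Roughly like this (the flag and class are only illustrative, and I haven't thought about how PipelineParams would actually reach the aggregator's constructor):

```python
from pipecat.frames.frames import (
    Frame,
    LLMFullResponseEndFrame,
    LLMFullResponseStartFrame,
    TextFrame,
    TTSTextFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class AssistantAggregatorWithTextOnlyOpt(FrameProcessor):
    """Sketch only: keep the current LLM-response-boundary behavior for
    text-only pipelines, use TTSTextFrame aggregation otherwise."""

    def __init__(self, *, text_only_pipeline: bool = False):
        super().__init__()
        self._text_only_pipeline = text_only_pipeline
        self._aggregating = False
        self._aggregation = ""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if self._text_only_pipeline:
            # Current behavior: aggregate text between the LLM response boundaries.
            if isinstance(frame, LLMFullResponseStartFrame):
                self._aggregating = True
            elif isinstance(frame, LLMFullResponseEndFrame):
                self._aggregating = False
                self._flush()
            elif isinstance(frame, TextFrame) and self._aggregating:
                self._aggregation += frame.text
        elif isinstance(frame, TTSTextFrame):
            # Proposed behavior: aggregate only what is actually spoken
            # (flushing on BotStoppedSpeakingFrame, as in the sketch above).
            self._aggregation += f"{frame.text} "

        await self.push_frame(frame, direction)

    def _flush(self):
        # Placeholder for appending an assistant message to the LLM context.
        print(f"assistant: {self._aggregation.strip()}")
        self._aggregation = ""
```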
But that would mean we can't have a pipeline that does both at the same time. That said, in open-sesame we have two different pipelines.
What issue are we trying to solve? Is it that the TTSSpeakFrames aren't added to the context?
Yes. Currently, after a recent change, we only add them if they are pushed inside a function call. But ideally we should always add them.
This issue mentions the problems with text-only pipelines and TTSTextFrame, so perhaps this could be limited to TTSSpeakFrames only for now?
Alternately, could this be a parameter that is set somewhere in the pipeline or on a processor?
Such as (the last option is sketched after this list):
- TTSSpeakFrame -- add_to_assistant_contexts: bool # Add to any downstream assistant contexts
- TTSSpeakFrame -- add_to_context: LLMAssistantContextAggregator # Add to a specific assistant context
- LLMAssistantContextAggregator -- add_tts_speak_frames: bool # Add all downstream TTSSpeakFrames
- etc.
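As a concrete (purely hypothetical) example of the processor-level option, the aggregator could simply take an opt-in flag:

```python
from pipecat.frames.frames import Frame, TTSSpeakFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class AssistantAggregatorWithSpeakOpt(FrameProcessor):
    """Sketch only: opt in to adding all downstream TTSSpeakFrames to the context."""

    def __init__(self, *, add_tts_speak_frames: bool = False):
        super().__init__()
        self._add_tts_speak_frames = add_tts_speak_frames

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if self._add_tts_speak_frames and isinstance(frame, TTSSpeakFrame):
            self._append_to_context(frame.text)
        await self.push_frame(frame, direction)

    def _append_to_context(self, text: str):
        # Placeholder for appending an assistant message to the LLM context.
        print(f"assistant (via TTSSpeakFrame): {text}")
```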
Another option would be a convenience method to manually append text to an assistant context.
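That could be as small as a helper around the context itself (assuming OpenAILLMContext and its add_message method; the helper name and the exact message shape are only illustrative):

```python
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext


def append_assistant_text(context: OpenAILLMContext, text: str):
    # Hypothetical convenience helper: manually record text as something the
    # assistant "said", without going through any frame aggregation.
    context.add_message({"role": "assistant", "content": text})
```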
@aconchillo should we close this one?
Closing this as enough things in the aggregators have changed.