LLMAssistantContextAggregator: always aggregate TTSTextFrame
This is the assistant aggregator, so we should add everything that is being spoken. Because of that, we should use TTSTextFrame, since those are the frames that are actually spoken. We should push the aggregation as soon as the bot stops speaking.
So, we don't need to handle LLMFullResponseStartFrame and LLMFullResponseEndFrame anymore.
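For illustration, here is a minimal sketch of the idea (the class name and the `_push_aggregation` placeholder are made up for this example; the real aggregator also has to write into the LLM context and deal with interruptions):

```python
from pipecat.frames.frames import BotStoppedSpeakingFrame, Frame, TTSTextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class AssistantTTSAggregatorSketch(FrameProcessor):
    """Sketch only: collect TTSTextFrame text and flush when the bot stops speaking."""

    def __init__(self):
        super().__init__()
        self._aggregation = ""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TTSTextFrame):
            # Only text that is actually spoken reaches us as TTSTextFrame.
            self._aggregation += f"{frame.text} "
        elif isinstance(frame, BotStoppedSpeakingFrame) and self._aggregation:
            # Push the aggregation as soon as the bot stops speaking.
            await self._push_aggregation(self._aggregation.strip())
            self._aggregation = ""

        await self.push_frame(frame, direction)

    async def _push_aggregation(self, text: str):
        # Placeholder: the actual aggregator appends an assistant message to
        # the LLM context here instead of just printing.
        print(f"assistant: {text}")
```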
Converting this to draft for now. I'm not 100% sure this is how it should be implemented. Even though it seems to make sense (because we only add TTSTextFrame to the context), we need to give it more thought.
One good thing about this approach is that whatever we push through TTSSpeakFrame will always be part of the context, creating a unified behavior.
Because with the current approach, it is included only if sent during a function call (between LLMFullResponseStartFrame and LLMFullResponseEndFrame), which feels like a bug: sometimes it is included and sometimes it is not.
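For example, a TTSSpeakFrame queued outside of a function call (say, a greeting pushed on the task; `task` here is assumed to be a running PipelineTask) is spoken but never makes it into the context today, whereas with the TTSTextFrame-based approach it would, since the TTS service still emits TTSTextFrames for it:

```python
from pipecat.frames.frames import TTSSpeakFrame


async def greet(task):
    # `task` is assumed to be a running PipelineTask. The greeting is spoken,
    # but the current aggregator only records it if it happens to be pushed
    # between LLMFullResponseStartFrame and LLMFullResponseEndFrame.
    await task.queue_frame(TTSSpeakFrame("Hi! How can I help you today?"))
```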
Yes, but another problem to consider is that currently we can support pipelines without audio, for example, text-based applications. With this change we are making it so the assistant aggregator is only useful for audio cases.
True, I had not considered that. Agreed, we need to give this more thought so we don’t miss any use cases. 👍
One option is to pass a text_only_pipeline: bool to PipelineParams. This would be used when the aggregator is initialized and would result in using the current LLMFullResponseStartFrame/LLMFullResponseEndFrame behavior.
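Roughly like this (the flag and class are only illustrative, and I haven't thought about how PipelineParams would actually reach the aggregator's constructor):

```python
from pipecat.frames.frames import (
    Frame,
    LLMFullResponseEndFrame,
    LLMFullResponseStartFrame,
    TextFrame,
    TTSTextFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class AssistantAggregatorWithTextOnlyOpt(FrameProcessor):
    """Sketch only: keep the current LLM-response-boundary behavior for
    text-only pipelines, use TTSTextFrame aggregation otherwise."""

    def __init__(self, *, text_only_pipeline: bool = False):
        super().__init__()
        self._text_only_pipeline = text_only_pipeline
        self._aggregating = False
        self._aggregation = ""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if self._text_only_pipeline:
            # Current behavior: aggregate text between the LLM response boundaries.
            if isinstance(frame, LLMFullResponseStartFrame):
                self._aggregating = True
            elif isinstance(frame, LLMFullResponseEndFrame):
                self._aggregating = False
                self._flush()
            elif isinstance(frame, TextFrame) and self._aggregating:
                self._aggregation += frame.text
        elif isinstance(frame, TTSTextFrame):
            # Proposed behavior: aggregate only what is actually spoken
            # (flushing on BotStoppedSpeakingFrame, as in the sketch above).
            self._aggregation += f"{frame.text} "

        await self.push_frame(frame, direction)

    def _flush(self):
        # Placeholder for appending an assistant message to the LLM context.
        print(f"assistant: {self._aggregation.strip()}")
        self._aggregation = ""
```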
But that would mean we can't have a pipeline that does both at the same time. That said, in open-sesame we have two different pipelines.
What issue are we trying to solve? Is it that the TTSSpeakFrames aren't added to the context?
Yes. Currently, after a recent change, we only add them if they are pushed inside a function call. But ideally we should always add them.
This issue mentions the problems with text-only pipelines and TTSTextFrame, so perhaps this could be limited to TTSSpeakFrames only for now?
Alternately, could this be a parameter that is set somewhere in the pipeline or on a processor?
Such as (the last option is sketched after this list):
- TTSSpeakFrame -- add_to_assistant_contexts: bool # Add to any downstream assistant contexts
- TTSSpeakFrame -- add_to_context: LLMAssistantContextAggregator # Add to a specific assistant context
- LLMAssistantContextAggregator -- add_tts_speak_frames: bool # Add all downstream TTSSpeakFrames
- etc.
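As a concrete (purely hypothetical) example of the processor-level option, the aggregator could simply take an opt-in flag:

```python
from pipecat.frames.frames import Frame, TTSSpeakFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class AssistantAggregatorWithSpeakOpt(FrameProcessor):
    """Sketch only: opt in to adding all downstream TTSSpeakFrames to the context."""

    def __init__(self, *, add_tts_speak_frames: bool = False):
        super().__init__()
        self._add_tts_speak_frames = add_tts_speak_frames

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if self._add_tts_speak_frames and isinstance(frame, TTSSpeakFrame):
            self._append_to_context(frame.text)
        await self.push_frame(frame, direction)

    def _append_to_context(self, text: str):
        # Placeholder for appending an assistant message to the LLM context.
        print(f"assistant (via TTSSpeakFrame): {text}")
```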
Another option would be a convenience method to manually append text to an assistant context.
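That could be as small as a helper around the context itself (assuming OpenAILLMContext and its add_message method; the helper name and the exact message shape are only illustrative):

```python
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext


def append_assistant_text(context: OpenAILLMContext, text: str):
    # Hypothetical convenience helper: manually record text as something the
    # assistant "said", without going through any frame aggregation.
    context.add_message({"role": "assistant", "content": text})
```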
@aconchillo should we close this one?
Closing this as enough things in the aggregators have changed.