pipecat icon indicating copy to clipboard operation
pipecat copied to clipboard

LLMAssistantContextAggregator: always aggregate TTSTextFrame

Open aconchillo opened this issue 10 months ago • 9 comments

Please describe the changes in your PR. If it is addressing an issue, please reference that as well.

This is the assistant aggregator and we should add everything that is being spoekn. Because of that we should use TTSTextFrame because those are the frames that are actually spoken. We should send the aggregation as soon as the bot stops speaking.

So, we don't need to handle LLMFullResponseStartFrame and LLMFullResponseEndFrame anymore.

aconchillo avatar Feb 20 '25 23:02 aconchillo

Converting this to draft for now. I'm not 100% sure this is how it should be implemented. Even though it seems to make sense (because we only add TTSTextFrame to the context) we need to give it more thought.

aconchillo avatar Feb 21 '25 01:02 aconchillo

One good thing about this approach is that whatever we push through TTSSpeakFrame will always be part of the context, creating a unified behavior.

Because with the current approach, it is included only if sent during a function call (between LLMFullResponseStartFrame and LLMFullResponseEndFrame). Which feels like a bug, where sometimes is included and sometimes not.

filipi87 avatar Feb 21 '25 11:02 filipi87

One good thing about this approach is that whatever we push through TTSSpeakFrame will always be part of the context, creating a unified behavior.

Because with the current approach, it is included only if sent during a function call (between LLMFullResponseStartFrame and LLMFullResponseEndFrame). Which feels like a bug, where sometimes is included and sometimes not.

Yes, but another problem to consider is that currently we can support pipelines without audio, for example, text-based applications. With this change we are making it so the assistant aggregator is only useful for audio cases.

aconchillo avatar Feb 21 '25 17:02 aconchillo

Yes, but another problem to consider is that currently we can support pipelines without audio, for example, text-based applications. With this change we are making it so the assistant aggregator is only useful for audio cases.

True, I had not considered that. Agered, we need to give this more thought so we don’t miss any use cases. 👍

filipi87 avatar Feb 21 '25 19:02 filipi87

One option is to pass a text_only_pipeline: bool to PipelineParams. This woul be used when the aggregator is initialized and would result in using the current LLMFullResponseStartFrame/LLMFullResponseEndFrame behavior.

aconchillo avatar Feb 23 '25 21:02 aconchillo

One option is to pass a text_only_pipeline: bool to PipelineParams. This woul be used when the aggregator is initialized and would result in using the current LLMFullResponseStartFrame/LLMFullResponseEndFrame behavior.

But that would mean we can't have a pipeline that does both at the same time. That said, in open-sesame we have two different pipelines.

aconchillo avatar Feb 23 '25 21:02 aconchillo

What issue are we trying to solve? Is it that the TTSSpeakFrames aren't added to the context?

markbackman avatar Feb 24 '25 15:02 markbackman

What issue are we trying to solve? Is it that the TTSSpeakFrames aren't added to the context?

Yes. We currently, after a recent change, only add them if they are pushed inside a function call. But ideally we should always add it.

aconchillo avatar Feb 24 '25 15:02 aconchillo

This issue mentions the problems with text-only pipelines and TTSTextFrame, so perhaps this could be limited to TTSSpeakFrames only for now?

Alternately, could this be a parameter that is set somewhere in the pipeline or on a processor?

Such as:

  • TTSSpeakFrame --add_to_assistant_contexts: bool # Add to any downstream assistant contexts
  • TTSSpeakFrame -- add_to_context: LLMAssistantContextAggregator # Add to a specific assistant context
  • LLMAssistantContextAggregator -- add_tts_speak_frames: bool # Add all downstream TTSSpeakFrames
  • etc.

Another option would be a convenient method to manually append text to an assistant context.

larry-cmz avatar Apr 30 '25 19:04 larry-cmz

@aconchillo should we close this one?

markbackman avatar Oct 10 '25 14:10 markbackman

Closing this as enough things in the aggregators have changed.

markbackman avatar Nov 06 '25 15:11 markbackman