Feature Request: Expose More Native Gemini Live API Parameters in `GeminiMultimodalLiveLLMService`

Open getchannel opened this issue 8 months ago • 3 comments

Requesting broader access to the native Gemini Multimodal Live API for specific use cases. For example:

  1. Duplicate Responses with Tools: Without access to tool_config.function_calling_config.mode, it's difficult to prevent the model from sometimes generating two responses when a tool (like Google Search) is used – one initial response from training data, and a second grounded response after the tool runs. Setting mode: ANY in the native API can address this.
  2. Limited Control Over VAD: The native API now offers detailed VAD configuration via realtimeInputConfig, which isn't fully exposed.
  3. Session Management: Fine-grained control over sessionResumption and direct handling of GoAway messages aren't available.
  4. Context Management: Explicit configuration of contextWindowCompression for longer sessions isn't exposed.
  5. Media Configuration: Direct control over mediaResolution isn't exposed.

We request that the GeminiMultimodalLiveLLMService be updated to expose more of the configuration parameters available in the native Google Gemini Live API. Key parameters that would be valuable include:

  • tool_config: Specifically function_calling_config with its mode parameter (AUTO, ANY, NONE).
  • realtimeInputConfig: To allow finer control over Voice Activity Detection settings (e.g., disabling automatic VAD).
  • sessionResumption: Exposing the SessionResumptionConfig options.
  • contextWindowCompression: Exposing ContextWindowCompressionConfig options (like sliding_window).
  • mediaResolution: Allowing explicit setting of media resolution (e.g., MEDIA_RESOLUTION_LOW, MEDIA_RESOLUTION_MEDIUM, MEDIA_RESOLUTION_HIGH).
  • Access to Server Events: Potentially providing ways to hook into or be notified of server events like GoAway or GenerationComplete.

These could potentially be added to the existing InputParams or a new dedicated configuration object passed to the GeminiMultimodalLiveLLMService constructor.
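To make the "dedicated configuration object" idea concrete, here is a minimal sketch of what such an object might look like. `LiveApiOptions` and `to_setup_fields` are hypothetical names, not part of Pipecat, and the nested key names (`automaticActivityDetection`, `handle`, `slidingWindow`) as well as the placement of each field in the real setup message are assumptions that should be checked against the Gemini Live API reference.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Modes named in this issue for function_calling_config.
VALID_FUNCTION_CALLING_MODES = ("AUTO", "ANY", "NONE")


@dataclass
class LiveApiOptions:
    """Hypothetical container for the native parameters requested above."""

    function_calling_mode: Optional[str] = None
    disable_automatic_vad: Optional[bool] = None
    session_resumption_handle: Optional[str] = None
    sliding_window: bool = False
    media_resolution: Optional[str] = None  # e.g. "MEDIA_RESOLUTION_LOW"

    def to_setup_fields(self) -> Dict[str, Any]:
        """Return only the fields that were explicitly set.

        Key names follow the spelling used in this issue; the nested
        keys are assumptions, not verified against the native API.
        """
        fields: Dict[str, Any] = {}
        if self.function_calling_mode is not None:
            if self.function_calling_mode not in VALID_FUNCTION_CALLING_MODES:
                raise ValueError(
                    f"unknown function calling mode: {self.function_calling_mode}"
                )
            fields["tool_config"] = {
                "function_calling_config": {"mode": self.function_calling_mode}
            }
        if self.disable_automatic_vad is not None:
            fields["realtimeInputConfig"] = {
                "automaticActivityDetection": {"disabled": self.disable_automatic_vad}
            }
        if self.session_resumption_handle is not None:
            fields["sessionResumption"] = {"handle": self.session_resumption_handle}
        if self.sliding_window:
            fields["contextWindowCompression"] = {"slidingWindow": {}}
        if self.media_resolution is not None:
            fields["mediaResolution"] = self.media_resolution
        return fields
```

Leaving every option unset yields an empty dict, so a service that merges these fields into its setup message would keep its current defaults unless a caller opts in.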

Currently, the primary alternative is to modify the system_instruction prompt to guide the model's behavior (e.g., asking it to wait for tool results). However, this is less direct and potentially less reliable than controlling the underlying API parameters.

Exposing these native parameters would significantly enhance the flexibility and power of the GeminiMultimodalLiveLLMService for Pipecat users. It would allow for more robust handling of tool calls, better management of long-running sessions, optimized media handling, and closer alignment with the full capabilities of the Gemini Live API. The relevant service implementation appears to be in src/pipecat/services/gemini_multimodal_live/gemini.py.
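For the server-event part of the request, one possible shape is a small hook registry that the service calls for each decoded server message. This is only a sketch: `ServerEventHooks`, `on`, and `dispatch` are hypothetical names, and the message keys (`goAway`, `serverContent`, `generationComplete`) are assumed from the event names mentioned in this issue rather than taken from the Pipecat source.

```python
from typing import Any, Callable, Dict, List, Tuple

Handler = Callable[[Dict[str, Any]], None]


class ServerEventHooks:
    """Hypothetical registry mapping server events to user callbacks."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def on(self, event: str, handler: Handler) -> None:
        """Register a callback for an event such as "go_away"."""
        self._handlers.setdefault(event, []).append(handler)

    def dispatch(self, message: Dict[str, Any]) -> List[str]:
        """Call handlers for each event found in one decoded message.

        Returns the names of the events detected, in detection order,
        whether or not any handler was registered for them.
        """
        events: List[Tuple[str, Dict[str, Any]]] = []
        if "goAway" in message:
            events.append(("go_away", message["goAway"]))
        content = message.get("serverContent") or {}
        if content.get("generationComplete"):
            events.append(("generation_complete", content))

        detected: List[str] = []
        for name, payload in events:
            for handler in self._handlers.get(name, []):
                handler(payload)
            detected.append(name)
        return detected
```

A caller could register a `go_away` handler that saves the latest session resumption handle and reconnects before the server closes the connection.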

Thank you for considering this request to enhance the integration with the Gemini Live API.

getchannel, Apr 17 '25 14:04

We're working on this. We've added a few items already:

  • RealtimeInputConfig
  • ContextWindowCompression
  • MediaResolution

We'll add more over time.

markbackman, May 06 '25 16:05

Can you also add Language params? Flash 2.0 001 supports various language codes that introduce subtle accents, which are great for deploying regional voice agents.

AmolDerickSoans, May 15 '25 16:05

> Can you also add Language params? Flash 2.0 001 supports various language codes that introduce subtle accents, which are great for deploying regional voice agents.

language is already a supported InputParam: https://docs.pipecat.ai/server/services/s2s/gemini#param-language

Supported codes: https://docs.pipecat.ai/server/services/s2s/gemini#language-support

markbackman, May 16 '25 15:05

Hi, +1 to requiring support for session resumption.

bo-socayo, Jun 07 '25 02:06