Improve Reliability - Fallback Processors for LLMs
Problem Statement
LLMs can fail, time out, or take too long to respond (> 5 seconds).
Here are some metrics showcasing this behaviour. (These graphs were plotted using ~30,000 metric data points from the Google AI Studio endpoint (gemini-2.0-flash-lite), OpenAI endpoints (gpt-4o and gpt-4.1-mini), and Azure OpenAI endpoints.)
In the image below you can see that the p95 latency sits around 700 ms; however, at times the latency can spike to 25 seconds or more, and the LLM endpoints themselves don't have defined SLAs/SLOs.
Also, one of the screenshots shows a reported error rate of 0.39%, which is not huge but is enough to create a bad experience for users.
As for unavailability, Google can respond with a 503 response code, as documented at https://ai.google.dev/gemini-api/docs/troubleshooting#error-codes
Whenever such issues happen, the duration of the outage/error state is indeterminate.
Proposed Solution
A FallbackLLMService processor with the ability to (see the sketch below this list):
- Take a diverse set of LLM services as input
- Switch LLMs if the primary fails
- Switch LLMs if the primary doesn't respond within a user-defined timeout (5 seconds is a good default)
- (Optional) Switch back when the primary is working again (this can be checked every time we make an LLM call with the secondary, with max_retries, in a non-blocking fashion)
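To make the proposal concrete, here is a minimal, framework-agnostic sketch of the switching logic. `FallbackCompleter`, its constructor arguments, and the `complete()` method are hypothetical names used purely for illustration, not an existing Pipecat API; a real processor would operate on frames rather than plain strings.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


class FallbackCompleter:
    """Hypothetical sketch: try each LLM callable in order, falling back on
    error or when a call exceeds the configured timeout."""

    def __init__(
        self,
        completers: Sequence[Callable[[str], Awaitable[str]]],
        timeout_secs: float = 5.0,
    ):
        self._completers = list(completers)
        self._timeout_secs = timeout_secs

    async def complete(self, prompt: str) -> str:
        last_error: Exception | None = None
        for completer in self._completers:
            try:
                # Enforce the per-call timeout so a slow primary doesn't stall the pipeline.
                return await asyncio.wait_for(completer(prompt), self._timeout_secs)
            except Exception as exc:  # includes asyncio.TimeoutError
                last_error = exc  # remember the failure and try the next service
        raise RuntimeError("All LLM services failed") from last_error
```

Switching back to the primary could then be attempted periodically while the secondary is serving traffic, as described in the optional item above.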
Alternative Solutions
No response
Additional Context
No response
Would you be willing to help implement this feature?
- [ ] Yes, I'd like to contribute
- [ ] No, I'm just suggesting
Hi there, this is actually already achievable with Parallel Pipeline.
It is definitely achievable using Parallel Pipeline, but it would be nice to have dedicated fallback processors for services like LLMs.
In my opinion, Parallel Pipeline adds computational overhead, and it also makes the code messy since you'd need to implement all sorts of gates in order to fall back.
I understand your concern; in fact, I just came across this recently. But after digging deeper into the Pipecat source code and architecture, I realized that whether you implement a "fallback processor" yourself or use something like a Function Filter as a logic gate paired with Parallel Pipeline, it's the same thing: just a FrameProcessor underneath, and it will be triggered for every frame that passes through it.
So I don't think it's necessary to create a dedicated processor just for LLM fallback, since it's all just FrameProcessors, and the result wouldn't be that different from the current approach with Parallel Pipeline. However, it's up to the Pipecat team to make the final decision, so if you have an example of the "fallback processor" you'd like to see upstream, it would be great if you could post it here.
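For reference, the gate-based approach mentioned above looks roughly like the sketch below. This assumes the ParallelPipeline and FunctionFilter classes from Pipecat (import paths and constructor signatures may differ between versions), and `primary_llm`, `fallback_llm`, and `mark_fallback()` are placeholders you would wire up to your own error/timeout detection.

```python
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.processors.filters.function_filter import FunctionFilter

# Placeholders: any two already-constructed LLM service instances.
primary_llm = ...   # e.g. your primary LLM service
fallback_llm = ...  # e.g. your backup LLM service

# Shared flag flipped by your own error/timeout watcher (not shown here).
use_fallback = False


def mark_fallback():
    # Call this when the primary LLM errors out or times out.
    global use_fallback
    use_fallback = True


async def primary_gate(frame) -> bool:
    # Let frames reach the primary branch only while it is healthy.
    return not use_fallback


async def fallback_gate(frame) -> bool:
    # Open the fallback branch once the primary has been marked bad.
    return use_fallback


llm_stage = ParallelPipeline(
    [FunctionFilter(filter=primary_gate), primary_llm],
    [FunctionFilter(filter=fallback_gate), fallback_llm],
)
```

This works, but it spreads the fallback decision across several processors, which is the "all sorts of gates" concern raised in the earlier comment.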
I do think it would be useful for LLMs (and other AI services) to have a timeout param and some on-timeout and on-error handling. AI services inconsistently send ErrorFrame or raise exceptions, so it's hard to consistently use either to detect issues without patching the source code.
Proper fallbacks require both determining when to fall back and the logic to hot-swap services. There isn't really a "recommended" or built-in way to do either of these, as far as I know.
We've made progress here:
- Most LLMs now have a `retry_on_timeout` feature that includes a configurable timeout value called `retry_timeout_secs`. This forces a retry when the LLM is too slow on the first completion.
- We've added a ServiceSwitcher class and an LLMSwitcher class, which allow you to run multiple versions of the same type of service and switch between them. You can write your own strategy. We'll be adding more built-in failover options in the future.
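For anyone landing here later, usage looks roughly like the sketch below. It is only a sketch based on the description above: the import path and the idea that `retry_on_timeout`/`retry_timeout_secs` are constructor keyword arguments are assumptions, so check the current Pipecat docs before copying it. Multiple such services can then be handed to an LLMSwitcher with a strategy of your choice.

```python
# Sketch only: the parameter names come from the comment above, but the import
# path and keyword-argument form are assumptions -- verify against your Pipecat version.
from pipecat.services.openai.llm import OpenAILLMService

primary = OpenAILLMService(
    model="gpt-4o",
    retry_on_timeout=True,    # retry the first completion if it is too slow
    retry_timeout_secs=5.0,   # configurable timeout that triggers the retry
)
```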
I'm closing out this issue based on the progress we've made.