
User's audio is not received in Pipecat [finalize failed]

Open Atharv24 opened this issue 11 months ago • 8 comments

Sometimes the user keeps speaking, but the bot behaves as if the user is on mute. I did some digging with trace logs and can see that the finalize call fails just after the UserStoppedSpeakingFrame. As a result, the LLM never receives the final TranscriptionFrame, and no response is generated.


Atharv24 avatar Jan 21 '25 18:01 Atharv24

I believe the error arises here, since the Deepgram docs specify that the return value is not guaranteed: Deepgram Finalize Docs

async_client finalize method from the Deepgram Python SDK (screenshot)

Atharv24 avatar Jan 21 '25 18:01 Atharv24

Was subsequent audio transcribed by Deepgram after the finalize failed message? That is, was only that one audio sample not transcribed and subsequent chunks were transcribed?

markbackman avatar Jan 22 '25 14:01 markbackman

@markbackman Once the issue arises, it persists for a few seconds (sometimes minutes), although it fixes itself after that.

Atharv24 avatar Jan 22 '25 16:01 Atharv24

From @Atharv24 in the Pipecat Discord:

However, there were InterimTranscriptionFrames being received before the final flush failed. Maybe we can add some sort of error boundary and stitch the interim frames together even if the final flush fails?

This is a good idea, but it would be difficult to test, since we can't reliably reproduce a 'finalize failed' from Deepgram. (Maybe we can hack the client to test it?)
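As a rough illustration, the interim-stitching fallback could look something like this. This is a standalone sketch under stated assumptions: the class name and callbacks are hypothetical, not Pipecat's actual API, and it relies on Deepgram interim results being cumulative for the current utterance.

```python
from typing import Optional


class InterimAggregator:
    """Collects interim transcripts; if the final flush fails, fall back to
    the last interim result instead of dropping the utterance entirely.
    Hypothetical helper -- not part of Pipecat."""

    def __init__(self) -> None:
        self._interims: list[str] = []

    def on_interim(self, text: str) -> None:
        self._interims.append(text)

    def on_final(self, text: Optional[str]) -> str:
        # Deepgram interim results are cumulative for the current utterance,
        # so the most recent interim is the best available approximation
        # when the final transcript never arrives (e.g. finalize failed).
        if text is None:
            text = self._interims[-1] if self._interims else ""
        self._interims.clear()
        return text
```

To simulate the failure itself, one could monkeypatch the SDK client's finalize method to return False in a test harness.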

chadbailey59 avatar Jan 27 '25 17:01 chadbailey59

We also see InterimTranscriptFrames that aren't followed by a TranscriptFrame.

Is this unexpected behavior from Deepgram, or is the issue more related to how Pipecat integrates with Deepgram?

pesterhazy avatar May 14 '25 08:05 pesterhazy

> We also see InterimTranscriptFrames that aren't followed by a TranscriptFrame.

This is a possible scenario with Deepgram. If a final transcript wasn't provided, my understanding is that the confidence was low enough that a final couldn't be generated.

markbackman avatar May 14 '25 12:05 markbackman

I also experienced a similar error, and it remained persistent until we restarted the pipecat process. @markbackman , can we create a new deepgram connection when we encounter such an error?

abhijitpal1247 avatar May 17 '25 14:05 abhijitpal1247

I have the same question as @pesterhazy: Does pipecat have an integration issue with Deepgram? If yes: Can the pipecat framework 'repair' this issue on-the-fly?

I found it consistently shows this error, with the HumanMessage containing only the second half of the sentence (the content is shortened, since it's just a random test conversation):

INFO: main: AIMessageChunk: Hey there! I remember you were feeling concerned about gaining weight and struggling with chocolate cravings after dinner. How have things been going with that?
INFO: main: HumanMessage: And beers.
INFO: main: AIMessageChunk: Oh, I see! ... How's that been affecting you?
INFO: main: HumanMessage: Especially I work out after now.
INFO: main: AIMessageChunk: Got it! .. Does it feel frustrating to exercise and not see the results you hope for?
INFO: main: HumanMessage: find an error in the transcription.
INFO: main: AIMessageChunk: It seems ... Did you mean to say you're experiencing cravings for chocolate and beer even after working out?
INFO: main: HumanMessage: Or your code.

agilebean avatar May 19 '25 07:05 agilebean

This issue is super annoying. I already tried different Deepgram SDK Versions but it keeps happening regularly for a few minutes until it vanishes again. Restarting the process does not always help.

haayhappen avatar Jul 02 '25 19:07 haayhappen

Hello,

Has anyone made progress on this issue or found a workaround? I'd be very interested in any feedback or potential solutions.

Thanks in advance for your help!

obigroup avatar Jul 08 '25 20:07 obigroup

I've just run some additional tests. For me

pipecat-ai[tracing,silero,azure,google]==0.0.73 deepgram-sdk==3.8.0

is working, whereas

pipecat-ai[tracing,silero,azure,google]==0.0.75 deepgram-sdk==3.8.0

is breaking with the error above...

@markbackman maybe you have an idea what could be the culprit?

haayhappen avatar Jul 08 '25 22:07 haayhappen

> I've just run some additional tests. For me pipecat-ai[tracing,silero,azure,google]==0.0.73 deepgram-sdk==3.8.0 is working, whereas pipecat-ai[tracing,silero,azure,google]==0.0.75 deepgram-sdk==3.8.0 is breaking with the error above... @markbackman maybe you have an idea what could be the culprit?

I also tried this solution (pipecat 0.0.73 with deepgram-sdk 3.8.0), but it still doesn't work.

obigroup avatar Jul 09 '25 10:07 obigroup


Additional Context - Platform-Specific Behavior

The "finalize failed" error only occurs when using Twilio as the audio source. The same STT WebSocket client works perfectly with Daily without any issues. This platform-specific behavior suggests the problem might be related to:

  • Audio Format Differences: Twilio and Daily may send audio in different formats (sample rate, encoding, bit depth, etc.)
  • Stream Termination Handling: Different signaling mechanisms for end-of-stream between the two platforms
  • Packet Timing: Twilio might have different buffering or packet delivery patterns that affect the finalization process
  • WebSocket Headers/Metadata: Platform-specific headers or metadata that could impact the STT processing

Debugging Steps Taken:

  • ✅ Works with Daily audio streams
  • ❌ Fails with Twilio audio streams
  • Same STT client code used for both platforms

Next Steps: Need to investigate audio format specifications and stream handling differences between Twilio and Daily to identify the root cause of the finalization failure. This information should help narrow down the investigation focus to platform-specific audio handling differences.

obigroup avatar Jul 09 '25 18:07 obigroup

There was a regression in 0.0.74 when we updated the resampler. This is causing resampling to not work correctly for Telephony providers (and possibly other cases). We're going to look into this right away.

As a workaround, you can avoid resampling for Twilio by setting the audio_in_sample_rate in the PipelineParams. For example:

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            enable_metrics=True,
            enable_usage_metrics=True,
            audio_in_sample_rate=8000,
        ),
    )

Apologies for the issue!

markbackman avatar Jul 10 '25 19:07 markbackman

Thank you for the quick response and the suggested workaround!

I've tested the workaround by setting audio_in_sample_rate=8000 in multiple places:

  • In PipelineParams
  • In DeepgramSTTService
  • In FastAPIWebsocketParams

Unfortunately, this hasn't resolved the issue for me. I'm still experiencing the same problem with resampling not working correctly.

Could you provide any additional guidance or alternative workarounds while you investigate the regression? I'm happy to provide more details about my setup or help test potential fixes if needed.

obigroup avatar Jul 11 '25 09:07 obigroup

@obigroup the best practice is to set the audio_in_sample_rate and audio_out_sample_rate in the PipelineParams. Why does this work? These sample rates are passed to the StartFrame, then upon initialization, every processor reads the StartFrame and updates its sample rate accordingly. This ensures that all processors are set with a matching sample rate and avoids issues.
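Conceptually, the propagation works roughly like this. The class and field names below are simplified stand-ins for illustration, not Pipecat's exact internals:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartFrame:
    # Sample rates set in PipelineParams ride on the StartFrame.
    audio_in_sample_rate: int
    audio_out_sample_rate: int


class AudioProcessor:
    """Illustrative processor: adopts the pipeline-wide sample rate on start
    unless it was constructed with an explicit override of its own."""

    def __init__(self, sample_rate: Optional[int] = None) -> None:
        self._override = sample_rate  # explicit per-processor rate, if any
        self.sample_rate = 0

    def start(self, frame: StartFrame) -> None:
        # Every processor reads the StartFrame on initialization, so all
        # processors end up with a matching rate.
        self.sample_rate = self._override or frame.audio_in_sample_rate
```

This is why setting the rate once in PipelineParams is preferable to setting it per service: a single source of truth avoids mismatches between processors.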

Can you provide a simple example that doesn't work for you? You can run all of the foundational examples with either a WebRTC transport or the FastAPI transport. I've run a number of them, and setting the sample rates solves the issue for me.

Also, to clarify, are you seeing finalize failed on every request or just intermittently? With 0.0.74 and 0.0.75, I see it for every request when using the FastAPIWebsocketTransport (e.g. Twilio). This is resolved by setting matching sample rates. The issue has to do with silence audio that's somehow getting passed to Deepgram when resampling. We're investigating this right now.

markbackman avatar Jul 11 '25 13:07 markbackman

Unlike create_file_resampler, which always returns audio data, create_stream_resampler may sometimes return empty data.

This happens because create_stream_resampler, which uses SOXRStreamAudioResampler, maintains internal state between calls to preserve audio quality at chunk boundaries. It buffers audio internally to handle resampling more accurately, which can result in either empty output or a larger than expected output size due to this buffering behavior.

To address this, we implemented three fixes in the PR that I mentioned above:

  • Avoid creating an InputAudioRawFrame when there are no audio bytes.
  • Skip serializing a JSON message when no audio is present.
  • Fixed the VAD analyzer to process the full audio buffer as long as it contains more than the minimum required bytes per iteration, instead of only analyzing the first chunk.

This will be available on Pipecat 0.0.76.
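To illustrate why the guard matters, here is a toy stand-in for the stream resampler's buffering behavior. The class, threshold, and frame helper are hypothetical; the real resampler is soxr-based and lives in Pipecat:

```python
class BufferingResampler:
    """Toy model of a stateful stream resampler: it may hold samples back
    across calls to resample accurately at chunk boundaries, so a given
    call can legitimately return empty output."""

    def __init__(self, min_input: int) -> None:
        self._buffer = b""
        self._min_input = min_input  # bytes buffered before output is produced

    def resample(self, chunk: bytes) -> bytes:
        self._buffer += chunk
        if len(self._buffer) < self._min_input:
            return b""  # not enough buffered audio yet -> empty output
        out, self._buffer = self._buffer, b""
        return out


def make_frame(audio: bytes):
    # The guard from the first fix above: don't create an audio frame
    # (or serialize a JSON message) when there are no audio bytes.
    if not audio:
        return None
    return {"audio": audio}
```

Without the guard, the empty chunks produced during buffering would be forwarded downstream as zero-length (silence-like) frames, which is consistent with the silent-audio symptom described above.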

filipi87 avatar Jul 11 '25 18:07 filipi87

Thank you @filipi87 for the detailed explanation about the create_stream_resampler behavior and the fixes implemented in the PR. However, I've tested with Pipecat 0.0.76 and the issue persists.

Here are the logs I'm seeing:

2025-07-14 10:10:58.889 | DEBUG    | pipecat.transports.base_output:_bot_started_speaking:564 - Bot started speaking
2025-07-14 10:11:01.307 | DEBUG    | pipecat.transports.base_input:_handle_user_interruption:348 - User started speaking
2025-07-14 10:11:02.763 | DEBUG    | pipecat.transports.base_output:_bot_stopped_speaking:580 - Bot stopped speaking
2025-07-14 10:11:03.849 | DEBUG    | pipecat.transports.base_input:_handle_user_interruption:372 - User stopped speaking
2025-07-14 10:11:04.350 | WARNING  | pipecat.processors.aggregators.llm_response:push_aggregation:531 - User stopped speaking but no new aggregation received.
2025-07-14 10:11:09.347 | DEBUG    | pipecat.transports.base_input:_handle_user_interruption:348 - User started speaking
2025-07-14 10:11:11.847 | DEBUG    | pipecat.transports.base_input:_handle_user_interruption:372 - User stopped speaking

The warning message "User stopped speaking but no new aggregation received." suggests that the fixes for the stream resampler may not be fully addressing the issue, or there might be additional edge cases that need to be handled.

Could you please investigate further or provide additional guidance on how to resolve this persistent issue?

obigroup avatar Jul 14 '25 08:07 obigroup

> The warning message "User stopped speaking but no new aggregation received." suggests that the fixes for the stream resampler may not be fully addressing the issue, or there might be additional edge cases that need to be handled.

Hi @obigroup, this message isn’t related to the current issue.

It’s part of an edge case where:

  • The user's speech is detected by the VAD.
  • This interrupts the bot, which then stops speaking.
  • STT doesn’t return any transcription.

We added this message a couple of releases ago to help us understand how frequently this happens and whether we should try to handle it somehow.

If the user speaks again and that generates a transcription, you’ll see that everything is still working as expected.

We're still deciding how to handle the edge case explained above.

filipi87 avatar Jul 14 '25 12:07 filipi87

Still getting this as of today with latest everything (uv add pipecat-ai[...]). Twilio Websocket setup.

I was using an unsupported language for nova-3.

Consider improving the integration error reporting.

elnygren avatar Sep 03 '25 12:09 elnygren

> I was using an unsupported language for nova-3.
>
> Consider improving the integration error reporting.

Deepgram's connection.start() returns False on failure, but doesn't provide error details, so there's nothing Pipecat can raise in this case. This would be a good issue to file with Deepgram.
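Until the SDK surfaces error details, a caller-side guard can at least fail loudly instead of silently. A hedged sketch: that connection.start() returns False on failure is confirmed above, but the wrapper, its error message, and the FakeConnection-style usage are illustrative, not Pipecat code.

```python
import asyncio


async def start_or_raise(connection, options) -> None:
    """Illustrative wrapper: turn Deepgram's silent False into an exception."""
    started = await connection.start(options)
    if started is False:
        raise RuntimeError(
            "Deepgram connection failed to start. Check that the model/"
            "language combination is supported (e.g. language support for "
            "nova-3) and that the API key is valid."
        )
```

The error message can only guess at causes, since the SDK does not report why start() failed; richer reporting would need a change on Deepgram's side, as noted above.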

markbackman avatar Sep 03 '25 14:09 markbackman