
AudioBufferProcessor miscomputing silence?

Open glennpow opened this issue 2 months ago • 25 comments

pipecat version

0.0.90

Python version

3.11.11

Operating System

macOS 15.6.1

Issue description

I was noticing what seemed like misalignment of the user and bot audio clips in the merged output from AudioBufferProcessor, and after looking at the code I'm wondering if the computed silence is incorrect. The code currently tracks the last frame times in properties and subtracts them from the current time to determine how much silence to add to the output streams. However, it also appends the actual audio clip to the stream, so shouldn't the silence calculation take the duration of that clip into account? I.e. for the user audio stream:

            frame_time = frame.num_frames / frame.sample_rate  # This should be added to the last timestamp?
            self._last_user_frame_at = time.time() + frame_time
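
To make the proposed fix concrete, here is a minimal, self-contained sketch of the bookkeeping (the class and method names are illustrative, not pipecat's actual API; only the timestamp math mirrors the snippet above):

```python
class SilenceTracker:
    """Tracks when the last user clip *ended*, so the silence inserted
    before the next clip doesn't re-count the clip's own duration."""

    def __init__(self, start: float = 0.0):
        self._last_user_frame_at = start

    def silence_before(self, now: float) -> float:
        # Gap between the end of the previous clip and `now`.
        return max(0.0, now - self._last_user_frame_at)

    def on_user_frame(self, now: float, num_frames: int, sample_rate: int) -> None:
        frame_time = num_frames / sample_rate  # clip duration in seconds
        # Mark the clip's END (now + frame_time), not its start, so the
        # next silence computation begins where this clip's audio stops.
        self._last_user_frame_at = now + frame_time
```

With a 1-second clip arriving at t=1.0, the marker advances to 2.0, so a clip arriving at t=2.5 gets 0.5 s of silence inserted rather than 1.5 s.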

Reproduction steps

Save merged audio output.

Expected behavior

User and bot audio should be properly aligned.

Actual behavior

Audio isn't aligned.

Logs


glennpow avatar Oct 14 '25 15:10 glennpow

Can you share a simple repro case demonstrating the issue?

We recently fixed an alignment issue and our testing confirms all is well. It's possible that you're using the AudioBufferProcessor in a way that we haven't tested, so sharing a repro case will help.

markbackman avatar Oct 14 '25 15:10 markbackman

@markbackman Perhaps I wasn't using the latest code, but doesn't my calculation still stand? I made a diagram showing the currently computed silence time versus what I'd expect. The current code doesn't take the duration of the audio clip into account, so this should be accurate, right? (The diagram may be wrong to use the audio start times as the event triggers rather than the audio stops, which I believe is what actually happens, but it should still work for explanatory purposes.)

[Diagram: currently computed silence vs. expected silence]

glennpow avatar Oct 15 '25 23:10 glennpow

@glennpow I've also been seeing this behavior for the last 4 to 5 days: the user's audio is delayed in the recording.

In the actual call I answered every bot question immediately, as soon as the bot stopped speaking.

But in the recordings the user's audio comes through significantly delayed.

I'm attaching my audio recording below; please give it a listen:

merged_20251016_101014.wav

ParthShindenovus avatar Oct 17 '25 04:10 ParthShindenovus

@ParthShindenovus Yes, and actually unfortunately it seems like the solution I propose above doesn't fully fix the issue. @markbackman Can you confirm that when you make a recording now the merged audio is always properly aligned?

glennpow avatar Oct 17 '25 14:10 glennpow

@glennpow yes, my audio is fully aligned in all of the scenarios that I've tested. If you have a single file repro that shows misalignment, that would be very helpful. @ParthShindenovus shared one in Discord, but I have no misalignment in running it.

markbackman avatar Oct 18 '25 13:10 markbackman

@markbackman I just posted an audio clip to Discord.

glennpow avatar Oct 18 '25 15:10 glennpow

> @markbackman I just posted an audio clip to Discord.

Thanks for the clip, but what I really need is a simple, single file repro of the issue. Ideally, something that takes the 34-audio-recording.py example and modifies it in a way that makes this issue reproducible.

markbackman avatar Oct 19 '25 16:10 markbackman

I'm also facing the same issue when I integrate MCPClient with the agent. When I remove MCPClient, it works fine.

I know it's super weird, but I took 34-audio-recording.py (which works fine) and kept adding code from my agent, testing after each piece of functionality. The issue only appears when I add the MCPClient tools; I'm not sure why.

rohitkhatri avatar Oct 28 '25 13:10 rohitkhatri

This issue is happening in the latest version as well.

ayubSubhaniya avatar Nov 01 '25 10:11 ayubSubhaniya

@rohitkhatri thanks for sharing a repro for this! We'll take a look this week.

@ayubSubhaniya are you also using the MCPClient?

markbackman avatar Nov 01 '25 11:11 markbackman

We’re currently live in production and rely on the audio-merging functionality of the library for our sentiment module. Because of this silence-computation bug, we’re unable to put our audio sentiment module on top.

Could you let us know if there is a temporary workaround we could apply (for example, manual silence insertion, adjusting timestamps, or using an alternate processing path) until the fix is rolled out? Additionally, do you have an estimated timeframe for when this bug might be resolved in a stable release (or a nightly build)?

Thanks again for your support!

piyushjain0106 avatar Nov 01 '25 13:11 piyushjain0106

@piyushjain0106 are you using the MCPClient?

markbackman avatar Nov 01 '25 17:11 markbackman

@markbackman OP here. I've never used the MCPClient, so I'm fairly certain this has nothing to do with it.

glennpow avatar Nov 01 '25 17:11 glennpow

@glennpow can you please share a minimal repro (e.g. code that I can run that hits the issue) so we can investigate and fix? I can confirm that the 34-audio-recording example works as expected. @rohitkhatri confirmed that as well.

Without a repro, it will be difficult to isolate the issue as this is working correctly in example 34. It's possible that other frame processors are interfering. Do you have any custom frame processors in your Pipeline?

markbackman avatar Nov 01 '25 17:11 markbackman

@markbackman Just a thought: shouldn't the recording be raw? From reading the current code, the recording is captured after audio filters are applied, so any noise (or anything else a filter removes) is also cancelled out in the recording.

ayubSubhaniya avatar Nov 02 '25 15:11 ayubSubhaniya

The level of recording overlap also gets worse under extreme load, like 20-30 concurrent calls.

ayubSubhaniya avatar Nov 02 '25 15:11 ayubSubhaniya

> Just a thought: shouldn't the recording be raw? From reading the current code, the recording is captured after audio filters are applied, so any filtered-out noise is also cancelled in the recording.

It's a processor in the Pipeline, so it receives the InputAudioRawFrame based on its position in the Pipeline. To pick up user and bot audio aligned with the timing of the audio transmitted to the user, you place it after the transport output processor. This means the user audio is processed (augmented by filters, if present) and the bot audio is raw.

You can get raw audio, but you'd have to get it directly from the transport provider.
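
To illustrate that placement, a pipeline along the lines of the 34-audio-recording example looks roughly like this (the service names `stt`, `llm`, `tts`, and `context_aggregator` are placeholders for whatever your app uses; verify the import paths against your pipecat version):

```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.audio.audio_buffer_processor import AudioBufferProcessor

audiobuffer = AudioBufferProcessor()

pipeline = Pipeline([
    transport.input(),               # user audio, after any input filters
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    audiobuffer,                     # after output: bot audio timing matches playback
    context_aggregator.assistant(),
])
```

Placing `audiobuffer` before `transport.output()` instead would capture bot audio as it is generated, not as it is played, which changes the alignment in the merged recording.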

> The level of recording overlap also gets worse under extreme load, like 20-30 concurrent calls.

This sounds like a resourcing issue on your infrastructure. We recommend that voice bots run in their own process with 0.5 vCPU for each instance. You may require more depending on what your application does (video, video avatar, etc.). Essentially, you need to ensure that each bot has an equal and sufficient amount of resources allocated.
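
One way to apply that recommendation, sketched below with a hypothetical `run_bot` entry point (not pipecat API), is to give each call its own OS process so one bot's load can't starve another's event loop:

```python
import multiprocessing

def run_bot(call_id: str) -> None:
    # Placeholder for the real bot entry point: build the Pipeline,
    # connect the transport, and run the call to completion.
    print(f"bot {call_id} running in its own process")

def launch_call(call_id: str) -> multiprocessing.Process:
    # One process per call: a blocked or CPU-bound bot can no longer
    # delay frame timestamps in other calls on the same pod.
    proc = multiprocessing.Process(target=run_bot, args=(call_id,))
    proc.start()
    return proc
```

Each process then gets its own CPU allocation (e.g. the 0.5 vCPU per bot mentioned above), rather than 20-30 coroutines contending for one event loop.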

markbackman avatar Nov 02 '25 16:11 markbackman

Okay, thanks a lot for the reply @markbackman. On the resource side, do you recommend one process per call, e.g. something like a process-pool executor?

Or just detaching the voice call from the main IO loop via multiprocessing?

At the moment I run around 10-20 concurrent calls from the same pod using FastAPI.

ayubSubhaniya avatar Nov 02 '25 19:11 ayubSubhaniya

I'm pretty sure I'm also seeing this. I can't put together a code sample yet because of the way my setup is structured (I'll work on getting one). I checked Twilio's recording and the silence and turns were all correct, but the AudioBufferProcessor spit out a bunch of audio segments with incorrect timing.

kobicovaldev avatar Nov 03 '25 04:11 kobicovaldev

Can everyone who has experienced this issue list the following in a post:

  1. your pipecat version
  2. python version
  3. the transport type you are using
  4. where the app is hosted / where you are seeing this behavior (ie pipecat cloud, self hosted, local development)
  5. whether or not you use MCP client

for example:

  1. 0.0.92
  2. 3.12.9
  3. FastAPIWebsocketTransport
  4. pipecat cloud
  5. False

vipyne avatar Nov 03 '25 19:11 vipyne

> You can get raw audio, but you'd have to get it directly from the transport provider.

@markbackman I even tried this: https://docs.pipecat.ai/server/utilities/audio/audio-buffer-processor#event-handlers — but the recording it generates comes out jumbled for me; the order of the sentences is wrong.
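
For context, the event-handler approach from that docs page looks roughly like this (handler name and signature recalled from the linked page and may differ across pipecat versions; the output path is illustrative). Note that the handler receives audio in arrival order, so any upstream timing skew will still show up in what it writes:

```python
audiobuffer = AudioBufferProcessor()

# Recording only begins once explicitly requested, e.g. on client connect:
#   await audiobuffer.start_recording()

@audiobuffer.event_handler("on_audio_data")
async def on_audio_data(buffer, audio, sample_rate, num_channels):
    # `audio` is raw PCM bytes for the merged stream; append as it arrives.
    with open("merged.pcm", "ab") as f:
        f.write(audio)
```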

piyushjain0106 avatar Nov 03 '25 19:11 piyushjain0106

@vipyne

  1. 0.0.87
  2. 3.11
  3. FastAPIWebsocketTransport
  4. self hosted
  5. False

piyushjain0106 avatar Nov 03 '25 19:11 piyushjain0106

For me:

  1. 0.0.91
  2. 3.12.9
  3. FastAPIWebsocketTransport
  4. self hosted
  5. False

I actually found the issue in my case: I had a couple of sneaky time.sleep(X) calls in my pipeline startup that were interfering with the AudioBufferProcessor.
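
That failure mode is worth illustrating: a blocking `time.sleep()` stalls the entire asyncio event loop, delaying every frame timestamp taken while it runs, whereas `asyncio.sleep()` yields control. A minimal, self-contained demonstration (names are illustrative, not pipecat code):

```python
import asyncio
import time

async def ticker(stamps: list) -> None:
    # Simulates a processor that timestamps audio frames every 50 ms.
    for _ in range(3):
        stamps.append(time.monotonic())
        await asyncio.sleep(0.05)

async def main() -> list:
    stamps: list = []
    task = asyncio.create_task(ticker(stamps))
    # BAD: time.sleep(0.2) here would freeze the loop, delaying every
    # timestamp in ticker() and skewing any computed silence durations.
    await asyncio.sleep(0.2)  # GOOD: non-blocking; ticker keeps running
    await task
    return stamps

stamps = asyncio.run(main())
```

Swapping the `await asyncio.sleep(0.2)` for `time.sleep(0.2)` makes the three timestamps bunch up after the sleep instead of arriving ~50 ms apart.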

kobicovaldev avatar Nov 04 '25 00:11 kobicovaldev

  1. 0.0.92
  2. 3.13.2
  3. FastAPIWebsocketTransport
  4. Self hosted
  5. True

rohitkhatri avatar Nov 04 '25 08:11 rohitkhatri

  1. 0.0.92
  2. 3.11.11
  3. FastAPIWebsocketTransport
  4. Self hosted
  5. False

glennpow avatar Nov 04 '25 21:11 glennpow

  1. 0.0.96
  2. 3.13
  3. FastAPIWebsocketTransport
  4. Self hosted
  5. False

Damn, so no one has figured it out after 2 months? I even used the 34 code example, and it still had the same issues.

tuduun avatar Dec 05 '25 00:12 tuduun

  1. 0.0.93
  2. 3.12
  3. FastAPIWebsocketTransport
  4. Self hosted
  5. False

poseneror avatar Dec 08 '25 21:12 poseneror