
User audio contains ticking noise when recording with AudioBufferProcessor

Open golbin opened this issue 8 months ago • 5 comments

pipecat version

0.0.58~0.0.65

Python version

3.12

Operating System

Linux, MacOS

Issue description

I'm using the AudioBufferProcessor to handle audio recording.

While the bot voice and background audio sound fine, there's a consistent ticking noise whenever the user's voice is captured.

I've attached a sample recording that highlights this issue.

Is this expected behavior? If not, could this be a configuration issue? I'd appreciate any guidance on how to resolve it.

recording-noise.m4a.zip

Thanks!

Reproduction steps

I'm using DailyTransport and AudioBufferProcessor pretty much the same way as in the samples. For the client, I'm using Daily React Native.

        DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            camera_out_enabled=False,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            vad_audio_passthrough=True,
        ),

Expected behavior

No ticking noise occurs.

Actual behavior

While the bot voice and background audio sound fine, there's a consistent ticking noise whenever the user's voice is captured.

Logs


golbin avatar Apr 24 '25 09:04 golbin

Yes, I have noticed this too. We will look into it.

markbackman avatar Apr 24 '25 14:04 markbackman

The clicking issue is because of upsampling. The AudioBufferProcessor is probably set to a higher sample rate than the user's input sample rate. Getting to a higher sample rate from a lower sample rate is tricky and not all libraries do it well. It might be we are using the libraries in a wrong way. I'll take a look.

aconchillo avatar Apr 24 '25 16:04 aconchillo

To temporarily fix this issue, try matching the sample rates so that no upsampling is needed.

aconchillo avatar Apr 24 '25 16:04 aconchillo

This is causing the problem, right?

# StartFrame option
audio_in_sample_rate: 16000
audio_out_sample_rate: 24000

It's tricky because we don't need a higher audio_in_sample_rate, but having a lower audio_out_sample_rate leads to a poor user experience...
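For reference, the arithmetic behind that rate pair can be sketched as below. The helper names and the 20 ms frame length are illustrative assumptions, not Pipecat settings; the point is only that 16000 to 24000 is a 3:2 upsampling step applied to every incoming frame.

```python
from fractions import Fraction


def resample_ratio(in_rate: int, out_rate: int) -> tuple[int, int]:
    """Return the reduced (up, down) factors for a rational rate change."""
    ratio = Fraction(out_rate, in_rate)
    return ratio.numerator, ratio.denominator


def output_samples(n_in: int, in_rate: int, out_rate: int) -> int:
    """Number of whole output samples produced from n_in input samples."""
    return n_in * out_rate // in_rate


up, down = resample_ratio(16000, 24000)
# A 20 ms frame is a hypothetical frame length chosen for illustration.
frame_in = 16000 * 20 // 1000
frame_out = output_samples(frame_in, 16000, 24000)
print(up, down, frame_in, frame_out)
```

Every user frame therefore passes through the resampler before it reaches the recording, while 24000 Hz bot audio recorded at 24000 Hz does not need it.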

golbin avatar Apr 28 '25 06:04 golbin

I have the same problem with Gemini (text mode). I'd like to record TTS with a better sampling rate (the difference is noticeable). User audio is fixed at 16000 for Gemini, I think. Is there a reason to set a single sampling rate for the AudioBuffer? When using on_track_audio_data, it could be possible to just keep both sample rates different and skip resampling altogether?

aristid avatar May 09 '25 21:05 aristid

Bump! I think this issue is quite important..

golbin avatar Jun 04 '25 15:06 golbin

Just for reference — I haven’t tested the content below yet.

Potential Solution Summary

  • Use soxr's resample_chunk for each frame, and flush on EndFrame or CancelFrame
  • Convert audio format from int16 to float32

ChatGPT

Below is an English rewrite of the guidance I gave earlier, keeping the same technical depth.


Why those crackles/ticks happen

  1. Filter state resets every chunk: calling soxr.resample() on each small buffer recreates the FIR filter, so discontinuities appear at chunk boundaries.

  2. Integer-in / integer-out processing: feeding int16 directly forces SoXR to run its integer path and round every sample, which raises quantisation noise and can clip on loud material.
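The first point can be illustrated without soxr. The sketch below is a toy, pure-Python linear-interpolation resampler (LinearStream is a hypothetical name, and linear interpolation is far simpler than soxr's FIR filter). Creating a fresh instance per chunk stands in for calling a stateless resampler on every buffer; reusing one instance stands in for a streaming resampler.

```python
class LinearStream:
    """Toy linear-interpolation resampler that keeps state between chunks."""

    def __init__(self, in_rate: int, out_rate: int):
        self.in_rate, self.out_rate = in_rate, out_rate
        self.buf = []       # input samples still needed for interpolation
        self.buf_start = 0  # absolute input index of buf[0]
        self.k = 0          # absolute index of the next output sample

    def process(self, chunk, last=False):
        self.buf.extend(chunk)
        end = self.buf_start + len(self.buf)
        out = []
        while True:
            # Output sample k sits at input position k * in_rate / out_rate.
            i, rem = divmod(self.k * self.in_rate, self.out_rate)
            if i >= end:
                break
            if i + 1 < end:
                nxt = self.buf[i + 1 - self.buf_start]
            elif last:
                nxt = self.buf[-1]  # end of stream: hold the final sample
            else:
                break  # wait for the next chunk instead of guessing
            cur = self.buf[i - self.buf_start]
            out.append(cur + (nxt - cur) * (rem / self.out_rate))
            self.k += 1
        # Drop input that no future output sample can reference.
        keep_from = min(self.k * self.in_rate // self.out_rate, end)
        del self.buf[: keep_from - self.buf_start]
        self.buf_start = keep_from
        return out


signal = [float(n) for n in range(12)]  # a simple ramp
chunks = [signal[0:4], signal[4:8], signal[8:12]]

# Reference: the whole signal resampled in one go.
whole = LinearStream(16000, 24000).process(signal, last=True)

# Stateless: a new resampler per chunk, so each chunk ends "blind".
stateless = []
for c in chunks:
    stateless += LinearStream(16000, 24000).process(c, last=True)

# Streaming: one resampler, state carried across chunk boundaries.
stream = LinearStream(16000, 24000)
streamed = []
for n, c in enumerate(chunks):
    streamed += stream.process(c, last=(n == len(chunks) - 1))

mismatches = [i for i, (a, b) in enumerate(zip(whole, stateless)) if a != b]
print("stateless differs from reference at output indices:", mismatches)
print("streaming matches reference:", streamed == whole)
```

In this toy the stateless variant gets one sample wrong at the end of every chunk except the last, which is the kind of periodic error that is heard as ticking; the streaming variant reproduces the one-shot result.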


Fix 1 — Use the streaming API (ResampleStream)

python-soxr ships a dedicated streaming resampler that keeps the filter’s internal state between calls. That alone removes most boundary clicks:

import soxr
import numpy as np

class SoxrResampler:
    def __init__(self, in_sr: int, out_sr: int,
                 channels: int = 1,
                 dtype: str = "float32",
                 quality: str = "VHQ"):
        self.stream = soxr.ResampleStream(
            in_sr, out_sr, channels, dtype=dtype, quality=quality
        )

    def process_chunk(self, pcm: bytes, last: bool = False) -> bytes:
        # 16-bit PCM → float32 in −1…1 range
        x = np.frombuffer(pcm, np.int16).astype(np.float32) / 32768.0
        # mono ↔ stereo reshape if needed:
        # x = x.reshape(-1, channels)
        y = self.stream.resample_chunk(x, last=last)
        # float32 → int16 with safe clipping
        y_int16 = np.clip((y * 32768.0).round(), -32768, 32767).astype(np.int16)
        return y_int16.tobytes()

Call process_chunk(buf, last=False) for every incoming buffer, then once with last=True when the stream ends.


Fix 2 — Run the filter in float32

The docs note that if the input is not an ndarray, SoXR converts it to float32; but if you pass an int16 array it keeps that type. Converting explicitly prevents repeated int-int rounding:

def resample_block(audio: bytes, in_rate: int, out_rate: int) -> bytes:
    if in_rate == out_rate:
        return audio

    x = np.frombuffer(audio, np.int16).astype(np.float32) / 32768.0
    y = soxr.resample(x, in_rate, out_rate, quality="HQ")
    y_int16 = np.clip((y * 32768.0).round(), -32768, 32767).astype(np.int16)
    return y_int16.tobytes()
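The int16 to float and back conversion used in both fixes can be checked in isolation. This is a minimal standard-library sketch (the function names are illustrative and 16-bit little-endian mono PCM is assumed); it does no resampling, it only shows the scaling and the clipping guard.

```python
import struct


def pcm16_to_float(pcm: bytes) -> list[float]:
    """Decode 16-bit little-endian PCM into floats in the range [-1.0, 1.0)."""
    n = len(pcm) // 2
    return [s / 32768.0 for s in struct.unpack(f"<{n}h", pcm[: n * 2])]


def float_to_pcm16(samples) -> bytes:
    """Scale floats back to int16, rounding and clipping to the legal range."""
    ints = [max(-32768, min(32767, round(s * 32768.0))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


original = struct.pack("<4h", 0, 1, -32768, 32767)
round_trip = float_to_pcm16(pcm16_to_float(original))
overshoot = float_to_pcm16([1.2, -1.5])  # values a filter might overshoot to
print(round_trip == original, struct.unpack("<2h", overshoot))
```

Without the clipping step, an overshoot such as 1.2 would not fit in int16, and with NumPy's astype it would typically wrap around rather than saturate.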

Extra checklist

  • Channel layout: For stereo, shape the array as (frames, 2) before resampling.
  • Quality preset: HQ is usually enough for speech/TTS; VHQ costs more CPU.
  • Dithering: When you must go back to 16-bit, dithering can make low-level noise less audible. python-soxr does not appear to expose a dither setting, so it would have to be applied separately.
  • Clipping guard: Always np.clip after scaling back to integers.

Quick take-away

Use ResampleStream so the filter remains coherent across buffers, and perform internal math in float32 before converting back to int16. With those two changes, the 16 kHz ↔ 24 kHz conversions should be free of crackles in normal use.

Gemini

Of course. Here is the explanation in English.

The crackling/ticking noise you're experiencing when resampling audio with the soxr library is typically caused by waveform discontinuity at the boundaries of your audio chunks. Your current code processes each audio chunk independently, which can lead to audible artifacts where the chunks meet.

The Cause

Audio is a continuous wave. When you process a long audio stream in smaller, independent chunks, the end of one resampled chunk doesn't smoothly connect to the beginning of the next. This abrupt change in the waveform is perceived as a "tick" or "crackle."

The Solution

The most effective way to solve this is to use soxr's streaming API, which is designed to handle audio chunk by chunk while maintaining state. The soxr.ResampleStream class keeps its filter state from the previous chunk so that the next one starts smoothly, preserving waveform continuity.

Here is a revised code example using the soxr.ResampleStream class for stateful, streaming resampling.

import numpy as np
import soxr

class AudioResampler:
    def __init__(self, in_rate: int, out_rate: int, dtype=np.int16):
        """
        Initializes the resampler (mono audio is assumed).
        in_rate: Input sample rate
        out_rate: Output sample rate
        dtype: Audio data type
        """
        # The third argument is the channel count (1 = mono)
        self.resampler = soxr.ResampleStream(in_rate, out_rate, 1, dtype=dtype)
        self.in_rate = in_rate
        self.out_rate = out_rate
        self.dtype = np.dtype(dtype)

    def resample_chunk(self, audio_chunk: bytes) -> bytes:
        """
        Resamples an individual audio chunk.
        """
        if self.in_rate == self.out_rate:
            return audio_chunk

        audio_data = np.frombuffer(audio_chunk, dtype=self.dtype)

        # Use the stateful resample_chunk method
        resampled_audio = self.resampler.resample_chunk(audio_data)

        return resampled_audio.tobytes()

    def flush(self) -> bytes:
        """
        Processes the final audio data remaining in the resampler's buffer.
        This must be called at the very end of the audio stream.
        """
        if self.in_rate == self.out_rate:
            return b""

        # ResampleStream has no separate flush call; passing an empty chunk
        # with last=True drains whatever is still buffered
        empty = np.zeros(0, dtype=self.dtype)
        flushed_audio = self.resampler.resample_chunk(empty, last=True)

        return flushed_audio.tobytes()

# --- Example Usage ---
async def process_audio_stream(audio_stream, in_rate, out_rate):
    # Initialize the resampler once outside the loop
    resampler = AudioResampler(in_rate, out_rate)

    # Process each chunk from the audio stream
    async for audio_chunk in audio_stream:
        resampled_chunk = resampler.resample_chunk(audio_chunk)
        # Do something with the resampled_chunk (e.g., play, save, send)
        yield resampled_chunk

    # CRITICAL: Flush the resampler after the last chunk has been processed
    last_chunk = resampler.flush()
    if last_chunk:
        yield last_chunk

Code Explanation

  1. AudioResampler Class:

    • __init__: An instance of soxr.ResampleStream is created and stored within the class. This object will maintain its internal state across multiple calls.
    • resample_chunk: This method takes a bytes chunk, converts it to a NumPy array, and passes it to self.resampler.resample_chunk(). This stateful method uses information from previous chunks to ensure the output waveform is continuous, eliminating clicks.
    • flush: After all audio chunks have been processed, the resampler may still have a few samples in its internal buffer. The flush() method of the wrapper drains this buffer by calling resample_chunk() with last=True and returns the final piece of audio. You must call this once at the end of the stream to avoid losing audio data.
  2. Asynchronous Usage:

    • Considering your original function was async def, the example shows how to use this class in an asynchronous context. The key is to create the AudioResampler instance before the loop starts and then call resample_chunk for each chunk within the loop.

Key Summary

  • Problem: Resampling independent audio chunks creates discontinuities at their boundaries.
  • Solution: Use the stateful soxr.ResampleStream class (the streaming API) instead of the stateless soxr.resample() function.
  • Crucial Step: Always call the wrapper's flush() method after processing the final audio chunk to retrieve any remaining samples from the resampler's buffer.

Claude

Looking at your code, there are several potential causes for the crackling/ticking issues during resampling between 16kHz and 24kHz. Here are some solutions to try:

1. Fix Audio Data Normalization Issues

The most common cause is audio values exceeding the int16 range after resampling:

async def resample(self, audio: bytes, in_rate: int, out_rate: int) -> bytes:
    if in_rate == out_rate:
        return audio
    
    # Convert int16 to float32 for better precision
    audio_data = np.frombuffer(audio, dtype=np.int16).astype(np.float32)
    
    # Normalize to -1.0 ~ 1.0 range
    audio_data = audio_data / 32768.0
    
    # Resample
    resampled_audio = soxr.resample(audio_data, in_rate, out_rate, quality="VHQ")
    
    # Prevent clipping and convert back to int16
    resampled_audio = np.clip(resampled_audio * 32768.0, -32768, 32767)
    result = resampled_audio.astype(np.int16).tobytes()
    
    return result

2. Safer Alternative Using scipy

Using scipy's resample_poly instead of SoXR can be a good alternative:

import math

import numpy as np
from scipy.signal import resample_poly

async def resample(self, audio: bytes, in_rate: int, out_rate: int) -> bytes:
    if in_rate == out_rate:
        return audio
    
    audio_data = np.frombuffer(audio, dtype=np.int16).astype(np.float32)
    
    # Find GCD for rational resampling
    gcd = math.gcd(in_rate, out_rate)
    up = out_rate // gcd
    down = in_rate // gcd
    
    # Resample with automatic anti-aliasing filter
    resampled_audio = resample_poly(audio_data, up, down)
    
    # Clip and convert
    resampled_audio = np.clip(resampled_audio, -32768, 32767)
    result = resampled_audio.astype(np.int16).tobytes()
    
    return result

3. Optimize SoXR Settings

If you want to continue using SoXR, try safer settings:

async def resample(self, audio: bytes, in_rate: int, out_rate: int) -> bytes:
    if in_rate == out_rate:
        return audio
    
    audio_data = np.frombuffer(audio, dtype=np.int16).astype(np.float64)
    audio_data = audio_data / 32768.0
    
    # More conservative quality setting (soxr.resample takes no precision argument)
    resampled_audio = soxr.resample(
        audio_data,
        in_rate,
        out_rate,
        quality="HQ",  # Use HQ instead of VHQ
    )
    
    # Soft clipping
    resampled_audio = np.tanh(resampled_audio) * 0.95  # Soft clipping
    resampled_audio = resampled_audio * 32767
    result = resampled_audio.astype(np.int16).tobytes()
    
    return result
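One thing worth checking before adopting the soft-clipping step: unlike a hard clip, tanh scaling changes samples that were never close to clipping. The sketch below compares the two on a few normalized levels. The function names are illustrative, and the 0.95 factor is taken from the snippet above.

```python
import math


def hard_clip(x: float) -> float:
    """Leave in-range samples untouched, saturate the rest."""
    return max(-1.0, min(1.0, x))


def soft_clip(x: float) -> float:
    """tanh-based soft clip as used in the snippet above."""
    return math.tanh(x) * 0.95


for level in (0.1, 0.5, 0.9, 1.5):
    print(f"{level:4.1f} -> hard {hard_clip(level):.3f}, soft {soft_clip(level):.3f}")
```

A mid-level sample of 0.5 comes out noticeably quieter through the soft clip, so applied to every chunk it acts as a mild compressor and adds harmonic distortion rather than only guarding against overflow.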

4. Additional Considerations

  • Buffer size: If resampling small audio chunks, consider processing with some padding before and after
  • DC offset removal: Removing DC components before resampling can help

Try the first solution (adding normalization) first, and if the problem persists, test the other methods sequentially.

golbin avatar Jul 01 '25 03:07 golbin

We're working on a fix here: https://github.com/pipecat-ai/pipecat/pull/2091. This will be in the upcoming Pipecat release.

markbackman avatar Jul 01 '25 16:07 markbackman

User audio is clean now for me, but the assistant audio recording is clicking (it was clean for me before). Not sure what has changed; the resampler shouldn't resample the assistant voice, should it? I have audio_out_sample_rate=24000, audio_in_sample_rate=16000 and I'm recording with AudioBufferProcessor at 24000. Were there other audio changes in 0.0.74?

aristid avatar Jul 04 '25 01:07 aristid

Do you hear this crackling consistently? I've just tested using the 34 foundational example and I really can't hear any crackling. In scrutinizing the audio, I thought I might have heard a faint crackle at the start, but in listening and recording more samples, it sounds clean to me. Do you have a recording that you can share?

markbackman avatar Jul 07 '25 19:07 markbackman

Also, I meant to close this one out as this issue is now fixed with 0.0.74.

markbackman avatar Jul 07 '25 19:07 markbackman

I noticed it's fixed when I increase the buffer size. So everything is fine by me now, thank you!

aristid avatar Jul 08 '25 12:07 aristid

@aristid What is the appropriate buffer size? It would be very helpful to others if this information is included in the documentation.

golbin avatar Jul 08 '25 12:07 golbin

I'm not sure what's appropriate. I have a complicated pipeline, so the setup may not be typical. If you don't set the buffer size at all, the default is 0 or "no buffer". The default has no crackling, but it's not optimal for me, because in my use case there could be an hour-long conversation where we'd need a lot of RAM. I had the buffer set to 24000 (0.5s) and the assistant voice definitely crackled all the time. After I increased it to 240000 (5s), I don't really notice it. Maybe it just occurs less often. As I said, the previous Pipecat version had user voice problems because of resampling; these are gone. What happens now I don't know; there were no problems with the assistant voice before 0.0.74.
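The durations quoted above are consistent with the buffer size being counted in bytes of 16-bit mono PCM at 24000 Hz. That is an assumption inferred from the numbers, not something confirmed here, and the helper names are illustrative:

```python
def buffer_seconds(buffer_size_bytes: int, sample_rate: int,
                   num_channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Duration of audio held by a buffer of the given size in bytes."""
    return buffer_size_bytes / (sample_rate * num_channels * bytes_per_sample)


def recording_bytes(seconds: int, sample_rate: int,
                    num_channels: int = 1, bytes_per_sample: int = 2) -> int:
    """Memory needed to keep an unbuffered recording of the given length."""
    return seconds * sample_rate * num_channels * bytes_per_sample


print(buffer_seconds(24000, 24000))    # the crackling setting
print(buffer_seconds(240000, 24000))   # the setting that sounded clean
print(recording_bytes(3600, 24000) / 1e6, "MB for one hour, mono")
```

Under the same assumption, an hour of unbuffered mono audio at 24000 Hz comes to roughly 173 MB, which is the RAM concern mentioned above.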

aristid avatar Jul 08 '25 14:07 aristid

crackling.zip Here are a few examples with buffer size 24000. These are mp3s, but the original wavs have the same problem. During the conversation there are no problems with crackling.

aristid avatar Jul 08 '25 14:07 aristid

Thank you @aristid !

golbin avatar Jul 08 '25 14:07 golbin