
Error occurs when executing 'enhanced_speech = tg(noisy_speech)'

Open Yizai30 opened this issue 1 year ago • 3 comments

Traceback (most recent call last):
  File "D:\work_directory\Anti-Fraud\audios\scripts\use_noisereduce.py", line 23, in <module>
    enhanced_speech = tg(noisy_speech)
  File "D:\environment\Python\3.10.11\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\environment\Python\3.10.11\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\environment\Python\3.10.11\lib\site-packages\noisereduce\torchgate\torchgate.py", line 216, in forward
    raise Exception(f"x must be bigger than {self.win_length * 2}")
Exception: x must be bigger than 2048

How can I get past this? I'd appreciate it if anyone could help.

Yizai30 avatar Dec 10 '23 09:12 Yizai30

In my case, the input audio data shape is (513024, 2), and I solved the error by swapping the two dimensions before processing, then swapping them back afterwards.

import numpy as np
import torch

# swap dimensions 0 and 1: (samples, channels) -> (channels, samples)
print(data.shape)
data = np.swapaxes(data, 0, 1)
print(data.shape)

noisy_speech = torch.from_numpy(data)
noisy_speech = noisy_speech.float().to(device)

# speech processing
enhanced_speech = tg(noisy_speech)

# swap the dimensions back: (channels, samples) -> (samples, channels)
print(enhanced_speech.shape)
enhanced_speech = torch.transpose(enhanced_speech, 0, 1)
print(enhanced_speech.shape)

Additionally, I've run into another issue: the output sounds as if it were randomly generated, with only traces of the speaker's original voice mixed in. I also get this warning in my console:

UserWarning: Using padding='same' with even kernel lengths and odd dilation may require a zero-padded copy of the input be created (Triggered internally at ..\aten\src\ATen\native\Convolution.cpp:1009.)
  conv1d(

How do I fix it?

Yizai30 avatar Dec 10 '23 10:12 Yizai30

Hi @Yizai30,

Just wanted to let you know that the input format for this function is [batch, audio_length]. For an example, check out this notebook.
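To make the expected shape concrete, here is a minimal sketch of reshaping a stereo recording loaded as (n_samples, n_channels) into [batch, audio_length] before calling TorchGate. The sample rate, the random stand-in array, and the CPU device are assumptions for illustration only; substitute your own audio.

import numpy as np
import torch
from noisereduce.torchgate import TorchGate

sr = 16000  # assumed sample rate for this sketch
data = np.random.randn(513024, 2).astype(np.float32)  # stand-in for real stereo audio

tg = TorchGate(sr=sr).to("cpu")

# transpose (n_samples, n_channels) -> (batch, audio_length) = (2, 513024)
noisy_speech = torch.from_numpy(data.T).float()
enhanced_speech = tg(noisy_speech)  # output keeps the (batch, audio_length) layout

With the channels on the first axis, the last dimension is the audio length (513024 samples), which is well above the win_length * 2 threshold that raised the exception.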

We're also aware of the warning you encountered. It is caused by using "same" padding with an even kernel size; please see this issue.

We're working on a fix for this in a future release, but in the meantime, you can adjust the size of the smoothing filter using the freq_mask_smooth_hz and time_mask_smooth_ms parameters.

For nonstationary gating, ensure the n_movemean_nonstationary parameter is set to an odd value.
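As a rough sketch of the two suggestions above, assuming the constructor keyword names match the parameters mentioned in this thread:

from noisereduce.torchgate import TorchGate

# Hypothetical settings: adjust the mask-smoothing filter and, for
# nonstationary gating, keep the moving-average window at an odd value.
tg = TorchGate(
    sr=16000,
    nonstationary=True,
    n_movemean_nonstationary=21,   # odd value, as suggested above
    freq_mask_smooth_hz=500,       # frequency-axis smoothing of the mask
    time_mask_smooth_ms=50,        # time-axis smoothing of the mask
)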

nuniz avatar Dec 10 '23 20:12 nuniz

I've found a solution/workaround for the output shape not matching the input shape after applying noisereduce (a consequence of the UserWarning: Using padding='same' ... above). To get the exact expected shape after running the algorithm:

import torch
import torch.nn.functional as F
from noisereduce.torchgate import TorchGate


def audio_padding_before_stft(audio_tensor, hop_length, mode='constant'):
    # pad the last dimension so its length is a multiple of hop_length
    pad_amount = (hop_length - (audio_tensor.size(-1) % hop_length)) % hop_length
    if pad_amount > 0:
        pad_left = pad_amount // 2
        pad_right = pad_amount - pad_left
        audio_tensor = F.pad(audio_tensor, (pad_left, pad_right), mode=mode)
    return audio_tensor


audio_tensor, sr = ...
tg = TorchGate(sr, ...)
audio_tensor = audio_padding_before_stft(audio_tensor, tg.hop_length)

I'm not sure which padding mode is best, but the candidates I'd consider are 'constant' (used above) and 'reflect' (the default in stft). The user warning doesn't disappear, but the output now has the expected shape after processing.
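If downstream code needs the original (unpadded) length back, one possible follow-up is to crop the enhanced signal. This is only a sketch under the same assumptions as the snippet above (audio_tensor and tg as defined there); the enhanced and padded names are introduced here for illustration.

# keep the original length, pad, process, then recompute the same left pad
# that audio_padding_before_stft applied and center-crop the output
original_len = audio_tensor.size(-1)
padded = audio_padding_before_stft(audio_tensor, tg.hop_length)
enhanced = tg(padded)

pad_amount = padded.size(-1) - original_len
pad_left = pad_amount // 2
enhanced = enhanced[..., pad_left:pad_left + original_len]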

grzegorz700 avatar Aug 14 '24 09:08 grzegorz700