noisereduce
An error occurs when executing `enhanced_speech = tg(noisy_speech)`:

```
Traceback (most recent call last):
  File "D:\work_directory\Anti-Fraud\audios\scripts\use_noisereduce.py", line 23, in <module>
```

How can I get past this? I'd appreciate any help.
In my case, the input audio data has shape (513024, 2), and I solved it by swapping the two dimensions before processing, then swapping them back after processing:
```python
import numpy as np
import torch

# swap dimensions 0 and 1: (n_samples, channels) -> (channels, n_samples)
print(data.shape)   # (513024, 2)
data = np.swapaxes(data, 0, 1)
print(data.shape)   # (2, 513024)
noisy_speech = torch.from_numpy(data)
noisy_speech = noisy_speech.float().to(device)
# speech processing
enhanced_speech = tg(noisy_speech)
# swap the dimensions back: (channels, n_samples) -> (n_samples, channels)
print(enhanced_speech.shape)
enhanced_speech = torch.transpose(enhanced_speech, 0, 1)
print(enhanced_speech.shape)
```
Additionally, I've run into another issue: the generated speech sounds as if it were random, with only some of the speaker's original voice mixed in. I also get this warning in my console:

```
UserWarning: Using padding='same' with even kernel lengths and odd dilation may require a zero-padded copy of the input be created (Triggered internally at ..\aten\src\ATen\native\Convolution.cpp:1009.)
  conv1d(
```

How do I fix it?
Hi @Yizai30,
Just wanted to let you know that the input format for this function is [batch, audio_length]. For an example, check out this notebook.
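As a minimal sketch, getting mono audio into that shape before calling the gate could look like this (`mono_audio` and `device` are placeholders for your own data and device):

```python
import torch

# mono_audio: hypothetical 1-D NumPy array of samples, loaded elsewhere
noisy_speech = torch.from_numpy(mono_audio).float().to(device)
noisy_speech = noisy_speech.unsqueeze(0)  # (audio_length,) -> (1, audio_length)
enhanced_speech = tg(noisy_speech)        # output keeps the (batch, audio_length) layout
```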
We're also aware of the warning you encountered. It is caused by using "same" padding with an even kernel size; please see this issue for details.
We're working on a fix for this in a future release, but in the meantime, you can adjust the size of the smoothing filter using the freq_mask_smooth_hz and time_mask_smooth_ms parameters.
For nonstationary gating, ensure the n_movemean_nonstationary parameter is set to an odd value.
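For example, a rough sketch of constructing the gate with those parameters might look like the following (the values are purely illustrative, and the exact constructor signature may vary between releases):

```python
from noisereduce.torchgate import TorchGate

# Illustrative settings; tune them for your own audio.
tg = TorchGate(
    sr=sr,                        # sample rate of the input audio
    nonstationary=True,           # enable nonstationary gating (if desired)
    n_movemean_nonstationary=21,  # keep this odd for nonstationary gating
    freq_mask_smooth_hz=500,      # controls the frequency-smoothing filter size
    time_mask_smooth_ms=50,       # controls the time-smoothing filter size
).to(device)
```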
I've found a workaround for the output shape not matching the input shape after applying noisereduce (one of the implications of the UserWarning: Using padding='same' ... above). To get the expected shape, pad the audio so its length is a multiple of the hop length before processing:
```python
import torch.nn.functional as F

def audio_padding_before_stft(audio_tensor, hop_length, mode='constant'):
    # pad the last dimension so its length is a multiple of hop_length
    pad_amount = (hop_length - (audio_tensor.size(-1) % hop_length)) % hop_length
    if pad_amount > 0:
        pad_left = pad_amount // 2
        pad_right = pad_amount - pad_left
        audio_tensor = F.pad(audio_tensor, (pad_left, pad_right), mode=mode)
    return audio_tensor

audio_tensor, sr = ...
tg = TorchGate(sr, ...)
audio_tensor = audio_padding_before_stft(audio_tensor, tg.hop_length)
```
I'm not sure what the best padding mode is; the candidates I'm considering are 'constant' (used here) and 'reflect' (the default in stft). The UserWarning won't disappear, but the output now has the expected shape.
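If you also need the output back at the original length, a possible variant (only a sketch, assuming the enhanced output has the same length as the padded input) is to remember the original length and crop the symmetric padding off afterwards:

```python
original_length = audio_tensor.size(-1)                    # length before padding
padded = audio_padding_before_stft(audio_tensor, tg.hop_length)
pad_left = (padded.size(-1) - original_length) // 2        # left pad added by the helper

enhanced = tg(padded)
enhanced = enhanced[..., pad_left:pad_left + original_length]  # crop back to the original length
```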