audio
audio copied to clipboard
Hardcoded padding mode in functional.spectrogram
I wonder why the padding mode is hardcoded in functional.spectrogram? Maybe add a parameter to support reflect padding?
https://github.com/pytorch/audio/blob/main/torchaudio/functional/functional.py#L117
Hi @gormat
This is an old implementation and the context is lost, so we are not sure why it's implemented this way. From our understanding, adding extra padding like this does not make much sense (especially with the default argument where stft adds reflection padding)
What behavior do you think is ideal?
First, I think it is confusing to have two arguments for the function one called pad
and another pad_mode
which refer to entirely different things (pad
stands for tow-sided padding of the signal and pad_mode
is for stft
).
Next, migrating from different libraries to torchaudio causes numerical inconsistency due to fixed reflect padding, especially when processing short signals.
@mthrok Maybe the reason why this was implemented this way is because torch.stft
discards trailing samples if they do not fit a frame. This is specially noticeable when using center=False
as pointed out in pytorch/pytorch#70073 and in #3003.
According to @nateanl in pytorch/pytorch#70073, this is "mitigated by defining the length
argument in torch.istft
, which pads zero tensors after the inverted signal". But to me it makes more sense to pad the input of torch.stft
such that we have an integer number of frames (which might not need to be done on both sides), then analyze-synthesize, and only then trim if needed to match the original input length. It's less "lossy", and this might be why the two types of padding in torchaudio.functional.spectrogram
.
There is still the question of whether the padding modes should match. Currently the first padding mode is hardcoded to "constant"
as pointed out by OP.