[Question] Question about padding operation

Open Mountchicken opened this issue 2 years ago • 3 comments

Hi, for audio of different lengths, the padding operation in the dataset is applied to the fbank. Why not pad the waveform first and then convert it to fbank?

Mountchicken avatar May 29 '22 09:05 Mountchicken

Hi Qing,

Thanks for the question.

I think either way works. The reason that I chose to pad fbank rather than waveform was just to explicitly control the input shape (on the time dimension) of the network.
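To make "controlling the input shape on the time dimension" concrete, here is a minimal sketch of padding or truncating an fbank to a fixed number of frames. The function name `pad_or_truncate`, the zero pad value, and the list-of-frames representation are assumptions chosen for illustration and are not taken from the repository; with a tensor library the same idea would be a zero pad along the time axis.

```python
def pad_or_truncate(fbank, target_length, pad_value=0.0):
    """Return a copy of `fbank` with exactly `target_length` frames.

    `fbank` is a list of frames; each frame is a list of mel-bin values.
    (Hypothetical helper for illustration only.)
    """
    num_frames = len(fbank)
    if num_frames >= target_length:
        # Too long: keep only the first `target_length` frames.
        return [list(frame) for frame in fbank[:target_length]]
    # Too short: append constant-valued frames at the end of the time axis.
    num_bins = len(fbank[0]) if fbank else 0
    padding = [[pad_value] * num_bins for _ in range(target_length - num_frames)]
    return [list(frame) for frame in fbank] + padding


short_clip = [[1.0, 2.0], [3.0, 4.0]]            # 2 frames, 2 mel bins
long_clip = [[float(i), 0.0] for i in range(6)]  # 6 frames, 2 mel bins

print(len(pad_or_truncate(short_clip, 4)))  # 4
print(len(pad_or_truncate(long_clip, 4)))   # 4
```

Whatever the clip length, the output always has `target_length` frames, so the network sees a fixed time dimension.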

Is there an advantage you think padding the waveform would have?

-Yuan

YuanGongND avatar May 29 '22 18:05 YuanGongND

Hi @YuanGongND, thanks for the prompt reply.

  • I am interested in speech recognition, but I'm not quite familiar with this field.
  • In CV, we resize and pad images of different sizes to form a batch. In speech, the fbank is similar to an image since it is a 2D tensor, so it is reasonable to pad it. But my concern is that the fbank is generated from the waveform. For a waveform, padding is like continuing to record a bit more sound after the recording has finished, whereas padding the fbank directly seems less intuitive.
  • I also have another naive question: does it make sense to resize the fbank to a target size, just like what is done in CV?

Mountchicken avatar May 30 '22 01:05 Mountchicken

In CV, we resize and pad images of different sizes to form a batch. In speech, the fbank is similar to an image since it is a 2D tensor, so it is reasonable to pad it. But my concern is that the fbank is generated from the waveform. For a waveform, padding is like continuing to record a bit more sound after the recording has finished, whereas padding the fbank directly seems less intuitive.

I agree that padding the waveform could be better for the last element of the tensor. But in practice, for a sequence of hundreds of elements, the impact is minor.
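As a rough check of how few frames are affected, the sketch below counts the frames whose analysis window would straddle the boundary between real audio and padding if the waveform were padded instead. The settings are assumptions, not taken from this thread: 16 kHz audio, a 25 ms window (400 samples), a 10 ms shift (160 samples), and only complete windows kept. The function names are hypothetical.

```python
def count_frames(num_samples, frame_length=400, frame_shift=160):
    """Number of complete analysis windows that fit in `num_samples`."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift


def count_boundary_frames(num_real, num_total, frame_length=400, frame_shift=160):
    """Frames of the padded waveform that mix real samples with padding."""
    mixed = 0
    for k in range(count_frames(num_total, frame_length, frame_shift)):
        start = k * frame_shift
        end = start + frame_length
        # The window [start, end) contains both real samples and padding.
        if start < num_real < end:
            mixed += 1
    return mixed


num_real = 16000   # 1 s of real audio (assumed 16 kHz)
num_total = 32000  # waveform padded to 2 s

print(count_frames(num_real))                      # 98 frames of real audio
print(count_frames(num_total))                     # 198 frames after padding
print(count_boundary_frames(num_real, num_total))  # 2 frames straddle the boundary
```

Under these assumptions, only 2 of the 198 frames are computed from a mix of real samples and padding; every other frame is either entirely real audio or entirely padding. The value of the pure-padding frames can still differ between the two approaches.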

I also have another naive question: does it make sense to resize the fbank to a target size, just like what is done in CV?

I do not believe so. For the frequency dimension, all samples already have the same size, so there is no need to resize. For the time dimension, resizing means time warping, which is usually undesired.
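A small arithmetic sketch of the time-warping effect: if clips of different durations were all resized to the same number of frames, each frame would represent a different amount of time depending on the clip. The 10 ms frame shift, the target of 1000 frames, and the function name are assumptions for illustration only.

```python
def effective_frame_shift_ms(num_frames, target_length, frame_shift_ms=10.0):
    """Time step that one frame represents after resizing to `target_length` frames."""
    return frame_shift_ms * num_frames / target_length


target_length = 1000  # hypothetical fixed input length
for seconds in (1, 5, 10):
    num_frames = seconds * 100  # about 100 frames per second at a 10 ms shift
    shift = effective_frame_shift_ms(num_frames, target_length)
    print(f"{seconds} s clip: one resized frame covers {shift} ms")
```

With padding, a frame always covers the same time step regardless of clip length; with resizing, the same sound would span a different number of frames in each clip.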

-Yuan

YuanGongND avatar May 30 '22 08:05 YuanGongND