
About AST for Speech Enhancement

Open kaiw7 opened this issue 1 year ago • 5 comments

Hi Dr. Gong, could I ask whether the AST model can be used for the speech enhancement task? In particular, at test time, waveforms of different lengths will be fed into the trained model, so the positional encoding needs to be applied to inputs of different lengths.
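For context, here is a minimal sketch of the kind of adjustment variable-length inputs would need, assuming an AST-like model with learned positional embeddings laid out on a (frequency, time) patch grid. `resize_pos_embed` and all shapes below are illustrative, not the repo's actual API:

```python
# Sketch: adapt learned positional embeddings to an input whose time
# dimension differs from the length used at training. AST-style models
# commonly interpolate the positional embedding grid; the function name
# and shapes here are assumptions for illustration.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_shape, new_shape):
    """pos_embed: (1, old_f * old_t, dim) learned positional embeddings.
    old_shape / new_shape: (num_freq_patches, num_time_patches)."""
    old_f, old_t = old_shape
    new_f, new_t = new_shape
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, f, t) so we can use 2-D interpolation
    grid = pos_embed.reshape(1, old_f, old_t, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_f, new_t),
                         mode="bilinear", align_corners=False)
    # back to (1, new_f * new_t, dim)
    return grid.permute(0, 2, 3, 1).reshape(1, new_f * new_t, dim)

# e.g. a model trained with a 12 x 64 patch grid, tested on a shorter
# clip that yields only 25 time patches (numbers are hypothetical):
pe = torch.randn(1, 12 * 64, 768)
pe_short = resize_pos_embed(pe, (12, 64), (12, 25))  # (1, 300, 768)
```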

kaiw7 avatar May 27 '23 02:05 kaiw7

hi,

I don't know much about speech enhancement. It seems to me that an MAE model has a better chance of success; see Appendix C.2 of the AudioMAE paper.

-Yuan

YuanGongND avatar May 27 '23 05:05 YuanGongND

Hello @kaiw7 and Yuan,

I hate to cut in and share findings from my recent paper, but there's one thing I can share: the typical patch size of 16x16 may span too long a duration to capture speech content; 20 ms (as in typical speech models) was the best for me. Please find details at https://arxiv.org/pdf/2305.14079.pdf (this is a paper that specializes a similar SSL audio ViT for speech tasks, though it does not cover speech enhancement).
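For a rough sense of scale, here is a back-of-envelope check assuming the common 10 ms STFT hop used by AST-style models (illustrative, not the exact configuration from either paper):

```python
# How much time one patch covers, assuming a 10 ms hop between frames.
hop_ms = 10
patch_16x16_ms = 16 * hop_ms  # 16 time frames -> 160 ms per patch
frame_patch_ms = 2 * hop_ms   # 2 time frames  -> 20 ms per patch
print(patch_16x16_ms, frame_patch_ms)  # 160 20
```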

P.S. Yuan, your new paper "Listen, Think, and Understand" is very very interesting! https://arxiv.org/abs/2305.10790

daisukelab avatar May 27 '23 13:05 daisukelab

hi @daisukelab,

Thanks so much for adding this!! Are you referring to Table 3 of the M2D-S paper? If so, I totally agree with your point that 80f x 2t is a more appropriate patch shape than 16x16 for speech, and it is consistent with our experiment in Table 4 of the SSAST paper (in short, a frame-like patch is better for speech tasks, while 16x16 is better for general audio tasks).
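To make the comparison concrete, here is a hedged sketch of the two patchifications being contrasted, using a strided Conv2d as the patch-embedding layer; the 80-bin mel spectrogram and 998-frame input below are assumptions for illustration, not the papers' exact settings:

```python
# Square 16x16 patches vs. frame-like patches spanning all 80 mel bins
# but only 2 time frames (80f x 2t). Input sizes are hypothetical.
import torch

spec = torch.randn(1, 1, 80, 998)  # (batch, channel, mel, time)

square = torch.nn.Conv2d(1, 768, kernel_size=16, stride=16)
frame = torch.nn.Conv2d(1, 768, kernel_size=(80, 2), stride=(80, 2))

print(square(spec).flatten(2).shape)  # (1, 768, 310): 5 x 62 patches
print(frame(spec).flatten(2).shape)   # (1, 768, 499): 1 x 499 patches
```

The frame-like variant trades frequency locality for much finer time resolution, which is what makes it a better fit for speech content.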

And orthogonal to that, MAE for speech enhancement might be an interesting topic.

And thanks so much for your kind words about LTU; the LTU repo contains an interactive demo that you can play with (and also see its limitations).

-Yuan

YuanGongND avatar May 28 '23 07:05 YuanGongND

Hi @YuanGongND,

Thank you for your valuable comment! I found that I totally missed that you had already discussed this in Section 3.8 (Comparing Patch-based and Frame-based AST) of the SSAST paper! In addition, you have already compared against speech models in Table 5. Fortunately, I haven't finished my camera-ready for Interspeech; I'm hoping to mention what has been done in the SSAST paper. I'll try.

And I'd love to check out the LTU demo!

daisukelab avatar May 28 '23 15:05 daisukelab

hi @daisukelab,

Thanks so much and congrats on your Interspeech paper.

I didn't mean to ask you to add SSAST to the M2D-S paper; I just wanted to say that if two groups independently find the same result (on different models), it is likely to be valid.

-Yuan

YuanGongND avatar May 28 '23 16:05 YuanGongND