Does the pretrained model available in the releases folder work with a 48k sampling rate?

Open yugeshav opened this issue 4 years ago • 14 comments

@haoxiangsnr

Hello,

Does the FullSubNet model work with a 48k sampling rate at inference time?

Regards, Yugesh

yugeshav avatar Feb 22 '21 06:02 yugeshav

Hi, you need to downsample to 16K first.

haoxiangsnr avatar Feb 23 '21 09:02 haoxiangsnr

Does your model have any option to resample the audio data?

yugeshav avatar Feb 23 '21 10:02 yugeshav

Maybe you could use sox for resampling. Here is an example of how to do it:

sox filename.wav -r 16000 filename_16000.wav

Check this link for more info: https://stackoverflow.com/questions/23980283/sox-resample-and-convert
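For a whole folder of files, the same sox call can be wrapped in a short script. This is only a sketch: it assumes sox is installed and on the PATH, and the helper names `build_sox_command` and `resample_directory` are made up for illustration and are not part of FullSubNet.

```python
import subprocess
from pathlib import Path
from typing import List


def build_sox_command(src: Path, target_sr: int = 16000) -> List[str]:
    """Build the sox call shown above: sox <in> -r <rate> <in>_<rate>.wav"""
    dst = src.with_name(f"{src.stem}_{target_sr}{src.suffix}")
    return ["sox", str(src), "-r", str(target_sr), str(dst)]


def resample_directory(directory: Path, target_sr: int = 16000) -> None:
    """Resample every .wav file in a directory by calling sox once per file."""
    for src in sorted(directory.glob("*.wav")):
        # Skip files that an earlier run already produced.
        if src.stem.endswith(f"_{target_sr}"):
            continue
        # check=True raises CalledProcessError if sox exits with an error.
        subprocess.run(build_sox_command(src, target_sr), check=True)
```

Calling `resample_directory(Path("some_folder"))` would write a `*_16000.wav` copy next to each original file.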

haoxiangsnr avatar Feb 23 '21 11:02 haoxiangsnr

Sorry, I think you can use the FullSubNet model directly to enhance the 48K wav file at inference time.

Check this line of the project. When loading, Librosa will resample the wav file to 16K, regardless of the original sampling rate.

However, you should note that after enhancement, the saved wav file is 16K.
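To confirm which sampling rate a file actually has before and after enhancement, the WAV header can be read directly. The sketch below uses Python's standard `wave` module, which is my choice for illustration and not something the project uses; it only handles uncompressed PCM WAV files, and `wav_sample_rate` is a hypothetical helper name.

```python
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate (Hz) stored in the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()
```

For the case discussed here, the noisy input should report 48000 and the enhanced output should report 16000.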

haoxiangsnr avatar Feb 23 '21 11:02 haoxiangsnr

Thanks for the details. I tried running inference on a 48k audio file and saved the output at 16k, but the speech quality is completely lost, and sometimes there is no speech at all. Is this the expected behavior of your model?

yugeshav avatar Feb 23 '21 13:02 yugeshav

Could you please send me the wav file and the inference config?

haoxiangsnr avatar Feb 24 '21 00:02 haoxiangsnr

The input file is uploaded at this link: https://drive.google.com/file/d/1UVejws8QuAtDWuA3cyCU6nMNp1Gv2E-L/view?usp=sharing

Code changes are in config/inference/fullsubnet.toml

inherit = "config/common/fullsubnet_inference.toml"

[dataset]
path = "dataset.DNS_INTERSPEECH_inference.Dataset"

[dataset.args]
noisy_dataset = "/root/data_3tb_2/Experiments_Yugesh/Yugesh_FSN/FullSubNet-main/rc14_48k"
limit = false
offset = 0
sr = 48000

In src/inferencer/DNS_INTERSPEECH.py, line 162:

op_dir = "/root/data_3tb_2/Experiments_Yugesh/Yugesh_FSN/FullSubNet-main/outputs"
op_dir = op_dir + '/' + name + '.wav'
sf.write(op_dir, enhanced, samplerate=16000)

yugeshav avatar Feb 24 '21 04:02 yugeshav

I presume you will get the correct result by changing sr = 48000 to sr = 16000 in inference/fullsubnet.toml.

With sr = 48000, Librosa loads wav files by resampling from the original sampling rate (in your case, 48K) to 48K, which means no change. However, the pre-trained model is for 16K wav files.

If you set sr = 16000, Librosa will load wav files by resampling the original sampling rate (in this case, 48K) to 16K.
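One way to see why unresampled 48K samples confuse a model trained on 16K audio: every frequency appears three times lower than it really is. The sketch below is illustrative only. It uses NumPy rather than anything from FullSubNet, and `dominant_frequency` is a made-up helper. It generates a 3 kHz tone at 48K and reports where the tone appears when the same samples are treated as 16K audio.

```python
import numpy as np


def dominant_frequency(signal: np.ndarray, assumed_sr: int) -> float:
    """Return the frequency (Hz) of the strongest FFT bin for an assumed sampling rate."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / assumed_sr)
    return float(freqs[np.argmax(spectrum)])


true_sr = 48000
t = np.arange(true_sr) / true_sr        # one second of sample times at 48K
tone = np.sin(2 * np.pi * 3000 * t)     # 3 kHz tone sampled at 48K

# Interpreted at the true rate, the tone sits at about 3000 Hz.
print(dominant_frequency(tone, 48000))
# Interpreted as 16K audio, the same samples look like a tone at about 1000 Hz.
print(dominant_frequency(tone, 16000))
```

Speech is shifted in the same way, which is consistent with the degraded output reported above when sr was left at 48000.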

haoxiangsnr avatar Feb 24 '21 07:02 haoxiangsnr

Okay, so the FullSubNet model can only process 16k inputs, and if we give it a 48k file, Librosa takes care of the resampling?

Thanks a lot for the detailed info @haoxiangsnr

yugeshav avatar Feb 24 '21 09:02 yugeshav

@yugeshav can you share the pretrained model?

ahmedbahaaeldin avatar Mar 10 '21 08:03 ahmedbahaaeldin

The pre-trained model is here: https://github.com/haoxiangsnr/FullSubNet/releases

yugeshav avatar Mar 10 '21 10:03 yugeshav

@yugeshav which one from the archive/data file should I pick for the best performance?

ahmedbahaaeldin avatar Mar 10 '21 11:03 ahmedbahaaeldin

As per the author, it is fullsubnet.

yugeshav avatar Mar 10 '21 12:03 yugeshav

@yugeshav I changed the input to a 16k sample rate, reshaped it to (1,1,257,-1), and ran it forward through the network. The output shape is (1,2,257,-1). Is this the correct way to use it? The output sounds like noise. Or should there be some preprocessing? @haoxiangsnr

ahmedbahaaeldin avatar Mar 10 '21 12:03 ahmedbahaaeldin