DNN-for-speech-enhancement

Could you share the pre-trained model with me?

Open wangjianfly2003 opened this issue 7 years ago • 15 comments

Hi Dr. Xu,

Could you share the pre-trained model with me?

wangjianfly2003 avatar Jul 27 '17 07:07 wangjianfly2003

Hi, the initial model was not pre-trained. It simply uses random initialization.

yongxuUSTC avatar Jul 27 '17 13:07 yongxuUSTC

OK, it is here: https://github.com/yongxuUSTC/DNN-for-speech-enhancement/tree/master/toolbox/weights

This is the source code for initializing your model weights randomly and for converting the weights back for MATLAB decoding.

yongxuUSTC avatar Jul 27 '17 13:07 yongxuUSTC

Thank you very much for your kind reply, Dr. Xu. Do you mean that I don't need the pre-training step, and that the fine-tuning process alone can reach the speech enhancement quality of the model you provide in DNN_speech_enhancement_tool?

wangjianfly2003 avatar Jul 28 '17 05:07 wangjianfly2003

Yes, correct: just the fine-tuning process with random initialization. I once tried RBM-based pre-training, but it did not work.

yongxuUSTC avatar Jul 28 '17 08:07 yongxuUSTC

OK. I will try to train a new model on my collected noisy data using the fine-tuning process you provide. Thank you very much.

From reading your decoding code, I guess you use noisy speech and noisy data as the input features, and clean speech and noise as the output features, to train the model you provide. Am I right? Also, you use the “timit_aurora4_115NT_7SNRS_each190_80uuts_noisy_lsp_be_random_linux_global_mv.mat” normalization file to process the input noisy speech. I don't understand why this file is used for DNN decoding. Why not use the normalization of the output features for decoding?

wangjianfly2003 avatar Jul 28 '17 09:07 wangjianfly2003

The direct mapping is from noisy speech log-power spectra to clean speech log-power spectra. Additionally, you can also predict noise log-power spectra, ideal binary mask, or ideal ratio mask to do some post-processing.

The norm file is used for both training and decoding. During decoding, you should normalize the input noisy features, and then transform the enhanced features back to the original scale using the norm file.
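As a rough illustration of the normalize / inverse-normalize round trip described above, here is a minimal C++ sketch. It assumes the norm file holds one global mean and one global standard deviation per feature dimension; the thread does not confirm the file layout, and `NormStats`, `normalize`, and `denormalize` are hypothetical names, not functions from the repository.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the per-dimension statistics of a norm file.
// Assumption: the file stores a global mean and standard deviation per dimension.
struct NormStats {
    std::vector<float> mean;    // global mean of each feature dimension
    std::vector<float> stddev;  // global standard deviation of each dimension
};

// Normalize one feature frame before feeding it to the network:
// z[d] = (x[d] - mean[d]) / stddev[d].
std::vector<float> normalize(const std::vector<float>& x, const NormStats& s) {
    std::vector<float> z(x.size());
    for (std::size_t d = 0; d < x.size(); ++d) {
        z[d] = (x[d] - s.mean[d]) / s.stddev[d];
    }
    return z;
}

// Map a network output frame back to the original scale:
// x[d] = z[d] * stddev[d] + mean[d].
std::vector<float> denormalize(const std::vector<float>& z, const NormStats& s) {
    std::vector<float> x(z.size());
    for (std::size_t d = 0; d < z.size(); ++d) {
        x[d] = z[d] * s.stddev[d] + s.mean[d];
    }
    return x;
}
```

In this sketch the same statistics are applied in both directions, which matches the statement that one norm file serves both training and decoding.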

yongxuUSTC avatar Jul 29 '17 11:07 yongxuUSTC

Where did you find “timit_aurora4_115NT_7SNRS_each190_80uuts_noisy_lsp_be_random_linux_global_mv.mat”?

I think I used a different one: https://drive.google.com/file/d/0B5r5bvRpQ5DRR1lIV1hpZ0RLQ0E/view

yongxuUSTC avatar Jul 29 '17 11:07 yongxuUSTC

Hi Dr. Xu. I made a mistake about the norm file. I have now tried the same norm file as you used.

In the "BP_GPU.cu" file, I think the code should be modified as below to make the output units linear, that is, the second parameter should change from "cur_layer_y" to "cur_layer_x":

    cudaMemcpy(dev[0].out,cur_layer_x,n_frames*cur_layer_units*sizeof(float),cudaMemcpyDeviceToDevice);

Am I right?

wangjianfly2003 avatar Jul 31 '17 06:07 wangjianfly2003

You are right:

    cudaMemcpy(dev[0].out,cur_layer_x,n_frames*cur_layer_units*sizeof(float),cudaMemcpyDeviceToDevice);

I think I uploaded the code for ideal binary mask prediction. I commented out the sigmoid code, but forgot to change "cur_layer_y" to "cur_layer_x".

I have updated the code.
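The difference between the two buffers can be sketched on the host side in plain C++. This assumes, as the exchange suggests but does not state outright, that cur_layer_x holds the output layer's pre-activation values and cur_layer_y holds the values after the sigmoid. `OutputType` and `output_layer` are hypothetical names used only for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Which activation the output layer applies (hypothetical names).
enum class OutputType { Linear, Sigmoid };

// x holds the output layer's pre-activation values (weights * input + bias).
// Linear: pass x through unchanged, suitable for regressing log-power spectra.
// Sigmoid: squash into (0, 1), suitable for mask-like targets.
std::vector<float> output_layer(const std::vector<float>& x, OutputType type) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = (type == OutputType::Sigmoid)
                     ? 1.0f / (1.0f + std::exp(-x[i]))
                     : x[i];
    }
    return out;
}
```

With a sigmoid output every value is confined to (0, 1), so a network trained to predict normalized log-power spectra could not produce targets outside that range, which is why the linear copy matters for direct mapping.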

yongxuUSTC avatar Jul 31 '17 09:07 yongxuUSTC

Please update the "cv_bunch_single" function as well.

yongxuUSTC avatar Jul 31 '17 09:07 yongxuUSTC

Hi Dr. Xu. Today I trained a model with noisy speech log-power spectra as the input features (50 clean TIMIT speech utterances corrupted with 100 environmental noise types at -5 dB SNR) and clean speech log-power spectra as the target features. The learning rate is 0.0005, the layer sizes are 2827 (257*11), 2048, 2048, 2048, 257, the weights are randomly initialized, and the number of epochs is 35 (the value of squared_err decreased). I then used the trained model for decoding, but the result was very poor; I can't even hear the speech.

Could you tell me how to track down the cause of the problem?

Is the training set too small? Is there an error in the decoding? ...
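For reference, corrupting clean speech with noise at a fixed SNR such as -5 dB is commonly done by scaling the noise before adding it. The sketch below shows one standard way to compute that gain; it is not taken from the repository, it assumes both signals have the same length and non-zero power, and the function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean power (average of squared samples) of a signal.
double mean_power(const std::vector<double>& s) {
    double acc = 0.0;
    for (double v : s) acc += v * v;
    return s.empty() ? 0.0 : acc / static_cast<double>(s.size());
}

// Gain g such that 10*log10(P_speech / (g^2 * P_noise)) equals snr_db.
// Assumes the noise has non-zero power.
double noise_gain_for_snr(const std::vector<double>& speech,
                          const std::vector<double>& noise,
                          double snr_db) {
    double ratio = std::pow(10.0, snr_db / 10.0);
    return std::sqrt(mean_power(speech) / (mean_power(noise) * ratio));
}

// Add the scaled noise to the speech, sample by sample.
// Assumes speech and noise have the same length.
std::vector<double> mix_at_snr(const std::vector<double>& speech,
                               const std::vector<double>& noise,
                               double snr_db) {
    double g = noise_gain_for_snr(speech, noise, snr_db);
    std::vector<double> out(speech.size());
    for (std::size_t i = 0; i < speech.size(); ++i) {
        out[i] = speech[i] + g * noise[i];
    }
    return out;
}
```

At -5 dB the scaled noise carries roughly three times the power of the speech, so a model trained only on this condition sees very heavily corrupted input.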

wangjianfly2003 avatar Aug 01 '17 08:08 wangjianfly2003

Could you update your "finetune_DNN_speech_enhancement_dropout_NAT.pl", "interface.cc" and "step1_DNNenh_for 16kHz.m" files for the direct mapping model from noisy speech log-power spectra to clean speech log-power spectra? I think I only changed those three files.

wangjianfly2003 avatar Aug 01 '17 09:08 wangjianfly2003

If you want to check your code, you can map from clean to clean. If that still does not work, your code has a problem. You should apply the inverse feature normalization as I did in step1_DNNenh_for 16kHz.m; please refer to "step1_DNNenh_for 16kHz.m" for decoding. There is no problem in the decoding code.

yongxuUSTC avatar Aug 01 '17 19:08 yongxuUSTC

Hi Dr. Xu. I mapped from clean to clean, and it still does not seem to work. So I started checking the code. The mapping from 11 frames of input features to one frame of target features is correct, but frame 5 and frame 10 of the input data in para->indata are the same. I also checked frame 5 and frame 10 in dataori, and they are not the same. So I think there may be something wrong in the following code:

    for(j =0; j<= cur_frame_of_sent - para->fea_context;j++){
        for(i =0;i< para->fea_context;i++){
            for(k=0;k< para->fea_dim;k++){
                para->indata[sample_index[cur_sample]* para->layersizes[0] +k +i *para->fea_dim] = dataori[(frames_processed +j +i) *(2+para->fea_dim) +k+2];
            }
        }

I think the statement

    para->indata[sample_index[cur_sample]* para->layersizes[0] +k +i *para->fea_dim] = dataori[(frames_processed +j +i) *(2+para->fea_dim) +k+2];

should be changed to

    para->indata[sample_index[cur_sample]* para->layersizes[0] +k +i *para->fea_dim] = dataori[(frames_processed +j *para->fea_context +i) *(2+para->fea_dim) +k+2];

Am I right?
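The indexing in question can be examined in isolation with a small sketch of a stride-1 context window, which is what the source index (j + i) in the quoted loop produces. This is a simplified stand-in, not the repository's code: it drops the two extra values per frame implied by the (2+para->fea_dim) stride, writes windows in order instead of through sample_index, and `stack_context` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Build stride-1 context windows from a frame-major feature matrix.
// frames holds n_frames * fea_dim values; each window stacks `context`
// consecutive frames, and window j starts at frame j.
std::vector<float> stack_context(const std::vector<float>& frames,
                                 std::size_t n_frames,
                                 std::size_t fea_dim,
                                 std::size_t context) {
    std::vector<float> out;
    if (context == 0 || n_frames < context) return out;  // not enough frames
    const std::size_t n_windows = n_frames - context + 1;
    out.resize(n_windows * context * fea_dim);
    for (std::size_t j = 0; j < n_windows; ++j) {        // window index
        for (std::size_t i = 0; i < context; ++i) {      // frame inside window
            for (std::size_t k = 0; k < fea_dim; ++k) {  // feature bin
                out[(j * context + i) * fea_dim + k] =
                    frames[(j + i) * fea_dim + k];
            }
        }
    }
    return out;
}
```

With this stride-1 layout, frame i of window j+1 holds the same data as frame i+1 of window j, so repeated frames across neighbouring windows are expected. If the source index were (j * context + i) while the loop bound stayed at n_frames - context, the last window would read frame (n_frames - context) * context + context - 1, which lies past the final frame whenever n_frames > context and context > 1.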

wangjianfly2003 avatar Aug 02 '17 07:08 wangjianfly2003

I commented out the following code in the interface.cc file:

    /*
    i=i-1;
    for(k=129;k< 2*(para->fea_dim);k++){
        para->indata[sample_index[cur_sample]* para->layersizes[0] +k +i *para->fea_dim] =
            (dataori[(frames_processed + 0) *(2+para->fea_dim) +(k-129)+2]
            +dataori[(frames_processed + 1) *(2+para->fea_dim) +(k-129)+2]
            +dataori[(frames_processed + 2) *(2+para->fea_dim) +(k-129)+2]
            +dataori[(frames_processed + 3) *(2+para->fea_dim) +(k-129)+2]
            +dataori[(frames_processed + 4) *(2+para->fea_dim) +(k-129)+2]
            +dataori[(frames_processed + 5) *(2+para->fea_dim) +(k-129)+2])/6.0f;
    }
    */

wangjianfly2003 avatar Aug 02 '17 08:08 wangjianfly2003