PhoneFortifiedPerceptualLoss
Cannot get the same results as the paper using your best model
Dear Tsun-An, your paper "Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement" is excellent, and I would like to reproduce it on my machine. However, I have run into some trouble, so here are a few questions about your work. I look forward to your reply.
1. When I loaded the best model provided at this link https://drive.google.com/file/d/1QP2bcmnn1yHybsmUbCj9f0xjUyRvrqJa/view and ran generate.py, I obtained metric values for the whole-utterance enhanced speech that differ from the paper, but the values for the noisy utterances match. I wonder whether this best model has not been updated. Here are my test results:
Whole-utterance enhanced speech:
Average PESQ: 2.8470, CSIG: 3.9677, CBAK: 2.9871, COVL: 3.4079, STOI: 0.9266
Whole-utterance noisy speech:
Average PESQ: 1.9747, CSIG: 3.3474, CBAK: 2.4456, COVL: 2.6344, STOI: 0.9212
2. I noticed that utterances are clipped to 16384 samples during training and validation to save VRAM. Are the noisy utterances also clipped during testing?
3. The paper states that the Wasserstein distance is used, but in your GitHub repository I found only an L1 loss between c = Φ_wav2vec(y) and ĉ = Φ_wav2vec(ŷ). Could you explain how you compute the Wasserstein distance between c and ĉ without learnable parameters serving as the critic f? In my view, all learnable parameters are in the DCUnet, so they cannot act as the k-Lipschitz function f required by the Kantorovich-Rubinstein dual. (One critic-free formulation is sketched after this message.) I would greatly appreciate a reply, or the related code sent to [email protected].
Thanks a lot, Zack Guo
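For context on question 3: one standard way to obtain a Wasserstein-type distance between the two feature sets without a learned k-Lipschitz critic is the entropy-regularized (Sinkhorn) formulation of optimal transport, treating the framewise wav2vec vectors of each utterance as an empirical distribution. The sketch below is illustrative only; the function name, cost choice, and hyperparameters are assumptions, and it is not claimed to be the paper's or the repository's implementation.

```python
import math
import torch

def sinkhorn_distance(c, c_hat, eps=0.05, n_iters=200):
    """Entropy-regularized Wasserstein distance between two sets of
    feature vectors c (N, D) and c_hat (M, D), e.g. framewise wav2vec
    representations of the clean and enhanced utterance.
    No learned critic is needed: the dual potentials u, v are obtained
    by Sinkhorn fixed-point iterations in the log domain."""
    cost = torch.cdist(c, c_hat, p=2)                          # (N, M) pairwise cost
    n, m = cost.shape
    log_mu = torch.full((n,), -math.log(n), device=c.device)   # uniform marginal over frames
    log_nu = torch.full((m,), -math.log(m), device=c.device)
    u = torch.zeros(n, device=c.device)
    v = torch.zeros(m, device=c.device)
    for _ in range(n_iters):
        u = eps * (log_mu - torch.logsumexp((v[None, :] - cost) / eps, dim=1))
        v = eps * (log_nu - torch.logsumexp((u[:, None] - cost) / eps, dim=0))
    plan = torch.exp((u[:, None] + v[None, :] - cost) / eps)   # transport plan
    return torch.sum(plan * cost)                              # approximate W distance

# Usage (hypothetical): c = wav2vec_features(clean), c_hat = wav2vec_features(enhanced)
# loss = sinkhorn_distance(c, c_hat)   # differentiable w.r.t. c_hat
```

Because both marginals are fixed (uniform over frames), the dual potentials are found by plain iterations rather than by training a critic network, and the result is still differentiable with respect to ĉ, so it can in principle serve as a training loss.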
Hi, we're glad you appreciate our work.
Sorry for the late update to this repo; the current version provides the model weights that match the best result reported in our paper.
I've tested it again using the provided weights and still obtain the same results as reported in the paper.
Would you mind telling me where I can find the evaluation metrics you are using? My results are from pesq (url) and from CSIG, CBAK, COVL (url).
Thanks again.
Sorry for asking again, but I think there are no model weights in the link you provided, only config (.json) files; they look the same as the ones in your GitHub repository. Please check it again.
Thanks a lot
The link below is the source of the evaluation metrics we used: https://ecs.utdallas.edu/loizou/speech/software.htm
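For anyone reproducing the scores in Python rather than MATLAB, PESQ and STOI can be computed with the third-party pesq and pystoi packages. The file names below are placeholders, and this is not the authors' evaluation code; in particular, CSIG, CBAK, and COVL still come from the MATLAB composite measures at the link above rather than from this sketch.

```python
# Minimal sketch: PESQ and STOI in Python (pip install pesq pystoi soundfile).
# Assumes 16 kHz clean/enhanced pairs; file names are hypothetical.
import soundfile as sf
from pesq import pesq
from pystoi import stoi

clean, fs = sf.read("clean.wav")
enhanced, _ = sf.read("enhanced.wav")

# Wide-band PESQ expects 16 kHz signals
print("PESQ:", pesq(fs, clean, enhanced, "wb"))
# STOI (set extended=True for eSTOI)
print("STOI:", stoi(clean, enhanced, fs, extended=False))
```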
I've uploaded the weights, please check again, thx!
Hello, I have a similar problem with testing. I noticed that in generate.py one is expected to write the dataset class themselves. I want to use the dataset from dataset.py, but in that case all audio is cut to one second. Is the model able to process variable-length audio? And are the scores reported in the paper based on one-second clips or whole utterances?
You can obtain full-length utterances by removing the truncation in dataset.py during testing.
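For concreteness, here is a minimal sketch of a noisy/clean pair dataset with an optional fixed-length crop; the class name, constructor arguments, and file handling are hypothetical, and the repository's dataset.py may be organized differently.

```python
import torch
from torch.utils.data import Dataset
import soundfile as sf

class PairedAudioDataset(Dataset):
    """Hypothetical noisy/clean pair dataset.
    During training, pairs are randomly cropped to `segment` samples
    (e.g. 16384) to bound VRAM; pass segment=None at test time to
    return the full-length utterance."""
    def __init__(self, pairs, segment=16384):
        self.pairs = pairs            # list of (noisy_path, clean_path)
        self.segment = segment

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        noisy, _ = sf.read(self.pairs[idx][0], dtype="float32")
        clean, _ = sf.read(self.pairs[idx][1], dtype="float32")
        if self.segment is not None and len(noisy) > self.segment:
            start = torch.randint(0, len(noisy) - self.segment + 1, (1,)).item()
            noisy = noisy[start:start + self.segment]
            clean = clean[start:start + self.segment]
        return torch.from_numpy(noisy), torch.from_numpy(clean)

# Testing (hypothetical usage): PairedAudioDataset(test_pairs, segment=None)
```

At test time, keeping batch_size=1 avoids any need to pad variable-length utterances.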
Thank you very much!