
Loss increases and NaN appears

YangangCao opened this issue 3 years ago • 9 comments

Hi, thanks for your excellent work. I extracted features from speech (PCM, 12 GB) and noise (PCM, 9 GB) and set count to 10000000. Then I ran run_train.py and got the following output:

[screenshot of the run_train.py output]

Can you help me? Thanks again!

YangangCao • Jul 23 '21

Can you tell me which dataset you used for training?

jzi040941 • Jul 25 '21

> Can you tell me which dataset you used for training?

Hi, I made some changes as follows. First, I added some clean music data to the speech, since I want to keep music when denoising. Second, the speech and noise were resampled and re-encoded to 48 kHz from non-48 kHz originals (e.g. 8 kHz or 16 kHz MP3). Could that affect the training result?

YangangCao • Jul 26 '21

I used original 48 kHz speech (concatenated into one PCM, 15 GB) and noise (concatenated into one PCM, 7.8 GB), set count to 10000000, and got an increasing loss and NaN again. When I set count to 100000, I get the following output:

[two screenshots of the training log]

The loss also seems to increase per iteration but decrease per epoch; is that normal? When count is large, the NaN seems inevitable.

YangangCao • Jul 27 '21

Hi, I found the problem. The reason for the increasing loss is the following:

```python
            # print statistics
            running_loss += loss.item()

            # for testing: running_loss is never reset, so this prints the
            # accumulated sum over all iterations, which keeps growing
            print('[%d, %5d] loss: %.3f' %
                    (epoch + 1, i + 1, running_loss))
```

Honestly, I don't quite understand why it was written like this...
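A toy stand-alone loop (with made-up loss values, not from the repo) showing why printing the accumulated `running_loss` grows every iteration even while the per-batch loss is falling, whereas a per-iteration average does not:

```python
# Made-up per-batch losses that are actually decreasing.
losses = [1.0, 0.9, 0.8, 0.7]

running_loss = 0.0
for i, loss in enumerate(losses):
    running_loss += loss
    # The accumulated sum keeps growing; the running average shrinks.
    print('[%5d] sum: %.3f  avg: %.3f' % (i + 1, running_loss, running_loss / (i + 1)))
```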

The reason for the NaN is CustomLoss.

YangangCao • Jul 27 '21

Hi @YangangCao, yes, that was careless of me: I only checked iter=1, epoch=1, which is why I didn't notice that the printed loss increases over iterations. I fixed it in commit 9de28e0.

As for the NaN error: did you check that the extracted features (r, g) are within 0~1? If not, they will make the loss NaN unless you clip them to 0~1.
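(A minimal sketch of such clipping with torch.clamp, using a dummy tensor; in rnn_train.py the targets would come from the extracted feature file instead:)

```python
import torch

# Dummy (g, r) targets that are deliberately slightly out of range.
targets = torch.rand(2, 100, 68) * 1.2 - 0.1

# Clamp into [0, 1] so fractional powers inside the loss never see negatives.
targets = torch.clamp(targets, 0.0, 1.0)
print(targets.min().item(), targets.max().item())  # 0.0 ... 1.0
```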

Thanks

jzi040941 • Jul 28 '21

> Hi @YangangCao, yes, that was careless of me: I only checked iter=1, epoch=1, which is why I didn't notice that the printed loss increases over iterations. I fixed it in commit 9de28e0.
>
> As for the NaN error: did you check that the extracted features (r, g) are within 0~1? If not, they will make the loss NaN unless you clip them to 0~1.

I have checked the features extracted from the original 48 kHz WAV: they all lie between 0 and 1 (floating-point values, lots of 0s and sparse 1s). When I set the count of extracted features to 1e5, no NaN appears (I tried more than once). However, when I set it to 1e6 or 1e7, the NaN appears again. I am not sure what the relationship between count and the NaN is.

YangangCao • Jul 29 '21

There is an error in `rnn_train.py` that makes the loss NaN:

```python
rb = targets[:,:,:34]
gb = targets[:,:,34:68]
```

but in `denoise.cpp`:

```cpp
fwrite(g, sizeof(float), NB_BANDS, f3); // gain
fwrite(r, sizeof(float), NB_BANDS, f3); // filtering strength
```

The r values can be < 0, and with this slicing they end up in gb, so torch.pow(gb, 0.5) becomes NaN.

You should change the code in `rnn_train.py` to:

```python
gb = targets[:,:,:34]
rb = targets[:,:,34:68]
```
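(A small self-contained check with dummy tensors showing how the swapped slicing triggers the NaN; the [-1, 1] range used for r here is only for illustration, and in rnn_train.py `targets` would come from the extracted feature file:)

```python
import torch

# Dummy targets laid out the way denoise.cpp writes them:
# 34 gains in [0, 1] first, then 34 filtering strengths that can be negative.
g = torch.rand(2, 100, 34)
r = torch.rand(2, 100, 34) * 2 - 1
targets = torch.cat([g, r], dim=2)

# Swapped slicing: "gb" actually picks up the strengths r.
gb_wrong = targets[:, :, 34:68]
print(torch.pow(gb_wrong, 0.5).isnan().any().item())  # True: negatives -> NaN

# Corrected slicing: gb really is the gains.
gb_right = targets[:, :, :34]
print(torch.pow(gb_right, 0.5).isnan().any().item())  # False
```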

Chen1399 • Sep 07 '21

> There is an error in `rnn_train.py` that makes the loss NaN.

Thanks, I've fixed it in #24.

jzi040941 • Sep 08 '21

There is another cause of the NaN loss: the pitch-correlation feature itself can be NaN. In `celt_lpc.cpp` the value `error` can become zero, which makes the pitch correlation NaN:

```cpp
r = -SHL32(rr,3)/error;
```

You can add a small bias to `error` so that it can never be zero:

```cpp
r = -SHL32(rr,3)/(error + 0.00001);
```

Chen1399 • Sep 29 '21