
Guideline on test audio files

Open Mauker1 opened this issue 6 years ago • 8 comments

Hello again!

I've successfully trained the SEGAN using the same database as the original paper, and I also managed to test it by enhancing an audio file I recorded with my mic.

But when I tried to test it on another audio file I had sitting around on my computer, I ran into this error:

Loading model weights...
[*] Reading checkpoints...
[*] Read SEGAN-59750
test wave shape:  (4800000,)
test wave min:1.52587890625e-05  max:0.007797360420227051
Traceback (most recent call last):
  File "main.py", line 106, in <module>
    tf.app.run()
  File "C:\Users\mauke\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 97, in main
    c_wave = se_model.clean(wave)
  File "C:\Users\mauke\Documents\git\segan\model.py", line 520, in clean
    x_[:len(x)] = x
ValueError: could not broadcast input array from shape (293,16384) into shape (70,16384)

It seems the test audio file isn't quite what the script expects, but I did convert it to a 16 kHz .wav file. What am I missing? Are there any other requirements for the audio format?

Edit: I used sox to downsample the audio from 44.1 kHz to 16 kHz, the same way it's done in the prepare_data.sh script.
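
For reference, a quick way to sanity-check a candidate file before feeding it to main.py (hypothetical file name):

import wave

# Inspect a candidate test file (hypothetical name) before running main.py.
with wave.open('my_test_file.wav', 'rb') as w:
    rate = w.getframerate()
    channels = w.getnchannels()
    duration = w.getnframes() / float(rate)
print('rate={} Hz, channels={}, duration={:.1f} s'.format(rate, channels, duration))
# prepare_data.sh downsamples to 16 kHz, so rate should be 16000 (and channels 1 for mono input).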

Mauker1 avatar Feb 03 '18 16:02 Mauker1

It seems that the problem is related to the audio duration.

The audio I was using is five minutes long. I've cropped it to one minute, and it worked.

Is there a duration limit?

Edit: Yeah, the problem was the duration of the audio. The clean method can't handle audio longer than one batch, i.e. batch_size chunks of 2**14 samples; with my batch size of 70 that's roughly 70 seconds at 16 kHz (hence the (70, 16384) shape in the error).
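
A quick back-of-the-envelope check of the numbers in the traceback (assuming 16 kHz audio):

# Why a 5-minute file breaks this clean() with batch_size = 70 (assumes 16 kHz audio).
canvas_size = 2 ** 14                  # 16384 samples per chunk
batch_size = 70
sample_rate = 16000

print(batch_size * canvas_size / sample_rate)   # ~71.7 s of audio fits in one batch

n_samples = 5 * 60 * sample_rate                # 4,800,000 samples, matching the log above
print(-(-n_samples // canvas_size))             # 293 chunks, hence shape (293, 16384)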

Mauker1 avatar Feb 05 '18 16:02 Mauker1

I was using this clean method:

def clean(self, x):
    """ clean a utterance x
        x: numpy array containing the normalized noisy waveform
    """
    # zero pad if necessary
    remainder = len(x) % (2 ** 14)
    if remainder != 0:
        x = np.pad(x, (0, 2**14 - remainder), 'constant', constant_values=0)
    # split files into equal 2 ** 14 sample chunks
    x = np.array(np.array_split(x, int(len(x) / 2 ** 14)))
    # NOTE: this assumes the whole file fits in one batch; if len(x) exceeds
    # self.batch_size, the next line raises the broadcast error from the traceback above
    x_ = np.zeros((self.batch_size, 2 ** 14))
    x_[:len(x)] = x
    fdict = {self.gtruth_noisy[0]: x_}
    output = self.sess.run(self.Gs[0], feed_dict=fdict)[:len(x)]
    output = output.flatten()
    # remove zero padding if added
    if remainder != 0:
        output = output[:-(2**14 - remainder)]
    return output

Once I switched back to the old "clean" method, it worked on longer files. The only problem is that it's very slow.

Mauker1 avatar Feb 14 '18 18:02 Mauker1

Hey @Mauker1! Yes, this is very slow; it's a naive implementation (the easiest thing that could be done, at the cost of wasted resources :/). I have another version of this function that walks the file chunk by chunk (the one I used for many later experiments):

def clean(self, x):
    """ clean a utterance x
        x: numpy array containing the normalized noisy waveform
    """
    c_res = None
    for beg_i in range(0, x.shape[0], self.canvas_size):
        if x.shape[0] - beg_i < self.canvas_size:
            length = x.shape[0] - beg_i
            pad = self.canvas_size - length
        else:
            length = self.canvas_size
            pad = 0
        x_ = np.zeros((self.batch_size, self.canvas_size))
        if pad > 0:
            x_[0] = np.concatenate((x[beg_i:beg_i + length], np.zeros(pad)))
        else:
            x_[0] = x[beg_i:beg_i + length]
        print('Cleaning chunk {} -> {}'.format(beg_i, beg_i + length))
        fdict = {self.gtruth_noisy[0]:x_}
        canvas_w = self.sess.run(self.Gs[0],
                                 feed_dict=fdict)[0]
        canvas_w = canvas_w.reshape((self.canvas_size))
        print('canvas w shape: ', canvas_w.shape)
        if pad > 0:
            print('Removing padding of {} samples'.format(pad))
            # get rid of last padded samples
            canvas_w = canvas_w[:-pad]
        if c_res is None:
            c_res = canvas_w
        else:
            c_res = np.concatenate((c_res, canvas_w))
    # deemphasize
    c_res = de_emph(c_res, self.preemph)
    return c_res
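
If even this is too slow for long files, here is a rough, untested sketch of the same loop filling every row of the batch before each sess.run, which cuts the number of generator evaluations by roughly a factor of batch_size. It assumes the same model attributes as above (batch_size, canvas_size, gtruth_noisy, Gs, sess, preemph) and is not the repo's code:

def clean_batched(self, x):
    """ Sketch only: clean a normalized noisy waveform x by filling whole
        batches of canvases before each generator run (untested).
    """
    samples_per_batch = self.batch_size * self.canvas_size
    c_res = []
    for beg_i in range(0, x.shape[0], samples_per_batch):
        chunk = x[beg_i:beg_i + samples_per_batch]
        pad = samples_per_batch - chunk.shape[0]
        if pad > 0:
            chunk = np.concatenate((chunk, np.zeros(pad)))
        x_ = chunk.reshape((self.batch_size, self.canvas_size))
        fdict = {self.gtruth_noisy[0]: x_}
        canvas_w = self.sess.run(self.Gs[0], feed_dict=fdict)
        # flatten (batch, canvas[, 1]) back into a 1-D signal
        canvas_w = canvas_w.reshape((-1,))
        if pad > 0:
            canvas_w = canvas_w[:-pad]
        c_res.append(canvas_w)
    c_res = np.concatenate(c_res)
    # deemphasize, as in clean() above
    return de_emph(c_res, self.preemph)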

santi-pdp avatar Feb 26 '18 22:02 santi-pdp

Hi @Mauker1, please, I need help with the Loading and Prediction section, which is the last one.

I haven't been able to figure it out.

"Then the main.py script has the option to process a wav file through the G network (inference mode), where the user MUST specify the trained weights file and the configuration of the trained network." Where is that configuration made, and what exactly do I have to alter to make the system work? Thanks.

imchukwu avatar Mar 05 '18 04:03 imchukwu

I have solved that issue, but when I tried to test a sample file, the output audio was completely silent (I couldn't hear any sound). What could the problem have been?

imchukwu avatar Mar 05 '18 07:03 imchukwu

I tested another sample and it worked fine, thanks. Now I just have to test it with my own generated wav files.

imchukwu avatar Mar 05 '18 08:03 imchukwu

What versions of Python and TensorFlow are you using?

tiankong-hut avatar Jun 27 '18 11:06 tiankong-hut

I have been facing a weird issue while testing. I successfully trained the SEGAN model for 19440 iterations with a batch size of 100. During training, at every save_freq, the max and min values of the generated sample audios are printed, and almost all of them range roughly from -0.5 to +0.55.

Now, when testing the same audio file from the training set with the same weights, the output looks like this:

test wave min:-0.42119479179382324  max:0.497093141078949
[*] Reading checkpoints...
[*] Read SEGAN-19440
[*] Load SUCCESS
Cleaning chunk 0 -> 16384
gen wave, max:  [0.96146643] min:  [-0.9862874]
inp wave, max:  0.497093141078949 min:  -0.42119479179382324
canvas w shape:  (16384, 1)
Cleaning chunk 16384 -> 32768
gen wave, max:  [0.9773201] min:  [-0.9757471]
inp wave, max:  0.3213702440261841 min:  -0.2770885229110718
canvas w shape:  (16384, 1)
Cleaning chunk 32768 -> 36480
gen wave, max:  [0.99999225] min:  [-0.9999961]
inp wave, max:  0.04255741834640503 min:  -0.041153550148010254
canvas w shape:  (16384, 1)

The generated wav sounds even noisier than before, and the speech segments are extremely loud and distorted. I have no idea why this is happening. I could use some help, please.
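
One thing worth ruling out in a case like this is a mismatch between the test-time pre-processing and what was used during training (normalization and pre-emphasis). A rough numpy sketch of that kind of check, with a hand-rolled pre_emph that may differ from the repo's exact implementation and a hypothetical file name:

import numpy as np
from scipy.io import wavfile

def pre_emph(x, coeff=0.95):
    # first-order pre-emphasis: y[0] = x[0], y[n] = x[n] - coeff * x[n - 1]
    return np.concatenate((x[:1], x[1:] - coeff * x[:-1]))

fs, wav_i16 = wavfile.read('noisy_sample_16k.wav')   # hypothetical file
assert fs == 16000, 'SEGAN expects 16 kHz audio'
wave = wav_i16.astype(np.float32) / 32768.           # int16 -> roughly [-1, 1]
wave = pre_emph(wave, 0.95)                          # coeff must match the training preemph
print('test wave min:{}  max:{}'.format(wave.min(), wave.max()))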

HusainKapadia avatar Oct 24 '18 09:10 HusainKapadia