tensorflow-wavenet

Get the speaker ID from a wave file

Open arpitbaheti opened this issue 7 years ago • 5 comments

Hi,

Does anyone know how to use the WaveNet implementation to return the speaker ID for a wave file given as input, instead of generating a wave file for a given speaker?

Thanks, Arpit

arpitbaheti, Mar 09 '17 06:03

@arpitbaheti In the original paper, DeepMind mentions using WaveNet for speech recognition. Try using that architecture for any sort of classification task on your raw waveforms. It's basically an average pooling layer with a bunch of normal conv layers on top. We tried it for F0 estimation and it worked really well.

belevtsoff, Mar 26 '17 20:03
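
A minimal sketch of what an average pooling step over time does to raw-waveform activations. It is written in plain NumPy rather than TensorFlow, and the function name `mean_pool_frames`, the 16 kHz sample rate, the 160-sample window (10 ms at that rate), and the 32 channels are all assumptions for illustration, not code from the repository.

```python
import numpy as np

def mean_pool_frames(activations, window=160):
    """Average consecutive `window` time steps: (T, C) -> (T // window, C)."""
    num_steps, channels = activations.shape
    num_frames = num_steps // window
    # Drop trailing samples that do not fill a whole frame.
    trimmed = activations[:num_frames * window]
    return trimmed.reshape(num_frames, window, channels).mean(axis=1)

# One second of hypothetical 16 kHz activations with 32 channels.
activations = np.random.default_rng(0).standard_normal((16000, 32))
frames = mean_pool_frames(activations)
print(frames.shape)  # (100, 32)
```

The conv layers mentioned in the reply would then run on the 100 coarse frames instead of the 16000 per-sample activations.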

Thanks for your answer, @belevtsoff. The original paper mentions "adding a mean pooling layer after dilated convolutions that aggregated the activations to coarser frames spanning 10 ms (160x down-sampling)." So what exactly does the average pooling do? Does it just reduce the time dimension of the input to a particular length? I tried the same architecture: average pool1d on the skip outputs, followed by two conv1d layers, then a softmax on the predicted output against the target speaker (a single integer representing the speaker ID). The problem is that the shapes of the logits and the target do not match. You said you tried F0 estimation; can you please tell me how you did that?

arpitbaheti, Apr 10 '17 04:04
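
One possible way around the shape mismatch described above, sketched in NumPy under stated assumptions: if the conv layers emit one logit vector per frame, averaging over the time axis gives a single logit vector per utterance, which can then be compared with a single integer speaker ID. The helper names, the 100 frames, and the 10 speakers are hypothetical, and this is not necessarily what was done for F0 estimation.

```python
import numpy as np

def utterance_logits(frame_logits):
    """Collapse per-frame logits (num_frames, num_speakers) to (num_speakers,)."""
    return frame_logits.mean(axis=0)

def softmax_cross_entropy(logits, target_id):
    """Cross-entropy between one logit vector and one integer class label."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_id]

# Hypothetical sizes: 100 frames per utterance, 10 speakers.
num_frames, num_speakers = 100, 10
rng = np.random.default_rng(0)
frame_logits = rng.standard_normal((num_frames, num_speakers))
speaker_id = 3  # single integer target for the whole utterance

logits = utterance_logits(frame_logits)
loss = softmax_cross_entropy(logits, speaker_id)
predicted_id = int(np.argmax(logits))
print(logits.shape, float(loss), predicted_id)
```

An alternative with the same effect on shapes would be to repeat the speaker ID once per frame and average a per-frame loss; which works better is an empirical question.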

I have the same question. Has the original poster solved the problem?

haoeryue, Jun 21 '17 02:06

I have tried many things, but WaveNet works on individual samples, and we can't predict a speaker per sample. I have modified https://github.com/buriburisuri/speech-to-text-wavenet to return the speaker ID with MFCCs as the input feature, but it doesn't work well. Is any other network known to work for speaker recognition (any RNN/LSTM)?

arpitbaheti, Jun 23 '17 06:06

@arpitbaheti All right. Actually, I want to classify or identify audio samples using time-domain features directly, rather than transformed features such as MFCCs or spectrograms. Have you tried any other methods that work on the raw waveform, as WaveNet does?

haoeryue, Jun 29 '17 13:06