tensorflow-wavenet
Local Conditioning on F0 Working (Kind of)
Here is my branch: https://github.com/ibab/tensorflow-wavenet/compare/master...dannybtran:local-conditioning
Here is the audio: https://soundcloud.com/user-763760918/wavenet-localconditioning-on-f0-of-sin-waves
I'm not sure why there is so much static during the transitions. A few guesses:
- Not enough training data.
- The training data was "pasted" together, resulting in very unrealistic "crisp" boundaries between the sine waves.
- Not enough training steps. I only did 750.
- Label sample frequency is too low (on the order of 10 Hz, i.e. 10 labels/sec). Deep Voice mentions their phoneme frequency was 260 Hz.
Appreciate any feedback.
- Couldn't you generate training data programmatically for something as simple as this? Maybe it's the fact that there are only a few transitions for it to train on? (A sketch of generating such data appears after this list.)
- I think 750 steps isn't nearly enough to assess the code.
- 10 Hz would result in an overlap of at least 1600 samples relative to the upsampling convolution, so this could definitely be a cause of why the transitions sound weird. I think the net might be undecided on which note to play. More training examples might make it develop a "threshold". A back-of-envelope check follows below.
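For what it's worth, here is a rough sketch of how such training data could be generated programmatically. The sample rate, segment length, frequency set, and JSON label format are all assumptions for illustration, not what the original branch uses.

```python
# Hypothetical generator for sine-wave training data with F0 labels.
# Sample rate, segment length, and frequency set are assumptions.
import json
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 16000          # assumed audio sample rate
SEGMENT_SECONDS = 0.5        # assumed duration of each constant-F0 segment
FREQUENCIES = [220.0, 330.0, 440.0, 550.0]  # assumed F0 inventory

def make_example(num_segments=20, seed=0):
    rng = np.random.default_rng(seed)
    audio, labels = [], []
    phase = 0.0
    for _ in range(num_segments):
        f0 = float(rng.choice(FREQUENCIES))
        n = int(SEGMENT_SECONDS * SAMPLE_RATE)
        t = np.arange(n) / SAMPLE_RATE
        # Carry the phase across segments so the waveform stays continuous
        # at the boundary instead of a hard "pasted" discontinuity.
        audio.append(np.sin(phase + 2 * np.pi * f0 * t))
        phase += 2 * np.pi * f0 * SEGMENT_SECONDS
        labels.append({"f0": f0, "duration_sec": SEGMENT_SECONDS})
    return np.concatenate(audio).astype(np.float32), labels

if __name__ == "__main__":
    audio, labels = make_example()
    wavfile.write("train_000.wav", SAMPLE_RATE, audio)
    with open("train_000.json", "w") as f:
        json.dump(labels, f)
```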
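And the rate mismatch in numbers, assuming a 16 kHz audio sample rate (not stated in the thread):

```python
# Back-of-envelope check of the label/audio rate mismatch.
import numpy as np

sample_rate = 16000   # assumed
label_rate = 10       # labels per second, as in the original post
samples_per_label = sample_rate // label_rate     # 1600 audio samples per label

# Simplest alignment: nearest-neighbour (repeat) upsampling of the label
# sequence to audio rate, instead of a learned transposed convolution.
labels = np.array([0, 0, 1, 1, 2])                # e.g. bucketed F0 indices
upsampled = np.repeat(labels, samples_per_label)  # length 5 * 1600 = 8000
print(samples_per_label, upsampled.shape)         # 1600 (8000,)
```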
That sounds pretty good. What parameters did you use? Regarding the noisy transitions, from our experiments, given enough speech training data, the model produces smooth transitions between vowels when tested on very sharp (artificial) F0 transitions.
I couldn't get your branch to work, unfortunately. I believe the correct way to perform LC is to condition per timestep across the receptive field, i.e. the sample at x-n is combined with the LC value at x-n.
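For reference, a minimal sketch of per-timestep local conditioning in one gated dilated-convolution layer, following the WaveNet paper's formulation z = tanh(W_f*x + V_f*y) ⊙ σ(W_g*x + V_g*y). This is PyTorch rather than the TF branch, and the channel sizes, causal padding, and the assumption that the conditioning tensor is already upsampled to one vector per audio sample are illustrative.

```python
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    """One gated dilated-conv layer with per-timestep local conditioning."""
    def __init__(self, residual_channels, cond_channels, dilation):
        super().__init__()
        self.filter_conv = nn.Conv1d(residual_channels, residual_channels,
                                     kernel_size=2, dilation=dilation)
        self.gate_conv = nn.Conv1d(residual_channels, residual_channels,
                                   kernel_size=2, dilation=dilation)
        # 1x1 projections of the (already upsampled) conditioning signal.
        self.cond_filter = nn.Conv1d(cond_channels, residual_channels, 1)
        self.cond_gate = nn.Conv1d(cond_channels, residual_channels, 1)

    def forward(self, x, y):
        # x: (batch, residual_channels, T), y: (batch, cond_channels, T)
        pad = self.filter_conv.dilation[0]      # causal left padding
        xp = nn.functional.pad(x, (pad, 0))
        f = self.filter_conv(xp) + self.cond_filter(y)
        g = self.gate_conv(xp) + self.cond_gate(y)
        return torch.tanh(f) * torch.sigmoid(g)
```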
Anyone have working LC?
@dannybtran Could you show me how to extract the JSON labels in your branch, please? Thanks!
@dannybtran Thanks so much for making this issue and explaining what you did. It inspired me to do a similar experiment with my own wavenet implementation in pytorch.
Training data consisted of sequences of random frequencies. Conditioning was created by bucketing the frequencies and passing the bucket indices into the model, which used an embedding to convert them into a vector that was added to each layer before the activation. Code here, feedback welcome! https://colab.research.google.com/drive/1yCW66z7bsSECFp0lboL0Tznml9BsR02B.
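A minimal sketch of that bucketing-and-embedding setup (bucket edges, embedding size, and the projection step are illustrative assumptions, not the Colab's actual values):

```python
import torch
import torch.nn as nn

# Illustrative bucket edges and embedding size; not the values from the Colab.
BUCKET_EDGES = torch.tensor([110.0, 220.0, 440.0, 880.0])
EMBED_DIM = 16

embedding = nn.Embedding(num_embeddings=len(BUCKET_EDGES) + 1,
                         embedding_dim=EMBED_DIM)

def condition_vector(f0_hz: float) -> torch.Tensor:
    """Map one F0 value (Hz) to a conditioning vector via bucket + embedding."""
    bucket = torch.bucketize(torch.tensor([f0_hz]), BUCKET_EDGES)
    return embedding(bucket)  # shape (1, EMBED_DIM)

# The resulting vector is then broadcast over the timesteps it covers and
# added to each layer's pre-activation, e.g.
#   pre_act = dilated_conv(x) + cond_proj(condition_vector(f0))
```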
It feels awesome to get local conditioning working, even with a toy problem of basic frequency sequences. I'm hoping to figure out how to map linguistic features to a conditioning vector for TTS next.
It would also be interesting to try using conditioning for other kinds of audio problems, like mapping MIDI -> WAV. MIDI files contain discrete events as well as alignment/duration info, and the MAESTRO dataset has paired MIDI and WAV recordings. I think it would be straightforward to use MIDI events as conditioning for a wavenet to synthesize audio.
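As a rough sketch of that idea, one way to turn a MAESTRO MIDI file into a per-sample conditioning signal is via pretty_midi's piano roll; the frame rate, binarization, and repeat-upsampling here are assumptions, not an established recipe.

```python
import numpy as np
import pretty_midi

SAMPLE_RATE = 16000   # assumed audio rate
FRAME_RATE = 100      # conditioning frames per second (assumed)

# "example.mid" is a placeholder path for one MAESTRO MIDI file.
midi = pretty_midi.PrettyMIDI("example.mid")
roll = midi.get_piano_roll(fs=FRAME_RATE)   # (128 pitches, n_frames), velocities
roll = (roll > 0).astype(np.float32)        # binary note-on mask

# Repeat each frame so there is one 128-dim conditioning vector per audio sample.
per_sample = np.repeat(roll, SAMPLE_RATE // FRAME_RATE, axis=1)
```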