
Global condition and Local conditioning

Open thomasmurphycodes opened this issue 8 years ago • 68 comments

In the white paper, they mention conditioning on a particular speaker as an input that is conditioned globally, and the TTS component as up-sampled (deconvolution) features conditioned locally. For the latter, they also mention that they tried simply repeating the values, but found it worked less well than the deconvolutions.

Is there effort underway to implement either of these? Practically speaking, implementing the local conditioning would allow us to begin to have this implementation speak recognizable words.

thomasmurphycodes avatar Sep 29 '16 18:09 thomasmurphycodes
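
To make the two options concrete: the naive route repeats each conditioning vector until it matches the audio sample rate, while the paper's preferred route learns the upsampling with a transposed convolution. A minimal repeat-upsampling sketch (NumPy; shapes and names are illustrative, not from this repository):

```python
import numpy as np

def upsample_by_repetition(features, samples_per_frame):
    # features: [num_frames, channels] at a low frame rate (e.g. linguistic
    # features per frame); output: [num_frames * samples_per_frame, channels],
    # aligned with the raw audio samples.
    return np.repeat(features, samples_per_frame, axis=0)

frames = np.random.randn(10, 4)                  # 10 frames, 4 feature channels
conditioning = upsample_by_repetition(frames, 160)
print(conditioning.shape)                        # (1600, 4)
```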

Yeah, it's definitely a planned feature. I'll get to it eventually, but I'd also accept contributions if someone is interested. A solution to this should also integrate with the AudioReader interface.

ibab avatar Sep 29 '16 20:09 ibab

Is somebody working on this already?

Zeta36 avatar Oct 08 '16 06:10 Zeta36

I'm starting to work on it, I think I can get some basic implementation working over the next couple of days. Global part should be easy, and a dumb implementation (upsampling by repeating values) of local conditioning should be fast to implement as well.

This way, we can get to a stage where the net can produce some low-quality speech. Then we can work on improving the quality by adding more sophisticated upsampling methods.

The white paper also talks about using local conditioning features beyond just the text data, they do some preprocessing to compute phonetic features from the text. That would be nice to add later as well.

alexbeloi avatar Oct 08 '16 17:10 alexbeloi

I agree global will be easier; it should just be a one-hot vector representing the speaker. Am I wrong in thinking that local conditioning requires us to train on datasets that contain the phonetic data as a feature vector in addition to the waveform? What dataset are you thinking of using?

thomasmurphycodes avatar Oct 08 '16 18:10 thomasmurphycodes

I was thinking of just using the raw text from the corpus data for local conditioning to start: encode each character into a vector and upsample it (by repeats) to the number of samples in the audio file. Not ideal, but it's a start, and characters should be able to act as a really rough proxy for phonetic features.

Ideally, the raw text should be processed (perhaps via some other model) into a sequence of phonetic features and then that would be upsampled to the size of the audio sample.

alexbeloi avatar Oct 08 '16 18:10 alexbeloi
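
A minimal sketch of that character-as-proxy idea, with a made-up vocabulary and helper name (not code from any fork): each character id is one-hot encoded and then repeated out to the audio length.

```python
import numpy as np

VOCAB = "abcdefghijklmnopqrstuvwxyz ,.'"
char_to_id = {c: i for i, c in enumerate(VOCAB)}

def encode_text(text, num_audio_samples):
    # Map characters to ids (unknown characters share one extra id),
    # one-hot encode, then crudely repeat to match the audio length.
    ids = [char_to_id.get(c, len(VOCAB)) for c in text.lower()]
    one_hot = np.eye(len(VOCAB) + 1)[ids]             # [num_chars, vocab+1]
    reps = int(np.ceil(num_audio_samples / len(ids)))
    upsampled = np.repeat(one_hot, reps, axis=0)      # repeat each character
    return upsampled[:num_audio_samples]              # trim to audio length

cond = encode_text("hello world", num_audio_samples=16000)
print(cond.shape)   # (16000, 31)
```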

I mean, let's give it a shot and see what happens. Google Research has a bunch of papers on their page about using HMMs to map characters to phonemes, so we could look into a subproject where we try to implement that.

thomasmurphycodes avatar Oct 08 '16 18:10 thomasmurphycodes

@thomasmurphycodes Could you post the list of papers?

nakosung avatar Oct 09 '16 02:10 nakosung

Yeah, I will tomorrow when I'm in the office; they're on a box I have there.

thomasmurphycodes avatar Oct 09 '16 17:10 thomasmurphycodes

I've also thought about just plugging in the raw text, but I'm pretty sure we would need at least some kind of attention mechanism if we want it to work properly (i.e. some way for the network to figure out which parts of the text correspond to which sections of the waveform).

ibab avatar Oct 10 '16 11:10 ibab

I think that's the case for sure. They explicitly mention the convolution up-sampling (zero-padding) in the paper.

thomasmurphycodes avatar Oct 10 '16 17:10 thomasmurphycodes

In #92, HMM-aligned phonetic features are already provided. The upsampling/repeating-values step is for going from one feature vector per HMM frame to one feature vector per time-domain sample.

wuaalb avatar Oct 11 '16 07:10 wuaalb

Found Merlin online. Has anyone here used its training data, the CMU_ARCTIC datasets, as linguistic features to train the WaveNet?

rockyrmit avatar Oct 11 '16 21:10 rockyrmit

(From one of the WaveNet co-authors): the linguistic features they used were similar to those listed in this document: https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/F0parametrisation/hts_lab_format.pdf

AFAIK there is no publicly available large TTS speech database containing linguistic features :-( So the TTS research community (especially universities) often uses small ones.

One candidate is the CMU ARCTIC databases with the HTS demo. CMU ARCTIC has 4 US English speakers (about 1 hour per speaker) and is distributed with phoneme-level segmentations. The HTS demo shows how to extract other linguistic features (described in the above-mentioned document) from raw text using festival. If you have any TTS experts / PhD researchers around, they may be familiar with how to use festival / the HTS demo.

Let me know if anyone wants to start working on the linguistic features and local conditioning.

rockyrmit avatar Oct 13 '16 03:10 rockyrmit
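
For anyone who wants to experiment before a full festival/HTS pipeline is in place, here is a hedged sketch of reading phoneme-level timing from an HTS-style .lab file, assuming the common three-column start/end/full-context-label layout with times in 100 ns units; read_hts_labels is a hypothetical helper and recovers only phoneme identity and timing, not the full linguistic feature set described in the document above.

```python
def read_hts_labels(path):
    """Return (phoneme, start_sec, end_sec) tuples from an HTS-style .lab file."""
    entries = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue
            start, end, label = parts[0], parts[1], parts[2]
            # In full-context labels the current phoneme sits between '-' and '+'.
            if '-' in label and '+' in label:
                phoneme = label.split('-')[1].split('+')[0]
            else:
                phoneme = label
            entries.append((phoneme, int(start) * 1e-7, int(end) * 1e-7))
    return entries
```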

I think it is very important for this project not to die, so somebody should publish or share their implementation of local or global conditioning (even if it is unfinished). I'm afraid this project could get stuck in its current state if no one takes the next step.

I've done my best, but I'm afraid I have neither the equipment (no GPU) nor the knowledge to do much more than what I've already done.

Zeta36 avatar Oct 14 '16 09:10 Zeta36

@Zeta36, @ibab Apologies for the delays, the local/global conditioning has been taking a bit longer than expected.

I can push my progress to my fork by tonight. What I have right now runs, though for some reason training stalls at exactly iteration 116 (i.e. the process will not continue to the next iteration, despite the default num_steps = 4000).

One of the main time sinks is that it takes a long time to train and then generate wav files to check if the conditioning is doing anything at all. No real way around that.

alexbeloi avatar Oct 14 '16 16:10 alexbeloi

Possibly a memory overhead issue? Or is it converging?

thomasmurphycodes avatar Oct 14 '16 16:10 thomasmurphycodes

I figured out the issue with that; it was related to the file reader and queue. I created a second queue for the text files and was dequeuing text/audio together, but they became mismatched over time because of the audio slicing.

alexbeloi avatar Oct 14 '16 17:10 alexbeloi

@alexbeloi

they became mismatched over time because of the audio slicing.

That sounds good. I'm glad you stumbled over that tripwire before I got to it :P.

I fear we may have duplicated some effort, but you are ahead of me. I hadn't got to the audio reader part yet. I've spent most of the time building out model_test.py so that we can test training and "speaker id"-conditioned generation. So perhaps we can combine your global conditioning with my test, or pick the better parts of both.

Have you by any chance incorporated speaker shuffling in your audio reader changes? I think we're going to need that, so you might keep it in mind as you write that code, if not implement it in the first PR.

jyegerlehner avatar Oct 14 '16 18:10 jyegerlehner

@jyegerlehner

The shuffling has been in the back of my mind. I haven't worked on it yet, definitely needs to get implemented at some point for the data to be closer to IID.

@ibab and all: I've caught my changes up with upstream/master and pushed them to my fork. So far I have the model and training parts done for both global and local conditioning, but not the generation. I haven't been able to verify that the conditioning is working since I haven't gotten the generation working yet.

I want to clean it up more and modularize the embedding/upsampling before making a PR but if anyone wants to hack away at it in parallel, feel free.

https://github.com/alexbeloi/tensorflow-wavenet

Running the following will train the model with global conditioning on the speaker_id from the VCTK corpus data and local conditioning from the corresponding text data: python train.py --vctk

The way I've implemented it, the conditioning gets applied to each dilation layer (not just the initial one), it's not clear to me from the paper if that's the intended method.

alexbeloi avatar Oct 14 '16 23:10 alexbeloi
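
For reference, here is a rough TF 1.x-style sketch of conditioning applied inside each dilation layer, following the paper's gated activation z = tanh(W_f*x + V_f*h) * sigmoid(W_g*x + V_g*h); the variable names, shapes and the 'SAME' padding are simplifying assumptions, not the fork's actual code.

```python
import tensorflow as tf

def gated_layer(x, h, dilation, dilation_channels=32):
    # x: [batch, time, residual_channels] -- layer input
    # h: [batch, time, cond_channels]     -- condition, already tiled/upsampled
    # (In practice each layer would get its own tf.variable_scope.)
    in_ch = x.shape[-1].value
    cond_ch = h.shape[-1].value
    w_f = tf.get_variable('w_filter', [2, in_ch, dilation_channels])
    w_g = tf.get_variable('w_gate', [2, in_ch, dilation_channels])
    v_f = tf.get_variable('v_filter', [1, cond_ch, dilation_channels])
    v_g = tf.get_variable('v_gate', [1, cond_ch, dilation_channels])

    # 'SAME' padding keeps the sketch short; the real model uses causal convolutions.
    conv_f = tf.nn.convolution(x, w_f, padding='SAME', dilation_rate=[dilation])
    conv_g = tf.nn.convolution(x, w_g, padding='SAME', dilation_rate=[dilation])
    # The condition enters every layer through a plain 1x1 convolution (no dilation).
    cond_f = tf.nn.conv1d(h, v_f, stride=1, padding='SAME')
    cond_g = tf.nn.conv1d(h, v_g, stride=1, padding='SAME')
    return tf.tanh(conv_f + cond_f) * tf.sigmoid(conv_g + cond_g)
```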

@alexbeloi I'm contemplating working from your branch and adding my test on top of it. Looking at your branch, I notice a few things:

https://github.com/alexbeloi/tensorflow-wavenet/blob/master/wavenet/model.py#L560

Here I was using tf.nn.embedding_lookup, not tf.one_hot, to go from the integer that specifies the speaker_id to a condition vector.

Compactness: I think one problem with using tf.one_hot instead of tf.embedding_lookup is its effect on the size of the 'gcond_filter' and 'gcond_gate' parameter tensors. These occur in every dilation layer, and the size of each is global_condition_channels x dilation channels. When using tf.one_hot, global_condition_channels equals the number of mutually exclusive categories, whereas with tf.embedding_lookup, global_condition_channels specifies the embedding size and can be chosen independently of the number of mutually exclusive categories. This might be a size-16 or size-32 embedding, as opposed to a size-109 vector (to cover the speakers in the VCTK corpus).

Generality: one might wish to do global conditioning where there isn't an enumeration of mutually exclusive categories upon which one is conditioning. Your approach works fine when there are only 109 speakers in the VCTK corpus, but what if one wishes to condition upon some embedding vector produced by, say, seq2seq, or a context stack (2.2 in the paper)? I don't think the number of possible character sequences that correspond to valid sentences in a language could feasibly be enumerated, but you can produce a dense embedding vector of fixed size (say, 1000) that represents any sentence. The h in the equation at the bottom of page 4 of the paper can be any vector you want to condition on, but with tf.one_hot it can only be an input to the WaveNetModel as an integer enumerating all possible values.

Local conditioning: separate PR? I think it's usually good practice to break up large changes into smaller ones, so as not to try to "eat the elephant" all in one sitting. Global and local conditioning are each a complicated enough change that I think they are better in separate PRs. I'd suggest putting them in their own named branches rather than your master.

Local conditioning: hard-wired to strings. https://github.com/alexbeloi/tensorflow-wavenet/blob/master/wavenet/model.py#L566

I'm guessing your use of tf.string_to_hash_bucket_fast() is intended to process linguistic features (which come as strings? I don't really know). But the paper also mentions local conditioning for context stacks (section 2.6), which will not be strings but a dense embedding vector y, as in the equation at the top of page 5.

Local conditioning: upsampling/deconvolution. Your tf.image.resize_images does, I think, what they said doesn't work as well (page 5, last paragraph of 2.5). I think this needs to be a strided transposed convolution (a.k.a. deconvolution).

So in short, I think what I'm proposing is that global_condition vector h and local_condition vector y come into the WaveNetModel class as dense vectors of any size from any source, and that any encoding (e.g. tf.one_hot or tf.nn.embedding_lookup) be done outside the WaveNetModel. Then, when we're working with VCTK we can do one_hot or embedding_lookup to produce global_condition, but when we're dealing with other things that produce a dense vector we can accommodate that too.

I think the approach you are taking works as long as all we care about is the VCTK corpus (or a few music genres) without context stacks. But context stacks are definitely on my roadmap, so I'd prefer not to see local conditioning hard-wired to strings.

Maybe the wider community is happy with your approach and if so perhaps they can speak up.

BTW these are my initial thoughts; I often miss things and am very persuadable.

jyegerlehner avatar Oct 16 '16 06:10 jyegerlehner
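
To make the embedding and upsampling points concrete, here is a sketch along those lines (TF 1.x style; the sizes, names and placeholder shapes are illustrative assumptions rather than a concrete API proposal): a dense speaker embedding via tf.nn.embedding_lookup, plus learned upsampling of local-condition features with a strided transposed convolution instead of tf.image.resize_images.

```python
import tensorflow as tf

NUM_SPEAKERS = 109     # VCTK
EMBEDDING_SIZE = 32    # chosen independently of NUM_SPEAKERS

# Dense global condition from an integer speaker id.
speaker_id = tf.placeholder(tf.int32, [None])                     # [batch]
embedding_table = tf.get_variable('speaker_embedding',
                                  [NUM_SPEAKERS, EMBEDDING_SIZE])
global_condition = tf.nn.embedding_lookup(embedding_table, speaker_id)

# Learned upsampling of local-condition features by an integer factor `stride`,
# treating time as the width axis of a 2-D transposed convolution.
local_features = tf.placeholder(tf.float32, [None, None, 64])     # [b, frames, ch]
stride = 80
lc = tf.expand_dims(local_features, axis=1)                       # [b, 1, frames, ch]
filt = tf.get_variable('upsample_filter', [1, stride, 64, 64])    # [h, w, out_ch, in_ch]
batch = tf.shape(lc)[0]
out_width = tf.shape(lc)[2] * stride
upsampled = tf.nn.conv2d_transpose(lc, filt,
                                   output_shape=tf.stack([batch, 1, out_width, 64]),
                                   strides=[1, 1, stride, 1])
local_condition = tf.squeeze(upsampled, axis=1)                   # [b, frames*stride, ch]
```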

@alexbeloi you are doing a great job!!

I have replicated my text WaveNet implementation (#117), but using your model modifications for global and local conditioning. After training the model on texts in Spanish and English (with ID = 1 for the Spanish texts and ID = 2 for the English ones), I was able to generate text in either language independently by setting the parameter --speaker_id to 1 or 2!!

This means that your global conditioning is working perfectly!!

Keep working on it!!

I would like to mention one thing about your code. In the AudioReader, when we iterate after reading the audio and cut it into buffers of self.sample_size, the ID and the text sometimes start to get mixed up.

Imagine, for example, that we read from a folder with 5 wav files and that load_vctk_audio() returns a tuple with the raw audio data, the ID of the speaker, and the plain text. If we set self.sample_size to None, then everything works fine because we'll feed sample_placeholder, id_placeholder and text_placeholder correctly (the whole raw audio is fed into the sample placeholder at once). But, and this is important, if we set a sample_size, then the audio is going to be cut, and in some cases the ID and the text start to get mixed up and the placeholders start to be fed incorrectly: for example, a sample placeholder gets fed raw data from two different wav files while the ID and text are wrong.

I had this problem with my text experiment, where at some moments the sample placeholder contained both Spanish and English text at the same time.

Zeta36 avatar Oct 16 '16 09:10 Zeta36

@jyegerlehner Thanks for the feedback, I agree with everything you've pointed out. My plan was to do hacky, VCTK-specific embeddings, get the math right, and then go back and replace them with more generic embeddings/upsampling.

@Zeta36 Thanks for verifying that some of these things work! I'll have to look at what you say regarding sample_size. I thought that, the way I had it, the same global condition was queued for each piece that is sliced and queued from the sample.

alexbeloi avatar Oct 16 '16 17:10 alexbeloi

@alexbeloi, imagine you have 5 wav files, each with a different size. In load_vctk_audio() you yield the raw audio vector, the speaker id and the text the wav is saying. If you fill the sample holder, the id holder and the text holder in one go (self.sample_size equal to None), everything is correct. But if you set a sample_size to cut the audio into pieces, you have this problem:

  1. We have 5 raw audio files and we start the first iteration with buffer_ clean. We append the first raw audio vector to the buffer and cut the first piece of buffer_ with a certain sample size, after which we feed the three holders. We then repeat: cut another piece of sample_size and feed again.

  2. We repeat this process while len(buffer_) > self.sample_size, so when cutting a piece leaves len(buffer_) less than or equal to self.sample_size, we ignore this last piece (this is the real problem) and restart the loop with a new raw audio file, a new speaker id and a new text. But now buffer_ is NOT clean as in the first loop; it still contains the remaining piece of the previous raw audio.

In other words, when we start cutting an audio vector, the last piece will be ignored and will stay in buffer_ for the next iteration. This is not a major problem when we are working without conditioning, as until now, but it cannot stay this way with conditioning, because in the second iteration you begin to mix raw audio data from different speakers and texts.

A quick solution would be simply to clean buffer_ at the beginning of each iteration, in the line right after for audio, extra in iterator:, using buffer_ = np.array([]).

That would work, but it would ignore the last piece of every audio file, which may not be a good idea.

Regards, Samu.

Zeta36 avatar Oct 16 '16 18:10 Zeta36

@Zeta36 Ah, I see now.

If we don't want to drop the tail piece of audio, we can pad it with silence, queue it, and have the buffer cleaned as you suggest. Or the choice (between dropping and padding) could be determined by whether silence_threshold is set or not.

alexbeloi avatar Oct 16 '16 21:10 alexbeloi
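
A small sketch of the two tail-handling options discussed here (drop vs. pad with silence); slice_audio is a hypothetical helper whose names mirror the discussion rather than the actual reader code.

```python
import numpy as np

def slice_audio(audio, sample_size, pad_tail=True):
    pieces = []
    buffer_ = np.array(audio, dtype=np.float32)
    while len(buffer_) > sample_size:
        pieces.append(buffer_[:sample_size])
        buffer_ = buffer_[sample_size:]
    if len(buffer_) > 0:
        if pad_tail:
            # Pad the leftover tail with silence so it can still be queued
            # with the correct speaker id / text.
            padded = np.pad(buffer_, (0, sample_size - len(buffer_)),
                            mode='constant')
            pieces.append(padded)
        # else: drop the tail so the next file starts with a clean buffer_
    return pieces

pieces = slice_audio(np.random.randn(100000), sample_size=16000)
print(len(pieces), pieces[-1].shape)   # 7 (16000,)
```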

@alexbeloi Good job! (1) I notice your code for applying local_condition: conv_filter = conv_filter + causal_conv(local_condition, weights_lcond_filter, dilation). I think the local_condition doesn't need dilation; it is just a 1x1 convolution, so it doesn't need causal_conv, and a plain conv1d is OK. What was your consideration here? (2)

The way I've implemented it, the conditioning gets applied to each dilation layer (not just the initial one), it's not clear to me from the paper if that's the intended method.

I think what you've done is the intended method. V_{g,k} * y means that every layer (k is the layer index) has separate weights.

sonach avatar Oct 17 '16 03:10 sonach

Exciting thread here!

fwiw: I re-recorded one of the entries from VCTK in my own voice and got decent babble results (https://soundcloud.com/paperkettle/wavenet-babble-test-trained-a-neural-network-to-speak-with-my-voice). I used the VCTK txt and recording style with the intention of later training on the full corpus + my voice in the mix. I'm planning to do more recordings, and I'd be happy to do them in a way that helps generate data with linguistic features (by adding markup myself, and/or reading passages designed with them in mind, etc.). I might be able to find some others to help with this as well. Let me know if any of this would be useful!

chrisnovello avatar Oct 17 '16 06:10 chrisnovello

That's a great idea, Chris. I wonder if we could create an expanded multi-speaker set on the VCTK text within this project.

thomasmurphycodes avatar Oct 17 '16 11:10 thomasmurphycodes

@alexbeloi Hi, I used your code to train on VCTK, but when I tried to generate a wav file I got an error. This is how I ran generate.py: python generate.py --wav_out_path=out.wav --speaker_id=2 --speaker_text='hello world' --samples=16000 --logdir=./logdir/train/2016-10-18T12-35-15 ./logdir/train/2016-10-18T12-35-15/model.ckpt-2000

And I got the error: Shape must be rank 2 but is rank 3 for 'wavenet_1/dilated_stack/layer0/MatMul_6' (op: 'MatMul') with input shapes: [?,?,32], [32,32].

Did I miss something? Thank you.

linVdcd avatar Oct 18 '16 08:10 linVdcd
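
For anyone hitting the same message: that rank error is the generic symptom of passing a [batch, time, channels] tensor to tf.matmul against a 2-D weight matrix. A common workaround (not necessarily the fix that will land in the fork) is to flatten the time axis first; this is a generic sketch, not the fork's code.

```python
import tensorflow as tf

def time_distributed_matmul(x, w):
    # x: [batch, time, in_ch], w: [in_ch, out_ch] -> [batch, time, out_ch]
    dyn = tf.shape(x)
    flat = tf.reshape(x, [-1, x.shape[-1].value])   # [batch*time, in_ch]
    out = tf.matmul(flat, w)                        # [batch*time, out_ch]
    return tf.reshape(out, tf.stack([dyn[0], dyn[1], tf.shape(w)[1]]))
```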

@lin5547 Hi, thanks for testing things. You haven't missed anything; the generation part is still a work in progress, unfortunately. I'm looking to have things working by the end of the week.

@sonach You're right, the paper says this should be just a 1x1 conv; I'll make the change.

alexbeloi avatar Oct 18 '16 17:10 alexbeloi

@alexbeloi

The way I've implemented it, the conditioning gets applied to each dilation layer (not just the initial one), it's not clear to me from the paper if that's the intended method.

I discussed this with an ASR expert. In speaker adaptation applications, the speaker ID vector is applied to every layer instead of only the first layer, so your implementation should be OK :)

sonach avatar Oct 20 '16 09:10 sonach